= Escaping HTML =
The {{{cgi}}} module that comes with Python has an {{{escape()}}} function:
{{{#!python
import cgi
s = cgi.escape( """& < >""" ) # s = "& < >"
}}}
However, it doesn't escape characters beyond {{{&}}}, {{{<}}}, and {{{>}}}. If it is used as {{{cgi.escape(string_to_escape, quote=True)}}}, it also escapes {{{"}}}.
Recent Python 3.2 have [[https://docs.python.org/3/library/html.html| html module]] with {{{html.escape()}}} and {{{html.unescape()}}} functions. {{{html.escape()}}} differs from {{{cgi.escape()}}} by its defaults to {{{quote=True}}}:
{{{#!python
import html
s = html.escape( """& < " ' >""" ) # s = '& < " ' >'
}}}
Here's a small snippet that will let you escape quotes and apostrophes as well:
{{{#!python
html_escape_table = {
"&": "&",
'"': """,
"'": "'",
">": ">",
"<": "<",
}
def html_escape(text):
"""Produce entities within text."""
return "".join(html_escape_table.get(c,c) for c in text)
}}}
You can also use {{{escape()}}} from {{{xml.sax.saxutils}}} to escape html. This function should execute faster. The {{{unescape()}}} function of the same module can be passed the same arguments to decode a string.
{{{#!python
from xml.sax.saxutils import escape, unescape
# escape() and unescape() takes care of &, < and >.
html_escape_table = {
'"': """,
"'": "'"
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}
def html_escape(text):
return escape(text, html_escape_table)
def html_unescape(text):
return unescape(text, html_unescape_table)
}}}
== Unescaping HTML ==
Undoing the escaping performed by {{{cgi.escape()}}} isn't directly supported by the library. This can be accomplished using a fairly simple function, however:
{{{#!python
def unescape(s):
s = s.replace("<", "<")
s = s.replace(">", ">")
# this has to be last:
s = s.replace("&", "&")
return s
}}}
or alternatively (before [[http://bugs.python.org/issue2927|issue2927]]):
{{{
>>> from HTMLParser import HTMLParser
>>> HTMLParser.unescape.__func__(HTMLParser, 'ss©')
u'ss\xa9'
}}}
Note that this will undo exactly what {{{cgi.escape()}}} does; it's easy to extend this to undo what the {{{html_escape()}}} function above does. Note the comment that converting the {{{&}}} must be last; this avoids getting strings like {{{"<"}}} wrong.
This approach is simple and fairly efficient, but is limited to supporting the entities given in the list. A more thorough approach would be to perform the same processing as an HTML parser. Using the HTML parser from the standard library is a little more expensive, but many more entity replacements are supported "out of the box." The table of entities which are supported can be found in the {{{htmlentitydefs}}} module from the library; this is not normally used directly, but the {{{htmllib}}} module uses it to support most common entities. It can be used very easily:
{{{#!python
import htmllib
def unescape(s):
p = htmllib.HTMLParser(None)
p.save_bgn()
p.feed(s)
return p.save_end()
}}}
This version has the additional advantage that it supports character references (things like {{{A}}}) as well as entity references.
A more efficient implementation would simply parse the string for entity and character references directly (and would be a good candidate for the library, if there's really a need for it outside of HTML data).
== Formal htmlentitydefs ==
Yet another approach available with recent Python takes advantage of htmlentitydefs:
{{{
import re
from htmlentitydefs import name2codepoint
def htmlentitydecode(s):
return re.sub('&(%s);' % '|'.join(name2codepoint),
lambda m: unichr(name2codepoint[m.group(1)]), s)
}}}
== Builtin HTML/XML escaping via ASCII encoding ==
A very easy way to transform non-ASCII characters like German umlauts or letters with accents into their HTML equivalents is simply encoding them from unicode to ASCII and use the {{{xmlcharrefreplace}}} encoding error handling:
{{{
>>> a = u"äöüßáà"
>>> a.encode('ascii', 'xmlcharrefreplace')
'äöüßáà'
}}}
Note, that this does only transform ''non''-ASCII characters and therefore leaves {{{<}}}, {{{>}}}, {{{?}}} as they are. However, you can combine this technique with the {{{cgi.escape}}}.
== See Also ==
XML entities are different from, if related to, HTML entities. This page hints at the details:
* [[EscapingXml|EscapingXml]]
John J. Lee discusses still more refinements in implementation in [[http://groups.google.com/group/comp.lang.python/msg/ce3fc3330cbbac0a|this comp.lang.python follow-up]].