Differences between revisions 20 and 21

Client-Side Web Programming

Libraries

utidylib and mxTidy -- Python interfaces to the HTML Tidy library for cleaning up HTML documents.
html5lib -- an HTML5-compliant library for parsing arbitrarily broken HTML into a range of tree formats, including minidom, ElementTree (including lxml), and BeautifulSoup.
BeautifulSoup -- a permissive HTML parser.
Don't use HTMLParser on HTML that might be invalid! That way lies pain. Either clean it up (using tidy), or use a different parser.
urllib, urllib2, and httplib in the standard library.
ClientCookie, ClientForm, and Mechanize are higher-level libraries for writing a web client.
mechanoid -- a mechanize fork.
libxml2dom can parse HTML by employing libxml2's liberal HTML parser.

Resources

Grab a document from the web - from the Python Cookbook
Python web-client programming general FAQs.
urllib -- Open arbitrary resources by URL
urllib2 -- extensible library for opening URLs
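
As a rough illustration of the "grab a document from the web" idea, here is a minimal sketch using only the standard library. It assumes Python 3, where urllib2 was merged into urllib.request; on the Python 2 versions this page was written for, the call would be urllib2.urlopen instead. The function name fetch, the 10-second default timeout, and the UTF-8 fallback are illustrative choices, not taken from any of the resources listed above.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the body of `url` as text.

    Hypothetical helper for illustration only. The charset comes from
    the Content-Type header when the server sends one; otherwise UTF-8
    is assumed, and undecodable bytes are replaced, not raised.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch with an http:// or https:// address returns the page source as a string, which can then be handed to one of the permissive parsers listed under Libraries. Network failures surface as urllib.error.URLError, which a real client would probably want to catch.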

-  ⇤ ← Revision 20 as of 2008-10-12 09:46:22 → 
  Size: 1841
  Editor: 75-164-152-250
  Comment: Fixed some no longer valid url's
+   ← Revision 21 as of 2008-11-15 14:00:47 → ⇥
  Size: 1877
  Editor: localhost
  Comment: converted to 1.6 markup
-Deletions are marked like this.
+Additions are marked like this.
 Line 5:
- * [https://utidylib.berlios.de/ utidylib] and [https://www.egenix.com/files/python/mxTidy.html mxTidy] -- Python interfaces to [https://tidy.sourceforge.net/ html tidy] library to clean up HTML documents.
 * [https://code.google.com/p/html5lib html5lib] A HTML5-compliant library for parsing arbitarily-broken HTML to a range of tree formats including minidom, elementtree (including lxml) and BeautifulSoup
 * [https://www.crummy.com/software/BeautifulSoup/ BeautifulSoup] -- a permissive HTML parser.
 * Don't use [https://python.domainunion.de/doc/current/lib/module-HTMLParser.html HTMLParser] on HTML that might be invalid!  That way lies pain.  Either clean it up (using tidy), or use a different parser.
 * [https://docs.python.domainunion.de/library/urllib.html urllib], [https://docs.python.domainunion.de/library/urllib2.html urllib2], and [https://docs.python.domainunion.de/library/httplib.html httplib] in the standard library.
 * [https://wwwsearch.sourceforge.net/ClientCookie/ ClientCookie], [https://wwwsearch.sourceforge.net/ClientForm/ ClientForm], and [https://wwwsearch.sourceforge.net/mechanize/ Mechanize] are higher-level libraries for writing a web client.
 * [https://python.domainunion.de/pypi?:action=display&name=mechanoid&version=0.4.1 mechanoid] a mechanize fork.
 * [https://python.domainunion.de/pypi/libxml2dom libxml2dom] can parse HTML by employing libxml2's liberal HTML parser.
+ * [[https://utidylib.berlios.de/|utidylib]] and [[https://www.egenix.com/files/python/mxTidy.html|mxTidy]] -- Python interfaces to [[https://tidy.sourceforge.net/|html tidy]] library to clean up HTML documents.
 * [[https://code.google.com/p/html5lib|html5lib]] A HTML5-compliant library for parsing arbitarily-broken HTML to a range of tree formats including minidom, elementtree (including lxml) and BeautifulSoup
 * [[https://www.crummy.com/software/BeautifulSoup/|BeautifulSoup]] -- a permissive HTML parser.
 * Don't use [[https://python.domainunion.de/doc/current/lib/module-HTMLParser.html|HTMLParser]] on HTML that might be invalid!  That way lies pain.  Either clean it up (using tidy), or use a different parser.
 * [[https://docs.python.domainunion.de/library/urllib.html|urllib]], [[https://docs.python.domainunion.de/library/urllib2.html|urllib2]], and [[https://docs.python.domainunion.de/library/httplib.html|httplib]] in the standard library.
 * [[https://wwwsearch.sourceforge.net/ClientCookie/|ClientCookie]], [[https://wwwsearch.sourceforge.net/ClientForm/|ClientForm]], and [[https://wwwsearch.sourceforge.net/mechanize/|Mechanize]] are higher-level libraries for writing a web client.
 * [[https://python.domainunion.de/pypi?:action=display&name=mechanoid&version=0.4.1|mechanoid]] a mechanize fork.
 * [[https://python.domainunion.de/pypi/libxml2dom|libxml2dom]] can parse HTML by employing libxml2's liberal HTML parser.
 Line 16:
- * [https://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52199 Grab a document from the web] - from the Python Cookbook
 * [https://wwwsearch.sourceforge.net/bits/clientx.html Python web-client programming general FAQs].
 * [https://docs.python.domainunion.de/library/urllib.html urllib -- Open arbitrary resources by URL]
 * [https://docs.python.domainunion.de/library/urllib2.html urllib2 -- extensible library for opening URLs]
+ * [[https://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52199|Grab a document from the web]] - from the Python Cookbook
 * [[https://wwwsearch.sourceforge.net/bits/clientx.html|Python web-client programming general FAQs]].
 * [[https://docs.python.domainunion.de/library/urllib.html|urllib -- Open arbitrary resources by URL]]
 * [[https://docs.python.domainunion.de/library/urllib2.html|urllib2 -- extensible library for opening URLs]]
