1877
Comment: Add html5lib to libraries section
|
1841
Comment: Fixed some URLs that were no longer valid
|
Client-Side Web Programming
Libraries
[https://utidylib.berlios.de/ utidylib] and [https://www.egenix.com/files/python/mxTidy.html mxTidy] -- Python interfaces to the [https://tidy.sourceforge.net/ HTML Tidy] library for cleaning up HTML documents.
[https://code.google.com/p/html5lib html5lib] -- an HTML5-compliant library for parsing arbitrarily broken HTML into a range of tree formats, including minidom, ElementTree (including lxml), and BeautifulSoup.
[https://www.crummy.com/software/BeautifulSoup/ BeautifulSoup] -- a permissive HTML parser.
Don't use [https://python.domainunion.de/doc/current/lib/module-HTMLParser.html HTMLParser] on HTML that might be invalid! That way lies pain. Either clean it up (using tidy), or use a different parser.
[https://docs.python.domainunion.de/library/urllib.html urllib], [https://docs.python.domainunion.de/library/urllib2.html urllib2], and [https://docs.python.domainunion.de/library/httplib.html httplib] in the standard library.
[https://wwwsearch.sourceforge.net/ClientCookie/ ClientCookie], [https://wwwsearch.sourceforge.net/ClientForm/ ClientForm], and [https://wwwsearch.sourceforge.net/mechanize/ Mechanize] are higher-level libraries for writing a web client.
[https://python.domainunion.de/pypi?:action=display&name=mechanoid&version=0.4.1 mechanoid] -- a fork of mechanize.
[https://python.domainunion.de/pypi/libxml2dom libxml2dom] can parse HTML by employing libxml2's liberal HTML parser.
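
As a rough illustration of the standard-library route listed above, here is a minimal sketch of fetching a document and decoding it to text. It assumes Python 3, where urllib2's functionality lives in urllib.request (and httplib is named http.client). The function name fetch_text, the User-Agent string, and the timeout value are hypothetical choices made for this example, not anything the libraries prescribe.

```python
import urllib.request


def fetch_text(url, user_agent="example-client/0.1", timeout=10):
    """Fetch `url` and return the response body decoded to str.

    `user_agent` and `timeout` are illustrative defaults (assumptions).
    """
    # A Request object lets us attach headers before opening the URL.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared in Content-Type if there is one,
        # otherwise fall back to UTF-8 (an assumption, not a guarantee).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_text with an http or https URL returns the page as a string, which can then be handed to one of the permissive parsers above rather than to a strict one.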
Resources
[https://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52199 Grab a document from the web] -- from the Python Cookbook.
[https://wwwsearch.sourceforge.net/bits/clientx.html Python web-client programming general FAQs].
[https://docs.python.domainunion.de/library/urllib.html urllib -- Open arbitrary resources by URL]
[https://docs.python.domainunion.de/library/urllib2.html urllib2 -- extensible library for opening URLs]
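
The "extensible" in the urllib2 title refers to its design of openers built from pluggable handlers. The sketch below shows one way this might look in Python 3, where the same design is available as urllib.request and the cookie support that grew out of ClientCookie is available as http.cookiejar. The helper name make_cookie_opener is hypothetical and exists only for this example.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar): an opener that stores cookies in `jar`.

    Hypothetical helper for illustration only.
    """
    jar = http.cookiejar.CookieJar()
    # build_opener() chains the given handler with the default handlers,
    # so the opener still handles redirects, errors, and so on.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


opener, jar = make_cookie_opener()
# opener.open(url) is used like urlopen(url); cookies set by HTTP
# responses are kept in `jar` and sent back on later requests.
```

Other handlers, such as proxy or authentication handlers, can be passed to build_opener in the same way, which is roughly the job that the higher-level client libraries take over for you.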