#pragma section-numbers off

=== Description ===
A WWW crawler (robot) program written in Python.

=== Information ===
[[http://freecode.com/projects/harvestman|"Freecode Project Page"]]

[[http://harvestman.freezope.org/|"HarvestMan Home Page"]] (link gone)

 version:: 1.4
 licence:: GNU GPL
 Python versions:: 2.2, 2.3, 2.4
 Platforms:: Any platform supported by Python
 Binaries:: None

=== How it spins its web ===
HarvestMan uses Python threads to download web sites from the internet quickly while remaining highly customizable. It can also be used to download files from intranet servers. It is the first multithreaded, open-source web crawler written in Python.

=== Features ===
 * Fully multithreaded
 * Number of threads configurable by the user
 * Support for the robots exclusion protocol
 * Filtering of URLs using regular expressions
 * Filtering of server names using regular expressions
 * Control the download by specifying the fetch depth
 * Limit the number of files downloaded
 * Specify a timeout for individual threads
 * Control download speed by changing the thread/depth options
 * HTTP/FTP/HTTPS support, and support for servers on a LAN
 * XML project files which can be re-read
 * Smart reconnection
 * Support for proxies/firewalls
 * File limits, server limits
 * Projects browser page
 * Command line/config file support
 * Use as a program or as a web-spider module
 * OO architecture

=== Who should use it ===
HarvestMan is written for the desktop user. It can also be used as an internet spidering module. An API for external users is being written.

=== Taxonomy ===
 Species:: HarvestMan
 Genus:: (Internet) Spiders

=== Developers ===
Anand B Pillai,
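
=== Example sketch ===
The feature list above mentions a configurable number of threads, regular-expression URL filtering and a fetch-depth limit. The following is a minimal sketch of how those three ideas can fit together. It is not HarvestMan's code or API: `crawl`, `fetch_links`, `FAKE_WEB` and `fake_fetch` are hypothetical names invented for illustration, the "web" is an in-memory dictionary so nothing touches the network, and it uses `concurrent.futures` from modern Python 3 rather than the Python 2.2 to 2.4 versions listed above.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def crawl(seed, fetch_links, max_depth=2, num_threads=4, exclude=None):
    """Breadth-first crawl starting at seed; returns {url: depth}.

    fetch_links is a caller-supplied (hypothetical) function that takes
    a URL and returns the URLs linked from that page.
    """
    exclude_re = re.compile(exclude) if exclude else None
    seen = {seed: 0}      # every accepted URL and the depth it was found at
    frontier = [seed]     # URLs to fetch at the current depth
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for depth in range(max_depth):
            if not frontier:
                break
            # Fetch all pages of this depth level in parallel.
            pages = list(pool.map(fetch_links, frontier))
            next_frontier = []
            # Filtering and bookkeeping run in the calling thread only,
            # so the shared dictionary needs no lock.
            for links in pages:
                for link in links:
                    if link in seen:
                        continue
                    if exclude_re and exclude_re.search(link):
                        continue
                    seen[link] = depth + 1
                    next_frontier.append(link)
            frontier = next_frontier
    return seen


# A tiny in-memory "web" standing in for real HTTP fetches (assumption).
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b.jpg"],
    "http://example.com/a": ["http://example.com/c", "http://example.com/"],
    "http://example.com/c": ["http://example.com/d"],
}


def fake_fetch(url):
    """Return the links of a page in FAKE_WEB, or no links if unknown."""
    return FAKE_WEB.get(url, [])


if __name__ == "__main__":
    found = crawl("http://example.com/", fake_fetch,
                  max_depth=2, num_threads=4, exclude=r"\.jpg$")
    for url, depth in sorted(found.items()):
        print(depth, url)
```

Passing the fetch function in as a parameter keeps the sketch testable without network access; a real crawler would also need the robots exclusion check, timeouts and reconnection handling that the feature list describes.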