The Google crawler implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component. A crawler that discovers an URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.

Cho and Garcia-Molina’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning an URL to a C-proc based on a hash of the entire URL), site-based (assigning an URL to a C-proc based on a hash of the URL’s host part), and hierarchical (assigning an URL to a C-proc based on some property of the URL, such as its top-level domain); a sketch of these three schemes is given at the end of this section.

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers an URL for which it is not responsible sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates failover to other crawling processes.

Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler components are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk, and caching popular URLs in memory is a win: caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set (a sketch of such a cached duplicate test is also given at the end of this section).

Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler [4], and Cho and Garcia-Molina’s crawler [13], are comprised of cooperating crawling processes.
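To make the partitioning schemes discussed above concrete, the following Python sketch shows one way a URL could be assigned to one of n crawler processes under the URL-based, site-based, and hierarchical schemes. The function names and the choice of MD5 are illustrative assumptions, not details of Mercator or of Cho and Garcia-Molina’s C-procs; the site-based variant corresponds to Mercator’s policy of hashing the URL’s host component.

    # Sketch of the three URL-partitioning schemes; names are illustrative only.
    import hashlib
    from urllib.parse import urlsplit

    def _stable_hash(s: str) -> int:
        # Use a hash that is stable across processes (unlike Python's built-in hash()).
        return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")

    def url_based(url: str, num_crawlers: int) -> int:
        # Assign based on a hash of the entire URL.
        return _stable_hash(url) % num_crawlers

    def site_based(url: str, num_crawlers: int) -> int:
        # Assign based on a hash of the URL's host component only, so all URLs
        # on one web server go to the same crawler process.
        return _stable_hash(urlsplit(url).hostname or "") % num_crawlers

    def hierarchical(url: str, num_crawlers: int) -> int:
        # Assign based on some property of the URL, here its top-level domain.
        host = urlsplit(url).hostname or ""
        tld = host.rsplit(".", 1)[-1]
        return _stable_hash(tld) % num_crawlers

    if __name__ == "__main__":
        for u in ["http://example.com/a.html", "http://example.com/b.html",
                  "http://example.org/c.html"]:
            print(u, url_based(u, 4), site_based(u, 4), hierarchical(u, 4))

Note that under the site-based scheme both example.com URLs map to the same process, which is also what lets a crawler like Mercator enforce per-server politeness locally and batch cross-process URL transfers by destination.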
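The claim that caching popular URLs is a win can likewise be illustrated with a minimal sketch of a duplicate-URL test that keeps a bounded in-memory cache in front of the disk-resident set of seen URLs. The class and method names are hypothetical, and for brevity the “disk” layer is simulated by an in-memory set; a real crawler would use an on-disk structure and a carefully tuned eviction policy.

    # Minimal sketch of a seen-URL test with an in-memory LRU cache in front of
    # a disk-based set. Names are illustrative; the "disk" is simulated here.
    from collections import OrderedDict

    class SeenURLSet:
        def __init__(self, cache_size: int = 100_000):
            self._cache = OrderedDict()   # LRU cache of recently seen URLs
            self._cache_size = cache_size
            self._disk = set()            # stand-in for the disk-based set
            self.disk_lookups = 0         # how often the "disk" had to be consulted

        def _cache_insert(self, url: str) -> None:
            self._cache[url] = True
            self._cache.move_to_end(url)
            if len(self._cache) > self._cache_size:
                self._cache.popitem(last=False)   # evict least recently used URL

        def add_if_new(self, url: str) -> bool:
            """Return True if the URL has not been seen before (and record it)."""
            if url in self._cache:                # cache hit: no disk access needed
                self._cache.move_to_end(url)
                return False
            self.disk_lookups += 1                # cache miss: consult the disk set
            new = url not in self._disk
            if new:
                self._disk.add(url)
            self._cache_insert(url)
            return new

Feeding the same popular URLs repeatedly through add_if_new touches the disk-based set only on each URL’s first occurrence after it has fallen out of the cache; the remaining occurrences are the large fraction of URLs the crawler can discard without any disk access.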