clairvoyant caching and an infinite cache when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries, and we show how these caches can be implemented very efficiently.

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes. A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL’s host component (we sketch this kind of assignment in code below). A crawler that discovers a URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.

Cho and Garcia-Molina’s crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called “C-procs”). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning a URL to a C-proc based on a hash of the entire URL), site-based (assigning a URL to a C-proc based on a hash of the URL’s host part), and hierarchical (assigning a URL to a C-proc based on some property of the URL, such as its top-level domain).

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the “ants”). An ant that discovers a URL for which it is not responsible sends this URL to a dedicated process (the “controller”), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates failover to other crawling processes.
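To make the site-based partitioning described above concrete (hashing a URL’s host component, as in Mercator and in Cho and Garcia-Molina’s site-based scheme), the following sketch assigns a URL to one of num_crawlers crawler processes. It is an illustration only: the function name, the use of MD5, and the modulo mapping onto process indices are our assumptions, not details of any of the systems surveyed here.

    import hashlib
    from urllib.parse import urlsplit

    def responsible_crawler(url: str, num_crawlers: int) -> int:
        # Site-based partitioning: hash only the host component, so every
        # URL on the same host maps to the same crawler process.
        host = urlsplit(url).netloc.lower()
        # MD5 is stable across processes and machines, unlike Python's
        # built-in hash() of strings, which is salted per process.
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_crawlers

    # URLs on the same host go to the same process; URLs on different
    # hosts are spread roughly uniformly across all processes.
    assert responsible_crawler("http://example.com/a", 4) == \
           responsible_crawler("http://example.com/b/c", 4)

A crawler process would apply such a test to every extracted link and buffer the URLs destined for each peer, flushing each buffer in a single TCP message to amortize the per-message overhead mentioned above.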
Shkapenyuk and Suel’s crawler [35] is similar to Google’s; the different crawler components are implemented as different processes. A “crawling application” maintains the set of URLs to be downloaded, and schedules the order in which to download them. It sends download requests to a “crawl manager”, which forwards them to a pool of “downloader” processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk, and caching popular URLs in memory is a win: caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.

Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler [4], and Cho and Garcia-Molina’s crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends each link to the peer crawling process responsible for it. However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.

3. CACHING

In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network-accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right c
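As a rough illustration of how such a cache sits in front of the slower levels of this hierarchy (it is not the cache policies or data structures evaluated later in the paper), the sketch below keeps a small in-memory LRU cache of recently seen URLs and consults it before touching the disk-based seen-URL set; a hit means the URL can be discarded without any disk access. The class name, the 50,000-entry default (matching the cache sizes mentioned earlier), and the disk_set interface are hypothetical.

    from collections import OrderedDict

    class URLCache:
        # Small in-memory LRU cache of URLs that have already been seen
        # (or already forwarded to the responsible peer crawler).
        def __init__(self, capacity=50_000):
            self.capacity = capacity
            self._entries = OrderedDict()   # url -> None, ordered by recency

        def contains(self, url):
            if url in self._entries:
                self._entries.move_to_end(url)    # mark as recently used
                return True
            return False

        def add(self, url):
            self._entries[url] = None
            self._entries.move_to_end(url)
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)  # evict least recently used

    def already_seen(url, cache, disk_set):
        # Return True if the URL was seen before; otherwise record it as seen.
        if cache.contains(url):
            return True                     # fast path: no disk access needed
        seen = disk_set.contains(url)       # hypothetical disk-backed set
        if not seen:
            disk_set.add(url)
        cache.add(url)                      # remember the URL for next time
        return seen

In a distributed crawler of the kind surveyed in Section 2, the same check would also be applied before sending a URL to the responsible peer process, since a cache hit means the URL has already been forwarded once.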