Source: Bing Liu (2020), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data

High-level architecture of a scalable universal crawler
• Several parallel queues to spread load across servers (keep connections alive)
• DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching
• Optimize use of network bandwidth
• Optimize disk I/O throughput
• Huge farm of crawl machines

Universal crawlers: Policy
• Coverage
  – New pages get added all the time
  – Can the crawler find every page?
• Freshness
  – Pages change over time, get removed, etc.
  – How frequently can a crawler revisit?
• Trade-off!
  – Focus on most "important" pages (crawler bias)?
  – "Importance" is subjective

Maintaining a "fresh" collection
• Universal crawlers are never "done"
• High variance in rate and amount of page changes
• HTTP headers are notoriously unreliable
  – Last-Modified
  – Expires
• Solution
  – Estimate the probability that a previously visited page has changed in the meanwhile
  – Prioritize by this probability estimate

Preferential crawlers
• Assume we can estimate for each page an importance measure, I(p)
• Want to visit pages in order of decreasing I(p)
• Maintain the frontier as a priority queue sorted by I(p)
• Possible figures of merit:
  – Precision ~ |{p : crawled(p) & …}|

Open Source Crawlers
• Reference C implementation of HTTP, HTML parsing, etc.
  – w3c-libwww package from the World Wide Web Consortium
• LWP (Perl)
• Open source crawlers/search engines
  – Nutch (Jakarta Lucene)
  – Heritrix
  – WIRE
  – Terrier
• Open source topical crawlers, Best-First-N (Java)
• Evaluation framework for topical crawlers (Perl)

Web Crawler Implementation issues
• Fetching
• Parsing
• Stopword Removal and Stemming
• Link Extraction and Canonicalization
• Spider Traps
• Page Repository
• Concurrency

Implementation issues
• Don't want to fetch the same page twice!
  – Keep a lookup table (hash) of visited pages
  – What if a page is not visited but already in the frontier?
• The frontier grows very fast!
  – May need to prioritize for large crawls
• Fetcher must be robust!
  – Don't crash if a download fails
  – Timeout mechanism
• Determine file type to skip unwanted files
  – Can try using extensions, but not reliable
  – Can issue 'HEAD' HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests

Implementation issues
• Fetching
  – Get only the first 10–100 KB per page
  – Take care to detect and break redirection loops
  – Soft fail for timeout, server not responding, file not found, and other errors

Implementation issues: Parsing
• HTML has the structure of a DOM (Document Object Model) tree
• Unfortunately, actual HTML is often incorrect in a strict syntactic sense
• Crawlers, like browsers, must be robust/forgiving
• Fortunately there are tools that can help
• Must pay attention to HTML entities and Unicode in text
• What to do with a growing number of other formats?
  – Flash, SVG, RSS, AJAX…
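Because so much real-world HTML is broken, link extraction is best done with a forgiving parser. Below is a minimal sketch using only the Python standard library: html.parser keeps going on malformed markup and decodes HTML entities, while urljoin/urldefrag canonicalize extracted links (resolve relative URLs, drop fragments). The LinkExtractor class and the example URLs are illustrative, not from the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Forgiving link extractor: html.parser tolerates broken markup
    and decodes HTML entities (e.g. &amp;) in attribute values."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)  # also decode entities in text (the default)
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Canonicalize: resolve relative URLs against the base page
                # and drop #fragments so duplicates hash to the same URL.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.add(absolute)

# Deliberately sloppy input: unclosed tags, unquoted attribute, entity in a URL.
page = '<a href="/a?x=1&amp;y=2">one<a href=faq.html#q3>two'
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
print(sorted(extractor.links))
# ['http://example.com/a?x=1&y=2', 'http://example.com/faq.html']
```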
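Looking back at the fetching slides above, the same defensive rules fit in one short routine. This is a sketch under stated assumptions, not a production fetcher: the 100 KB cap follows the slide's guideline, the function name and user-agent string are invented, and the Content-Type check rides on the GET response rather than a separate HEAD request, avoiding the extra round trip the slides warn about.

```python
import urllib.error
import urllib.request

MAX_BYTES = 100_000   # keep only the first ~100 KB of each page
TIMEOUT = 10          # seconds; a hung server must not stall the crawl

def fetch(url, visited):
    """Fetch one page defensively; return its bytes, or None on any failure."""
    if url in visited:                 # never fetch the same page twice
        return None
    visited.add(url)
    req = urllib.request.Request(url, headers={"User-Agent": "toy-crawler/0.1"})
    try:
        # urlopen follows redirects but gives up after a fixed number of
        # hops, which also breaks simple redirection loops.
        with urllib.request.urlopen(req, timeout=TIMEOUT) as resp:
            # Skip unwanted file types: extensions lie, Content-Type less so.
            if resp.headers.get_content_type() != "text/html":
                return None
            return resp.read(MAX_BYTES)
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None                    # soft fail: timeout, DNS error, 404, ...

visited = set()
body = fetch("http://example.com/", visited)
print(len(body or b""), "bytes fetched")
```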
Implementation issues
• Stop words
  – Noise words that do not carry meaning should be eliminated ("stopped") before they are indexed
  – E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc.
  – Typically syntactic markers
  – Typically the most common terms
  – Typically kept in a negative dictionary (10–1,000 elements)
  – The parser can detect these right away and disregard them

Implementation issues: Conflation and thesauri
• Idea: improve recall by merging words with the same meaning
  1. We want to ignore superficial variations in word form (e.g., connect, connected, connecting)
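As a concrete illustration of the last two slides, the sketch below stops noise words with a toy negative dictionary and then conflates the survivors with the Porter stemmer, assuming NLTK is installed; index_terms and the ten-word stop list are invented for the example.

```python
from nltk.stem.porter import PorterStemmer  # assumes: pip install nltk

# Toy negative dictionary; real ones hold 10-1,000 terms.
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for", "of", "to", "in"}

stemmer = PorterStemmer()

def index_terms(text):
    """Lower-case, discard stop words right away, then conflate by stemming."""
    terms = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")     # crude punctuation trimming
        if token and token not in STOP_WORDS:  # drop noise words before indexing
            terms.append(stemmer.stem(token))  # studying/studies -> "studi"
    return terms

print(index_terms("The students were studying stemming for the exam."))
# -> ['student', 'were', 'studi', 'stem', 'exam']
```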