[Figure residue: keys k11, k33, k40, k65; label "Key Hash"; node N80 is down…]

Incremental Crawling: Problems
- Resources are limited
  - Network bandwidth, storage space
- How does the crawler system stay in sync with the changing Web?
- How can page change frequency be estimated in order to predict a page's next update time?
- How should the quality of the crawl results be measured?
- Is crawling at the predicted update time the optimal strategy?

State-of-the-Art Crawling Tech

Sitemaps: Above and Beyond the Crawl of Duty
Sitemaps! Sitemaps!
Uri Schonfeld (Google and UCLA)
Narayanan Shivakumar (Google)
Copyright Uri Schonfeld, April 2022

Dream of the Perfect Crawl
1. Users have high expectations:
   - Coverage: every page should be findable
   - Freshness: latest event, viral video, ...
   - Deep Web: ajax, flash, silverlight, ...
2. Search engines dream of the perfect crawl:
   - Everything the users want
   - …but efficient:
     - No 404s
     - No duplicates
3. Sitemaps to the rescue...

Sitemaps
- UniqueCoverage vs Domain Size
- 46% of domains have above 50% UniqueCoverage
- 12% of domains have 90% UniqueCoverage

Conclusion and Future Work
1. Large-scale study, real data
2. You cannot stop Discovery… yet.
3. Presented metrics for freshness and coverage.
4. Sitemaps evaluated for coverage and freshness.
5. Presented an algorithm to combine Sitemaps & Discovery
6. To be done
   1. Good news: tons of future work
   2. Duplicates not solved on the webserver side either.
   3. Better pings.
   4. Ranking Sitemaps URLs can be a challenge.

IRLbot: Scaling to 6 Billion Pages and Beyond
- WWW2022 Best Paper Award!
- "In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled […] billion valid HTML pages ([…] billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s)"

Challenges
- We identified several bottlenecks in building an efficient large-scale crawler and presented our solution to these problems
  - Scalability
    - Low-overhead disk-based data structures
  - Reputation and spam
    - Non-BFS crawling order
  - Politeness
    - Real-time reputation to guide the crawling rate

Summary of This Lecture
- Challenges facing a crawler
  - Scalable, fast, polite, robust, continuous
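- Basic techniques for achieving high efficiency
  - Cache
  - Prefetch
  - Concurrency
    - Multi-process / multi-thread
    - Asynchronous I/O
- Interesting techniques
  - Bloom filter
  - Consistent Hashing

Next Lecture
- Web graph and link analysis
- Homework
  - Find the eigenvalues and eigenvectors of the following matrix:

$$
A = \begin{pmatrix}
2 & 3 & 1 & 4 \\
0 & 1 & 2 & 1 \\
0 & 1 & 2 & 2 \\
0 & 1 & 1 & 2
\end{pmatrix}
$$

Thank You! Q&A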
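
Supplementary sketch: Consistent Hashing

The slides list Consistent Hashing among the interesting techniques, and the ring figure mentions node N80 going down. The following is a minimal sketch of that idea, not the implementation used in the lecture. It assumes SHA-1 as the ring hash and 50 virtual points per node. The class `ConsistentHashRing`, the helper `ring_position`, and the node names N20, N40 and N60 are hypothetical, introduced only for illustration. Unlike the figure, where the labels look like ring positions, this sketch hashes the names themselves.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Hash a string to an integer position on the ring (SHA-1, 160 bits)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class ConsistentHashRing:
    """Illustrative helper: a key belongs to the first node point clockwise from it."""

    def __init__(self, nodes=(), replicas=50):
        self.replicas = replicas  # virtual points per node, to even out the load
        self._points = []         # sorted ring positions of all virtual points
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            point = ring_position(f"{node}#{i}")
            if point not in self._owner:  # ignore the (very unlikely) collision
                bisect.insort(self._points, point)
                self._owner[point] = node

    def remove_node(self, node):
        # Dropping a node only deletes its own points; all others stay put.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring has no nodes")
        # First point strictly clockwise from the key, wrapping around the ring.
        idx = bisect.bisect_right(self._points, ring_position(key))
        return self._owner[self._points[idx % len(self._points)]]


if __name__ == "__main__":
    ring = ConsistentHashRing(["N20", "N40", "N60", "N80"])
    keys = ["k11", "k33", "k40", "k65"] + [f"k{i}" for i in range(100, 1100)]
    before = {k: ring.lookup(k) for k in keys}
    ring.remove_node("N80")  # N80 is down
    after = {k: ring.lookup(k) for k in keys}
    moved = [k for k in keys if before[k] != after[k]]
    # Every key that moved was previously owned by N80.
    assert all(before[k] == "N80" for k in moved)
    print(f"{len(moved)} of {len(keys)} keys were reassigned")
```

In a distributed crawler the keys would typically be host names and the nodes crawler machines. When one machine fails, only the hosts it owned are handed to other machines, so the rest of the crawl state stays where it is.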