Source: Bing Liu (2020), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data

High-level architecture of a scalable universal crawler
• Several parallel queues to spread load across servers (keep connections alive)
• DNS server using UDP (less overhead than TCP), large persistent in-memory cache, and prefetching
• Optimize use of network bandwidth
• Optimize disk I/O throughput
• Huge farm of crawl machines

Universal crawlers: Policy
• Coverage
  – New pages get added all the time
  – Can the crawler find every page?
• Freshness
  – Pages change over time, get removed, etc.
  – How frequently can a crawler revisit?
• Trade-off!
  – Focus on most "important" pages (crawler bias)?
  – "Importance" is subjective

Maintaining a "fresh" collection
• Universal crawlers are never "done"
• High variance in rate and amount of page changes
• HTTP headers are notoriously unreliable
  – Last-Modified
  – Expires
• Solution
  – Estimate the probability that a previously visited page has changed in the meanwhile
  – Prioritize by this probability estimate

Preferential crawlers
• Assume we can estimate for each page an importance measure, I(p)
• Want to visit pages in order of decreasing I(p)
• Maintain the frontier as a priority queue sorted by I(p)
• Possible figures of merit:
  – Precision ~ |{p : crawled(p) & …}|

Open Source Crawlers
• Reference C implementation of HTTP, HTML parsing, etc.
  – w3c-libwww package from the World Wide Web Consortium
• LWP (Perl)
• Open source crawlers/search engines
  – Nutch (Jakarta Lucene)
  – Heritrix
  – WIRE
  – Terrier
• Open source topical crawlers, Best-First-N (Java)
• Evaluation framework for topical crawlers (Perl)

Web Crawler Implementation issues
• Fetching
• Parsing
• Stopword Removal and Stemming
• Link Extraction and Canonicalization
• Spider Traps
• Page Repository
• Concurrency

Implementation issues
• Don't want to fetch the same page twice!
  – Keep a lookup table (hash) of visited pages
  – What if a page is not visited but already in the frontier?
• The frontier grows very fast!
  – May need to prioritize for large crawls
• Fetcher must be robust!
  – Don't crash if a download fails
  – Timeout mechanism
• Determine file type to skip unwanted files
  – Can try using extensions, but not reliable
  – Can issue 'HEAD' HTTP commands to get Content-Type (MIME) headers, but overhead of extra Internet requests

Implementation issues
• Fetching
  – Get only the first 10–100 KB per page
  – Take care to detect and break redirection loops
  – Soft fail for timeout, server not responding, file not found, and other errors

Implementation issues: Parsing
• HTML has the structure of a DOM (Document Object Model) tree
• Unfortunately, actual HTML is often incorrect in a strict syntactic sense
• Crawlers, like browsers, must be robust/forgiving
• Fortunately there are tools that can help
• Must pay attention to HTML entities and Unicode in text
• What to do with a growing number of other formats?
  – Flash, SVG, RSS, AJAX…
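Because so much real-world HTML is broken, link extraction is best done with a forgiving parser. Below is a minimal sketch using only the Python standard library: html.parser keeps going on malformed markup and decodes HTML entities, while urljoin/urldefrag canonicalize extracted links (resolve relative URLs, drop fragments). The LinkExtractor class and the example URLs are illustrative, not from the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Forgiving link extractor: html.parser tolerates broken markup
    and decodes HTML entities (e.g. &amp;) in attribute values."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)  # also decode entities in text (the default)
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Canonicalize: resolve relative URLs against the base page
                # and drop #fragments so duplicates hash to the same URL.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.add(absolute)

# Deliberately sloppy input: unclosed tags, unquoted attribute, entity in a URL.
page = '<a href="/a?x=1&amp;y=2">one<a href=faq.html#q3>two'
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
print(sorted(extractor.links))
# ['http://example.com/a?x=1&y=2', 'http://example.com/faq.html']
```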
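Looking back at the fetching slides above, the same defensive rules fit in one short routine. This is a sketch under stated assumptions, not a production fetcher: the 100 KB cap follows the slide's guideline, the function name and user-agent string are invented, and the Content-Type check rides on the GET response rather than a separate HEAD request, avoiding the extra round trip the slides warn about.

```python
import urllib.error
import urllib.request

MAX_BYTES = 100_000   # keep only the first ~100 KB of each page
TIMEOUT = 10          # seconds; a hung server must not stall the crawl

def fetch(url, visited):
    """Fetch one page defensively; return its bytes, or None on any failure."""
    if url in visited:                 # never fetch the same page twice
        return None
    visited.add(url)
    req = urllib.request.Request(url, headers={"User-Agent": "toy-crawler/0.1"})
    try:
        # urlopen follows redirects but gives up after a fixed number of
        # hops, which also breaks simple redirection loops.
        with urllib.request.urlopen(req, timeout=TIMEOUT) as resp:
            # Skip unwanted file types: extensions lie, Content-Type less so.
            if resp.headers.get_content_type() != "text/html":
                return None
            return resp.read(MAX_BYTES)
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None                    # soft fail: timeout, DNS error, 404, ...

visited = set()
body = fetch("http://example.com/", visited)
print(len(body or b""), "bytes fetched")
```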
Implementation issues
• Stop words
  – Noise words that do not carry meaning should be eliminated ("stopped") before they are indexed
  – E.g. in English: AND, THE, A, AT, OR, ON, FOR, etc.
  – Typically syntactic markers
  – Typically the most common terms
  – Typically kept in a negative dictionary (10–1,000 elements)
  – The parser can detect these right away and disregard them

Implementation issues: Conflation and thesauri
• Idea: improve recall by merging words with the same meaning
  1. We want to ignore superficial variations in word form (e.g., connect, connected, connecting)
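As a concrete illustration of the last two slides, the sketch below stops noise words with a toy negative dictionary and then conflates the survivors with the Porter stemmer, assuming NLTK is installed; index_terms and the ten-word stop list are invented for the example.

```python
from nltk.stem.porter import PorterStemmer  # assumes: pip install nltk

# Toy negative dictionary; real ones hold 10-1,000 terms.
STOP_WORDS = {"and", "the", "a", "at", "or", "on", "for", "of", "to", "in"}

stemmer = PorterStemmer()

def index_terms(text):
    """Lower-case, discard stop words right away, then conflate by stemming."""
    terms = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")     # crude punctuation trimming
        if token and token not in STOP_WORDS:  # drop noise words before indexing
            terms.append(stemmer.stem(token))  # studying/studies -> "studi"
    return terms

print(index_terms("The students were studying stemming for the exam."))
# -> ['student', 'were', 'studi', 'stem', 'exam']
```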