

Web Mining (網路探勘)

  

Crawler etiquette (important!)

• Identify your crawler
  – Use an HTTP request header (e.g. 'From') to specify the crawler developer
  – Do not disguise the crawler as a browser by sending a browser's 'User-Agent' string
• Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem
• Pay attention to anything that may lead to too many requests to any one server, even unintentionally, e.g.:
  – redirection loops
  – spider traps
• Spread the load; do not overwhelm a server
  – Make sure to send no more than some maximum number of requests to any single server per unit time, say one per second (a fetch sketch follows the robots.txt examples below)
• Honor the Robot Exclusion Protocol
  – A server can specify which parts of its document tree crawlers are or are not allowed to crawl in a file named 'robots.txt' placed in the HTTP root directory
  – A crawler should always check, parse, and obey this file before sending any requests to a server
  – More information is available on the Robot Exclusion Protocol's reference pages

More on robot exclusion

• Make sure URLs are canonical before checking them against robots.txt
• Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler (see the caching sketch below)

Example robots.txt in which all crawlers can go anywhere:

    User-agent: *
    Disallow:

Example robots.txt (excerpt) in which all crawlers are kept out of the listed paths:

    User-agent: *
    Disallow: /canada/Library/mnp/2/aspx/
    Disallow: /communities/
    Disallow: /communities/blogs/
    Disallow: /downloads/
    Disallow: /france/formation/centres/
    Disallow: /france/
    Disallow: /germany/library/images/mnp/
    Disallow: /germany/
    Disallow: /ie/ie40/
    Disallow: /info/
    Disallow: /intlkb/
    Disallow: /isapi/
    etc.

Example robots.txt (fragment) with per-crawler rules:

    User-agent: Googlebot
    Disallow: /chl/*
    Disallow: /uk/*
    Disallow: /italy/*
    Disallow: /france/*

    User-agent: slurp
    Disallow:
    Crawl-delay: 2

    User-agent: MSNBot
    Disallow:
    Crawl-delay: 2

    User-agent: scooter
    Disallow:

    User-agent: *
    Disallow: /

The Google crawler is allowed everywhere except the listed paths; Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down (a crawl delay of 2 seconds); AltaVista (scooter) has no limits; everyone else must keep off.

Source: Bing Liu (2020), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data.
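The etiquette rules above can be made concrete in a few lines of code. Below is a minimal sketch, assuming only Python's standard library: it sends an honest User-Agent string, keeps to roughly one request per second per server, and inspects HTTP status and error codes. The names polite_get, USER_AGENT, and MIN_DELAY, and the example identification string, are illustrative rather than taken from the source.

    import time
    import urllib.error
    import urllib.request
    from urllib.parse import urlsplit

    # Illustrative identification string; a real crawler should point to its
    # own documentation page and contact address, not imitate a browser.
    USER_AGENT = "ExampleCrawler/0.1 (+https://example.org/crawler-info)"
    MIN_DELAY = 1.0        # target: at most ~1 request per second per server
    _last_hit = {}         # host -> time of the last request to that host

    def polite_get(url, timeout=10):
        host = urlsplit(url).netloc
        # Spread the load: wait if this host was contacted too recently.
        last = _last_hit.get(host)
        if last is not None:
            wait = MIN_DELAY - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        _last_hit[host] = time.monotonic()

        # Identify the crawler honestly instead of faking a browser User-Agent.
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Use the error code to decide what to do, e.g. back off on 429/503.
            print(f"HTTP {err.code} for {url}")
            return None
        except urllib.error.URLError as err:
            print(f"Could not reach {url}: {err.reason}")
            return None

Checking the error code matters because responses such as 429 (Too Many Requests) or 503 (Service Unavailable) are the server asking the crawler to slow down or back off.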
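For the robot-exclusion advice (canonicalize URLs, cache each server's policy), a sketch in the same style can lean on Python's standard urllib.robotparser. The helpers canonicalize() and allowed() and the simple per-host cache are assumptions made for illustration, not the book's implementation.

    import urllib.robotparser
    from urllib.parse import urlsplit, urlunsplit

    USER_AGENT = "ExampleCrawler/0.1"   # should match the fetcher's User-Agent
    _robots_cache = {}                  # host -> parsed robots.txt policy

    def canonicalize(url):
        # Lower-case scheme and host, default the path to "/", drop the fragment.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.query, ""))

    def allowed(url):
        url = canonicalize(url)
        parts = urlsplit(url)
        rp = _robots_cache.get(parts.netloc)
        if rp is None:
            # Fetch and parse robots.txt once per host, then reuse the policy.
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass   # unreachable robots.txt: can_fetch() stays conservative
            _robots_cache[parts.netloc] = rp
        # rp.crawl_delay(USER_AGENT) also exposes any Crawl-delay directive.
        return rp.can_fetch(USER_AGENT, url)

    # Usage: check the policy before handing a URL to the rate-limited fetcher.
    # if allowed("http://Example.COM/some/page#intro"):
    #     ...fetch it...

Caching the parsed policy per host means robots.txt is downloaded once per server rather than once per request, which is exactly the point of the 'More on robot exclusion' advice above.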
More crawler ethics issues

• Is compliance with robot exclusion a matter of law?
  – No! Compliance is voluntary, but if you do not comply, you may be blocked
  – Someone (unsuccessfully) sued the Internet Archive over a related issue
• Some crawlers disguise themselves
  – Using a false 'User-Agent'
  – Randomizing access frequency to look like a human with a browser
  – Example: click fraud on ads
• Servers can disguise themselves, too
  – Cloaking: presenting different content depending on the requester's 'User-Agent'
  – e.g. stuffing keywords into the version of a page shown only to the search engine crawler
  – Search engines do not look kindly on this type of "spamdexing" and remove from their index sites that perform such abuse
• One such cloaking case made the news

Gray areas for crawler ethics

• If you write a crawler that unintentionally follows links to ads, are you just being careless, are you violating terms of service, or are you breaking the law by defrauding advertisers?
  – Is non-compliance with Google's robots.txt in this case equivalent to click fraud?
• If you write a browser extension that performs some useful service, should you comply with robot exclusion?

Source: Bing Liu (2020), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data.

Summary

• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Crawler ethics and conflicts

References

• Bing Liu (2020), "Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data," 2nd Edition, Springer.