

Web Mining (網路探勘)

  

Crawler etiquette (important!)

• Identify your crawler
  – Use an HTTP request header (e.g. 'From') to specify the crawler developer
  – Do not disguise the crawler as a browser by sending a browser's 'User-Agent' string
• Always check that HTTP requests are successful, and in case of error, use the HTTP error code to determine and immediately address the problem
• Pay attention to anything that may lead to too many requests to any one server, even unintentionally, e.g.:
  – redirection loops
  – spider traps
• Spread the load; do not overwhelm a server
  – Make sure to send no more than some maximum number of requests to any single server per unit time, say one per second (a fetch sketch follows the robots.txt examples below)
• Honor the Robot Exclusion Protocol
  – A server can specify which parts of its document tree crawlers are or are not allowed to crawl in a file named 'robots.txt' placed in the HTTP root directory
  – A crawler should always check, parse, and obey this file before sending any requests to a server
  – More information is available on the Robot Exclusion Protocol's reference pages

More on robot exclusion

• Make sure URLs are canonical before checking them against robots.txt
• Avoid fetching robots.txt for each request to a server by caching its policy as relevant to this crawler (see the caching sketch below)

Example robots.txt in which all crawlers can go anywhere:

    User-agent: *
    Disallow:

Example robots.txt (excerpt) in which all crawlers are kept out of the listed paths:

    User-agent: *
    Disallow: /canada/Library/mnp/2/aspx/
    Disallow: /communities/
    Disallow: /communities/blogs/
    Disallow: /downloads/
    Disallow: /france/formation/centres/
    Disallow: /france/
    Disallow: /germany/library/images/mnp/
    Disallow: /germany/
    Disallow: /ie/ie40/
    Disallow: /info/
    Disallow: /intlkb/
    Disallow: /isapi/
    etc.

Example robots.txt (fragment) with per-crawler rules:

    User-agent: Googlebot
    Disallow: /chl/*
    Disallow: /uk/*
    Disallow: /italy/*
    Disallow: /france/*

    User-agent: slurp
    Disallow:
    Crawl-delay: 2

    User-agent: MSNBot
    Disallow:
    Crawl-delay: 2

    User-agent: scooter
    Disallow:

    User-agent: *
    Disallow: /

The Google crawler is allowed everywhere except the listed paths; Yahoo (slurp) and MSN/Windows Live (MSNBot) are allowed everywhere but should slow down (a crawl delay of 2 seconds); AltaVista (scooter) has no limits; everyone else must keep off.

Source: Bing Liu (2020), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data.
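The etiquette rules above can be made concrete in a few lines of code. Below is a minimal sketch, assuming only Python's standard library: it sends an honest User-Agent string, keeps to roughly one request per second per server, and inspects HTTP status and error codes. The names polite_get, USER_AGENT, and MIN_DELAY, and the example identification string, are illustrative rather than taken from the source.

    import time
    import urllib.error
    import urllib.request
    from urllib.parse import urlsplit

    # Illustrative identification string; a real crawler should point to its
    # own documentation page and contact address, not imitate a browser.
    USER_AGENT = "ExampleCrawler/0.1 (+https://example.org/crawler-info)"
    MIN_DELAY = 1.0        # target: at most ~1 request per second per server
    _last_hit = {}         # host -> time of the last request to that host

    def polite_get(url, timeout=10):
        host = urlsplit(url).netloc
        # Spread the load: wait if this host was contacted too recently.
        last = _last_hit.get(host)
        if last is not None:
            wait = MIN_DELAY - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        _last_hit[host] = time.monotonic()

        # Identify the crawler honestly instead of faking a browser User-Agent.
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Use the error code to decide what to do, e.g. back off on 429/503.
            print(f"HTTP {err.code} for {url}")
            return None
        except urllib.error.URLError as err:
            print(f"Could not reach {url}: {err.reason}")
            return None

Checking the error code matters because responses such as 429 (Too Many Requests) or 503 (Service Unavailable) are the server asking the crawler to slow down or back off.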
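For the robot-exclusion advice (canonicalize URLs, cache each server's policy), a sketch in the same style can lean on Python's standard urllib.robotparser. The helpers canonicalize() and allowed() and the simple per-host cache are assumptions made for illustration, not the book's implementation.

    import urllib.robotparser
    from urllib.parse import urlsplit, urlunsplit

    USER_AGENT = "ExampleCrawler/0.1"   # should match the fetcher's User-Agent
    _robots_cache = {}                  # host -> parsed robots.txt policy

    def canonicalize(url):
        # Lower-case scheme and host, default the path to "/", drop the fragment.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.query, ""))

    def allowed(url):
        url = canonicalize(url)
        parts = urlsplit(url)
        rp = _robots_cache.get(parts.netloc)
        if rp is None:
            # Fetch and parse robots.txt once per host, then reuse the policy.
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass   # unreachable robots.txt: can_fetch() stays conservative
            _robots_cache[parts.netloc] = rp
        # rp.crawl_delay(USER_AGENT) also exposes any Crawl-delay directive.
        return rp.can_fetch(USER_AGENT, url)

    # Usage: check the policy before handing a URL to the rate-limited fetcher.
    # if allowed("http://Example.COM/some/page#intro"):
    #     ...fetch it...

Caching the parsed policy per host means robots.txt is downloaded once per server rather than once per request, which is exactly the point of the 'More on robot exclusion' advice above.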
More crawler ethics issues

• Is compliance with robot exclusion a matter of law?
  – No! Compliance is voluntary, but if you do not comply, you may be blocked
  – Someone (unsuccessfully) sued the Internet Archive over a related issue
• Some crawlers disguise themselves
  – Using a false 'User-Agent'
  – Randomizing access frequency to look like a human with a browser
  – Example: click fraud on ads
• Servers can disguise themselves, too
  – Cloaking: presenting different content depending on the requester's 'User-Agent'
  – e.g. stuffing keywords into the version of a page shown only to the search engine crawler
  – Search engines do not look kindly on this type of "spamdexing" and remove from their index sites that perform such abuse
• One such cloaking case made the news

Gray areas for crawler ethics

• If you write a crawler that unintentionally follows links to ads, are you just being careless, are you violating terms of service, or are you breaking the law by defrauding advertisers?
  – Is non-compliance with Google's robots.txt in this case equivalent to click fraud?
• If you write a browser extension that performs some useful service, should you comply with robot exclusion?

Source: Bing Liu (2020), Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data.

Summary

• Motivation and taxonomy of crawlers
• Basic crawlers and implementation issues
• Universal crawlers
• Preferential (focused and topical) crawlers
• Crawler ethics and conflicts

References

• Bing Liu (2020), "Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data," 2nd Edition, Springer.