The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305

Abstract

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype, with a full text and hyperlink database of at least 24 million pages, is available at http://google.stanford.edu/.

To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine, the first such detailed public description we know of to date.

Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google

1. Introduction

(Note: There are two versions of this paper, a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)

People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! or with search engines. Automated search engines that rely on keyword matching usually return too many low quality matches, and to make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100.

1.1 Web Search Engines: Scaling Up, 1994-2000

As of November 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm [McBryan 94] received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and in scalability, introduced by scaling search engine technology to such extraordinary numbers.
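The growth figures above can be sanity-checked with a short back-of-the-envelope calculation. The Python sketch below is our own illustration, not part of the original paper: the 1994 and 1997 query volumes are the ones quoted in the text, while the 2.5x-per-year rate assumed for 1997-2000 is a deliberately conservative guess used only to show that "hundreds of millions of queries per day" is a plausible figure.

```python
# Back-of-the-envelope check of the query-volume figures in Section 1.1.
# The 1994 and 1997 numbers are quoted from the text; the growth
# assumptions for 1997-2000 are ours, purely for illustration.

wwww_1994 = 1_500             # queries/day, World Wide Web Worm, spring 1994
altavista_1997 = 20_000_000   # queries/day, Altavista, November 1997
years_94_to_97 = 3.5          # roughly April 1994 to November 1997

# Implied annual growth factor, assuming smooth exponential growth
# between the two reported data points.
implied_growth = (altavista_1997 / wwww_1994) ** (1 / years_94_to_97)
print(f"implied 1994-97 growth: about {implied_growth:.0f}x per year")

# Even if growth slows sharply, "hundreds of millions of queries per
# day by 2000" is easy to reach: assume a much more modest 2.5x per
# year over the three years 1997-2000.
modest_growth = 2.5
projected_2000 = altavista_1997 * modest_growth ** 3
print(f"projected queries/day in 2000 at 2.5x/year: {projected_2000:,.0f}")
```

With the quoted numbers this prints an implied historical growth of roughly 15x per year, so even the far slower assumed rate still lands above 300 million queries per day in 2000.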
1.2 Google: Scaling with the Web

Creating a search engine which scales even to today's web presents many challenges. We expect, however, that the cost to index and store text or HTML will eventually decline relative to the amount that will be available (see Appendix B), which results in favorable scaling properties for centralized systems like Google.

1.3 Design Goals

1.3.1 Improved Search Quality

In 1994, some people believed that a complete search index would make it possible to find anything easily (see, for example, "Best of the Web 1994 - Navigators"). However, the Web of 1997 is quite different. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results). One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not.

1.3.2 Academic Search Engine Research

Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains. This number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial one. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented (see Appendix A). With Google, we have a strong goal to push more development and understanding into the academic realm.

Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems.

References

[Bagdikian 97] Ben H. Bagdikian. The Media Monopoly. 5th Edition. Beacon, 1997. ISBN: 0807061557.

[Chakrabarti 98] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.

[Cho 98] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.

[McBryan 94] Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. First International Conference on the World Wide Web. CERN, Geneva, May 1994.

[Witten 94] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.

8 Appendix A: Advertising and Mixed Motives

Currently, the predominant business model for commercial search engines is advertising, and the goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for "cellular phone" is "The Effect of Cellular Phone Use Upon Driver Attention", a study which explains in great detail the distractions and risks associated with conversing on a cell phone while driving. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers.

A blatant example was the OpenText search engine, which was reported to be selling companies the right to be listed at the top of the search results for particular queries. This business model resulted in an uproar, and OpenText has ceased to be a serious search engine. Less blatantly, a search engine could add a small factor to search results from "friendly" companies and subtract a factor from results from its competitors. Furthermore, advertising income often provides an incentive to return poor quality search results. For example, we noticed that a major search engine would not return a large airline's homepage when the airline's name was given as a query; the airline had placed an expensive ad linked to the query that was its name. A better search engine would not have required this ad, possibly resulting in the loss of the revenue from the airline to the search engine. In general, the better the search engine is, the fewer advertisements a consumer needs in order to find what they want, and this of course erodes the advertising-supported business model of the existing search engines. For these reasons, we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.

9 Appendix B: Scalability

We have designed Google to be scalable in the near term to a goal of roughly 100 million web pages. We have just received the disks and machines needed to handle roughly that amount, and we expect most of our data structures to deal gracefully with the expansion. At that scale, however, we will be pushing up against all sorts of operating system limits: addressable memory, the number of open file descriptors, network bandwidth and sockets, and many others.

Because the cost of producing text is low compared to media such as video, text is likely to remain very pervasive, and this provides amazing possibilities for centralized indexing. Here is an illustrative example. Suppose we want to index everything everyone in the US has written for a year. We assume that there are 250 million people in the US and that they write an average of 10k per day; that works out to about 850 terabytes. We also assume that indexing a terabyte can be done now for a reasonable cost and that the indexing methods used over the text are linear, or nearly linear, in their complexity. Moore's Law was defined in 1965 as a doubling every 18 months in processor power. If we assume it continues to hold, we need only about 10 more doublings, or 15 years, to reach our goal of indexing everything everyone in the US has written for a year, at a price a small company could afford. (A worked version of this arithmetic appears in the sketch at the end of this document.)

Of course, a distributed system such as Gloss or Harvest will often be a more efficient and elegant technical solution for indexing, but it seems difficult to convince the world to use such systems because of the high cost of setting up and administering large numbers of installations. And because humans can only type or speak a finite amount while computers keep improving, text indexing should scale even better in the future than it does now.

Vitae

Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. He is a recipient of a National Science Foundation Graduate Fellowship.

Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering from the University of Michigan in 1995. Some of his research interests include the link structure of the web, human-computer interaction, search engines, scalability of information access interfaces, and personal data mining.
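To make the arithmetic in Appendix B concrete, the short Python sketch below reproduces the estimate. It is our own illustration, not part of the original paper: the 250 million population, the 10k-per-day writing rate, the one-terabyte baseline, and the 18-month doubling period are taken from the text above, while the use of binary units (1k = 1024 bytes, 1 TB = 2^40 bytes) is our assumption about how the 850 terabyte figure was reached.

```python
# Worked version of the arithmetic in Appendix B: how long until it is
# affordable to index everything everyone in the US writes in a year?
# The inputs follow the text; the binary-unit conversions are ours.

import math

population = 250_000_000               # people in the US
bytes_per_person_per_day = 10 * 1024   # "10k" of text per person per day
days_per_year = 365

total_bytes = population * bytes_per_person_per_day * days_per_year
total_tb = total_bytes / 2**40         # terabytes, binary units
print(f"text written per year: about {total_tb:.0f} TB")   # ~850 TB

# Assume indexing 1 TB is affordable today, and that capability doubles
# every 18 months (Moore's Law, as stated in the text).
doublings = math.ceil(math.log2(total_tb / 1.0))
years = doublings * 18 / 12
print(f"doublings needed: {doublings}, i.e. about {years:.0f} years")   # 10, ~15 years
```

Under these assumptions the script prints roughly 850 TB and 10 doublings, i.e. about 15 years, matching the figures given in the appendix.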