The components work together to produce search results. In this architecture, Google implemented the PageRank algorithm to determine the relevance of results; the algorithm is explained further in the next section. The crawlers work 24/7, traversing hyperlinks and downloading Web content into storage. All of the content is parsed, indexed, and stored in another storage area. The index is then inverted so that each single term can be related to the many documents that contain it. The PageRank algorithm ranks Web pages based on the citation principle: the more links that refer to a particular page, the higher its score. The weight of the referring pages is also taken into account; a link from a high-weight page contributes more to a page's score, and a higher score leads PageRank to rank that page higher in the results. PageRank counts links from all pages and normalizes each page's contribution by its number of outgoing links. The basic formula given by Brin [3] is:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where PR(A) is the PageRank of page A; T1 ... Tn are the pages linking to page A; PR(T1) is the PageRank value of page T1; d is the damping factor, which can be set between 0 and 1; and C(A) is the number of links going out of page A.

Nutch is an open-source search engine developed by Doug Cutting [13]. Nutch is an extension of Lucene [14], an open-source information-retrieval system, and most of the Lucene libraries are used in Nutch. Most of the packages in the figure provide Nutch functionality such as indexing and searching capabilities. According to Cutting [13], Nutch consists of two main components:

a) Crawler
   - Webdb
   - Fetcher
   - Indexer
   - Segments
b) Searcher

Webdb is a persistent database that tracks each page, its relevant links, the date it was last crawled, and other facts. In addition, Webdb stores an image of the Web graph. The fetcher, on the other hand, is what makes the crawler a crawler.
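The recurrence above can be approximated by repeatedly applying the formula until the scores settle. The following is a minimal single-machine sketch under our own assumptions: the function name `pagerank`, the toy link graph, and the fixed iteration count are illustrative choices, not taken from the source or from Google's implementation.

```python
def pagerank(links, d=0.85, iterations=20):
    """Iteratively apply PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over in-links.

    links: dict mapping each page to the list of pages it links out to.
    Returns a dict of PageRank scores after a fixed number of iterations.
    """
    # Collect every page, including pages that only appear as link targets.
    pages = set(links) | {p for out in links.values() for p in out}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the (1 - d) base term.
        new_pr = {p: 1 - d for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            # C(page): each out-link carries an equal share of pr[page].
            share = pr[page] / len(outgoing)
            for target in outgoing:
                new_pr[target] += d * share
        pr = new_pr
    return pr

# Toy graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph C has two incoming links (from A and B), so it ends up with the highest score, while B, which receives only half of A's weight, ranks lowest.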
The fetcher basically crawls from one Web site to another and fetches the content back into the system. The indexer uses the content fetched by the fetcher to generate an inverted index. The inverted index is then divided into segments, which can be used by the searcher to display query results. The searcher component provides the interface for users to conduct searches; it requires Tomcat as its servlet container. The details of Nutch's skeleton are discussed in Part IV.

Nutch leverages distributed computing to process large data sets [15]. The distributed file system it uses is Hadoop [16], which is also used by Yahoo! for its search-engine system. Hadoop uses a programming model called MapReduce, which was developed by Google [7]. In this model, a set of key/value pairs is used as the computation input. The map function parses the task from this input and generates intermediate keys; these intermediate keys become the input to the reduce function, which merges values sharing the same key and produces an output. Nutch expresses its fetching, crawling, and indexing tasks as such sets of keys and values, which are replicated across various slave machines and computed there [15]. The results are then merged together in a designated location for use by the searcher.

Liyi Zhang [17] conducted research on using an ontology to improve search accuracy. The retrieval system is an e-commerce product-retrieval system that uses an ontology-based adapted Vector Space Model. It modifies the existing vector space model to treat documents as collections of concepts instead of collections of keywords. To determine the similarity between documents and a user query, it uses weights calculated with the tf-idf scheme (term frequency, in this case concept frequency, and inverted document frequency). According to Liyi Zhang [17], the system conducts parallel searches using OAVSM and SPARQL information retrieval; both sets of results are matched and ranked, and the best result is presented to the user.
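The map/reduce flow described above can be sketched in miniature. The word-count job below is a standard classroom illustration of the model, not Nutch's or Hadoop's actual code; all names (`map_phase`, `reduce_phase`, `run_job`) are ours, and a real Hadoop job would shuffle the intermediate keys across slave machines rather than grouping them in one process.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit an intermediate (term, 1) pair for each token in a document."""
    for term in text.lower().split():
        yield term, 1

def reduce_phase(term, counts):
    """Reduce: merge all values that share the same intermediate key."""
    return term, sum(counts)

def run_job(documents):
    """Single-process sketch of the MapReduce flow: map every document,
    group intermediate pairs by key, then reduce each group."""
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            intermediate[key].append(value)
    return dict(reduce_phase(k, v) for k, v in intermediate.items())

counts = run_job({"d1": "nutch crawls the web", "d2": "the web is large"})
```

Here the grouping step stands in for Hadoop's shuffle: it is the point at which every value for a given key is brought together before the reduce function merges them.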
Our project focuses on the development of a search-engine extension named Zenith. This extension is a plugin for Nutch [18] that enables it to function as a semantic search engine. By integrating Zenith and Nutch, the two work together as a hybrid semantic search engine, which serves as a proof of concept for our research.

METHODOLOGY

Zenith's development uses a combination of reusable prototyping and component-based development, as shown in Figure 1. Development begins with a literature review. Components that can be reused in this project are also identified; this process is called domain engineering. Domain engineering is the process of identifying the software components that are applicable to Zenith's development [19]. Each of Zenith's functions is compartmentalized into components. In the component subphase, the reusable prototyping model is implemented; in general, the whole system follows reusable prototyping.

This methodology is best suited for Zenith's development for several reasons. Zenith's architecture is highly modular, and components from past projects can be reused in its development. As mentioned above, Zenith's development is highly unpredictable, and this methodology accommodates that unpredictability: for instance, it allows the developer to experiment with components and methods and to test selected components as proofs of concept. The way the methodology flows allows the developer to go back to a previous phase to make modifications. In addition, the risk of progress being hindered by a developer's mental block is reduced, as the developer can shift development effort to other components.

Figure 1. Zenith Methodology

Aside from adapting to the development requirements of the system, this methodology will increase the system's maintainability and scalability. A highly scalable system will be able to cater to a large number of users in accordance to th