[Main text]
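application-level logs, i.e., logs of accesses to semantically relevant objects. An example of an application-level log is one recording that the user entered the site from the home page, then visited a sports page with news on a soccer team, and so on. This would require a system module that monitors user steps at a semantic level of granularity. In the ClickWorld project such a module is called ClickObserve. Unfortunately, however, the module is a deliverable of the project, and it was not available for collecting data at the beginning of the project.

Therefore, we decided to extract both syntactic and semantic information from URLs via a semi-automatic approach. The approach consists of reverse-engineering the URLs, starting from the web site designer's description of the meaning of each URL path, web page id and web page parameters. Using a PERL script, and starting from the designer's description, we extracted the following information from the original URLs:

- the local web server (i.e., … or …, etc.), which gives us some spatial information about user interests;
- a first-level classification of URLs into 24 types, some of which are: home, news, finance, photos, jokes, shopping, forum, pubs;
- a second-level classification of URLs, depending on the first-level one; e.g., URLs classified as shopping may be further classified as book shopping or PC shopping, and so on;
- a third-level classification of URLs, depending on the second-level one; e.g., URLs classified as book shopping may be further classified as programming book shopping or narrative book shopping, and so on;
- parameter information, further detailing the three-level classification; e.g., URLs classified as programming book shopping may have the ISBN book code as a parameter;
- the depth of the classification, i.e., 1 if the URL has only a first-level classification, 2 if the URL has a first- and second-level classification, and so on.

Of course, the adopted approach is mainly heuristic, with the classification hierarchy designed after the fact. Moreover, the hierarchy does not exploit any content-based classification: the description of a news item, such as the sports news item with id 12345, is its code (i.e., first level is news), with no reference to the content of the news.

Appendix 2: Original Foreign-Language Text

Preprocessing and Mining Web Log Data for Web Personalization

M. Baglioni1, U. Ferrara2, A. Romei1, S. Ruggieri1, and F. Turini1

1 Dipartimento di Informatica, Università di Pisa, Via F. Buonarroti 2, 56125 Pisa Italy fbaglioni,romei,ruggieri,
2 KSolutions . Via Lenin 132/26, 56017 S. Martino Ulmiano (PI) Italy

Abstract. We describe the web usage mining activities of an ongoing project, called ClickWorld, that aims at extracting models of the navigational behaviour of a web site's users. The models are inferred from the access logs of a web server by means of data and web mining techniques. The extracted knowledge is deployed for the purpose of offering a personalized and proactive view of the web services to users. We first describe the preprocessing steps on access logs necessary to clean, select and prepare data for knowledge extraction. Then we show two sets of experiments: the first one tries to predict the sex of a user based on the visited web pages, and the second one tries to predict whether a user might be interested in visiting a section of the site.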
Keywords: knowledge discovery, web mining, classification.

1 Introduction

According to [10], Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services. A common taxonomy of web mining defines three main research lines: content mining, structure mining and usage mining. The distinction between those categories is not clear-cut, and very often approaches use a combination of techniques from different categories.

Content Mining covers data mining techniques to extract models from web object contents including plain text, semistructured documents (e.g., HTML or XML), structured documents (digital libraries), dynamic documents, and multimedia documents. The extracted models are used to classify web objects, to extract keywords for use in information retrieval, and to infer the structure of semistructured or unstructured objects.

Structure Mining aims at finding the underlying topology of the interconnections between web objects. The model built can be used to categorize and to rank web sites, and also to find similarities between them.