[Main text]
... by cutting the hierarchical tree at a particular level.

Exclusive versus Overlapping versus Fuzzy. The clusterings shown in the figure are all exclusive, as they assign each object to a single cluster. There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). For instance, a person at a university can be both an enrolled student and an employee of the university. A non-exclusive clustering is also often used when, for example, an object is "between" two or more clusters and could reasonably be assigned to any of these clusters; consider a point halfway between two of the clusters in the figure. Rather than make a somewhat arbitrary assignment of the object to a single cluster, it is placed in all of the "equally good" clusters.

In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn't belong) and 1 (absolutely belongs). In other words, clusters are treated as fuzzy sets. (Mathematically, a fuzzy set is one in which an object belongs to any set with a weight that is between 0 and 1. In fuzzy clustering, we often impose the additional constraint that the sum of the weights for each object must equal 1.) Similarly, probabilistic clustering techniques compute the probability with which each point b...

... new, unlabeled objects are assigned a class label using a model developed from objects with known class labels. For this reason, cluster analysis is sometimes referred to as unsupervised classification.
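To make the fuzzy-clustering constraint described earlier concrete, here is a minimal sketch, not taken from the source, of membership weights that lie between 0 and 1 and sum to 1 for each object. The inverse-distance weighting (the form commonly used in fuzzy c-means), the function name `membership_weights`, the fuzzifier `m`, and the sample coordinates are all illustrative assumptions.

```python
import math

def membership_weights(point, centroids, m=2.0, eps=1e-12):
    """Return one weight per centroid; weights are in [0, 1] and sum to 1.

    Assumption: weights are proportional to distance ** (-2 / (m - 1)),
    the usual fuzzy c-means form. `m` > 1 controls how soft the assignment is.
    """
    # Clamp distances so a point lying exactly on a centroid does not divide by zero.
    dists = [max(math.dist(point, c), eps) for c in centroids]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    # Normalizing enforces the sum-to-one constraint for this object.
    return [v / total for v in inv]

# Hypothetical data: two cluster centroids and a point halfway between them.
centroids = [(0.0, 0.0), (2.0, 0.0)]
halfway = membership_weights((1.0, 0.0), centroids)
near_first = membership_weights((0.1, 0.0), centroids)
```

A point halfway between the two centroids receives equal weight in both clusters, which is the "equally good" situation that an exclusive clustering would have to resolve arbitrarily.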
When the term classification is used without any qualification within data mining, it typically refers to supervised classification.

Also, while the terms segmentation and partitioning are sometimes used as synonyms for clustering, these terms are frequently used for approaches outside the traditional bounds of cluster analysis. For example, the term partitioning is often used in connection with techniques that divide graphs into subgraphs and that are not strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques ...

... a data object that is representative of the other objects in the cluster. These cluster prototypes can be used as the basis for a number of data analysis or data processing techniques. Therefore, in the context of utility, cluster analysis is the study of techniques for finding the most representative cluster prototypes.

• Summarization. Many data analysis techniques, such as regression or PCA, have a time or space complexity of O(m^2) or higher (where m is the number of objects), and thus are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used.

• Compression. Cluster prototypes can also be used for data compression.
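In particular, a table is created that consists of the prototypes for each cluster ...

The functions of the Weibo public-opinion management platform have been implemented in preliminary form. The platform is of considerable value for opinion polling, public-opinion monitoring, and information acquisition. Adding a semantic analysis module can greatly improve the accuracy of early warning and make early warning possible for media-mode Weibo messages.

Existing techniques for topic classification were studied; the approaches and methods among them that remain applicable to text orientation (sentiment) classification were listed by category, and the main technical bottlenecks they face were summarized. By enumerating the characteristics of the objects handled in text orientation classification, the issues and factors that deserve particular attention when building an orientation classifier were also summarized.

For example, U.S. Patent No. 4930077 proposed a method for predicting public opinion through text analysis; the SDA project of the Social Science Computing Laboratory at the University of California, Berkeley, mainly performs automatic analysis of web page data; and in China, Founder Zhisi (方正智思) is a software development kit and service system for Chinese intelligent information mining and knowledge management, developed by the Peking University Founder Technology Research Institute on the basis of its years of experience in Chinese information processing. An online public-opinion analysis system, by contrast, has the computer collect data dynamically and analyze it automatically to produce public-opinion analysis results.

The trend analysis module will then be improved. The preliminary plan is to classify Weibo propagation trend analysis by mode: the Weibo influencer mode keeps the current module, while the media mode needs its parameters reset and revised. At the same time, text orientation analysis, that is, a semantic analysis module, will be added to the trend analysis module to improve the accuracy of Weibo analysis and to grade hotspots by intensity. Because of technical limitations, monitoring of nationwide hot topics that spread especially widely was not achieved; we hope to improve the mining algorithm so that the characteristics of such hotspots can be analyzed and modeled. If Chinese semantic analysis were implemented, so that Weibo sentences or words could be fully parsed, it would be of great significance for sensitive-topic identification and Weibo public-opinion trend analysis. In terms of system functionality, what has been implemented first is hotspot trend analysis for specified Weibo content; discovering hotspots from all real-time Weibo information has not been implemented, and the web crawling technique needs improvement.

The greatest difference between Weibo public-opinion information and ordinary text lies in its diffusion and uncontrollability; the content is highly dynamic. A preliminary idea is to add a repost-rate comparison for Weibo messages spread by media-verified opinion leaders, but no clear relationship between hotspots and repost rate has yet been found in the available data.

Weibo topic | Hotspot warning result | Final trend of message | Comparison
Li Keqiang to deliver a speech at the site of the Potsdam Conference | Yellow | Blue | Incorrect
Government unit in Yancheng, Jiangsu spends 270,000 on food and drink | Yellow | Blue | Incorrect
A post office in Nanjing forcibly demolished | Orange | Orange | Correct
Li Keqiang: no matter how busy, make time to read | Blue | Blue | Correct
Egyptian relief carved with "Ding Jinhao was here" | Orange | Red | Correct
People's Daily: why rural children are unwilling to leap the "Dragon Gate" | Yellow | Blue | Incorrect
People's Daily Overseas Edition: real estate developers pleading poverty are acting cute and naive | Blue | Yellow | Incorrect
Zhengzhou rainstorm | Yellow | Yellow | Correct
 | Blue | Blue | Correct
Nutrition meals: spoiled food, shrinking nutrition, constant problems | Blue | Yellow | Incorrect
Chen Peisi's comments on online public opinion | Yellow | Yellow | Correct
Wang Shi: patriotism and nationalism | Yellow | Yellow | Correct
Galaxy SOHO environmental monitoring | Blue | Blue | Correct
Primary school pupil's essay "Parking" | Blue | Blue | Correct
Esports Haitao: IG loses to LGD in the G1 League | Yellow | Yellow | Correct

Figure 48 Analysis results

Comparison with the actual outcomes shows that, in the Weibo influencer mode, trend-analysis warnings that a message will become a hotspot are about 80% accurate; even the messages that did not become hotspots of the day were topics that ranked fairly high in attention. The first 10 entries are Weibo messages in the media propagation mode. Of these, five became hotspots: "Government unit in Yancheng, Jiangsu spends 270,000 on food and drink", "A post office in Nanjing forcibly demolished", "Egyptian relief carved with 'Ding Jinhao was here'", "People's Daily Overseas Edition: real estate developers pleading poverty are acting cute and naive", and "Zhengzhou rainstorm". The rest did not. The trend analysis module failed to predict two of the hotspots, "Jiangsu government unit food and drink" and "People's Daily Overseas Edition: real estate developers pleading poverty are acting cute and naive", and wrongly predicted "Li Keqiang to deliver a speech" and "People's Daily: why rural children are unwilling to leap the 'Dragon Gate'" as hotspots, so the accuracy was only 60%.

If values above 10,000 also occur twice or more, the message is classified as orange level, meaning that it is likely to become a multi-day hotspot.

The two modes have one thing in common: in both, the propagation volume surges within a short time before the message becomes a hotspot. The trend analysis module was designed around this property. Starting from the earliest opinion leader, each time an opinion leader appears, the module extracts the propagation breadth of the opinion leaders in the following hour and compares it against different thresholds on the M value (the Weibo propagation volume within one hour). Past data suggest the following: an M value below 5,000 is blue level and poses essentially no threat; an M value between 10,000 and 50,000 is yellow level, requires attention, and has a high probability of becoming a hotspot; an M value above 50,000 will certainly become a hotspot. The duration, however, cannot yet be graded effectively; that is, orange and red levels cannot yet be distinguished reliably, although hotspots can already be told apart from non-hotspots.

After the propagation characteristics of Weibo messages that had become hotspots were collated, two hotspot propagation models emerged: one is the propagation pattern of accounts set up on Weibo by traditional media, and the other is the message propagation pattern of Weibo influencers. Figure 45 shows the repost-count-over-time curves for the two patterns. The example chosen for the media mode is the Southern Weekly (南方周末) message "A Chinese citizen killed in the Boston bombing", and the example for the Weibo influencer mode is "A saying by the Chinese classics master Liu Wendian". Figures 46 and 47 show the propagation-volume-over-time curves. The Southern Weekly message was reposted 997 times within one day and reached nearly 5 million users, of whom nearly 4.5 million were Southern Weekly's own followers; the repost rate was very low, but the reach was wide, so it was still a hotspot. The Weibo influencer mode is different: the message was reposted 724 times and eventually reached nearly 100,000 users, becoming a hotspot within a certain range. Its propagation-over-time curve looks very different from that of Southern Weekly, which starts from a very high base: it has a higher repost rate, and although its breadth falls short of Southern Weekly's, it too became a hot topic.

... Weibo users pay close attention to the issue, it spreads quickly, its influence has reached a very wide range, and it may become a multi-day hotspot; Red level (Level I): a public-opinion event has emerged.

... Weibo users pay little attention to the issue, it spreads slowly, its influence is confined to a small range, and it has no chance of becoming a hotspot of the day; Yellow level (Level III): a public-opinion event has emerged.

Setting the online public-opinion warning levels. Taking into account international practice, the regulations of the relevant agencies in China, and the development trend of Weibo public opinion, the warning levels for Weibo public opinion are divided into four: light alert (Level IV, abnormal), moderate alert (Level III, warning), severe alert (Level II, dangerous), and extremely severe alert (Level I, extremely dangerous), represented by blue, yellow, orange, and red respectively.

Early warning embodies dynamic awareness, while contingency plans embody static prevention.

The reason is that the causal relationships among the internal and external factors influencing natural phenomena are relatively well determined, and these phenomena have been observed and measured over long periods, which provides a good quantitative basis and makes early warning straightforward.

The concept of early warning originated in the study of the precursors of major natural disasters.

A user with a large repost count does not necessarily have a large influence. We also found that the two rankings, one produced by the WeiboRank algorithm and one by users' followers, have fairly similar coverage in terms of the number of people reached, which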