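Shanghai University Doctoral Dissertation, April 2008

Chinese Library Classification No.:
Institution code: 10280
Confidentiality level:
Student ID: 05720159

Master's Dissertation
SHANGHAI UNIVERSITY
MASTER DISSERTATION

Title: Application of Machine Learning Algorithms in Bioinformatics

Author: Jin Yuhuan
Major: Physical Chemistry
Supervisor: Prof. Lu Wencong
Completion date: May 2008

Shanghai University Master's Dissertation, May 2008

Shanghai University
This dissertation has been examined by all members of the defense committee and is confirmed to meet the quality requirements for a master's dissertation at Shanghai University.
Defense committee signatures:
Chair:
Members:
Supervisor:
Date of defense:

Statement of Originality
I declare that the dissertation submitted here is research work carried out by me under the guidance of my supervisor. Except where specifically cited and acknowledged in the text, the dissertation contains no research results published or written by others. Any contribution made to this research by colleagues who took part in the same work has been clearly stated and acknowledged in the dissertation.
Signature:
Date:

Authorization for Use of This Dissertation
I fully understand the regulations of Shanghai University on the retention and use of dissertations, namely: the university has the right to retain the dissertation and to submit copies of it, and to allow it to be consulted and borrowed; the university may publish all or part of its contents. (A confidential dissertation is subject to these regulations once it has been declassified.)
Signature:
Supervisor's signature:
Date:

Shanghai University Master of Science Dissertation
Application of Machine Learning Algorithms in Bioinformatics
Name: Jin Yuhuan
Supervisor: Prof. Lu Wencong
Major: Physical Chemistry
College of Sciences, Shanghai University
May 2008

A Dissertation Submitted to Shanghai University for the Master's Degree in Science
Using Machine Learning Methods in Bioinformatics
M. D. Candidate: Jin Yuhuan
Supervisor: Prof. Lu Wencong
Major: Physical Chemistry
Science College, Shanghai University
May, 2008

Abstract (Chinese version)
In the late 20th century, genomics research on humans and other species developed rapidly, biological information grew at an astonishing rate, and advances in biotechnology greatly enriched the data resources of the life sciences. This rapid expansion of data has forced researchers to seek powerful tools that apply new techniques to store, manage, analyze and study the massive, complex body of biological information, and to organize these data so that they can be stored, processed and further used, managed effectively, interpreted accurately and exploited fully.
This work applies machine learning methods to the analysis and processing of bioinformatics data. The main work consists of three parts:
1. Prediction of protein subcellular localization using ensemble learning algorithms. The subcellular location of a protein is one of its important properties and indicates its function in the cell. Predicting protein subcellular location plays an important role in genome annotation and drug design. In this work, protein sequences were encoded by their amino acid composition, and the two most important ensemble learning algorithms, AdaBoost and Bagging, were used to model the training dataset. During modeling, four different weak classifiers were tried for training, and the modeling parameters were optimized according to the cross-validation results. The results show that AdaBoost gave its best model when the random forest algorithm was used as the weak classifier ([…]%; […]%). The trained prediction models were then validated on an independent test set ([…]%, […]%), giving better results than the SVM method ([…]%, […]%).
2.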
QSAR study of 1-phenyl[2H]-tetrahydro-triazine-3-one analogues using support vector regression (SVR). These analogues can be used as 5-lipoxygenase inhibitors. In this work, 12 topological indices taken from the literature and 17 physicochemical parameters calculated with HyperChem served as the initial molecular descriptors. Variable selection based on SVR leave-one-out cross-validation then yielded 8 molecular descriptors, which were used to build the prediction model. The leave-one-out cross-validation RMSE (root-mean-square error) of this model was […]. For comparison, the RMSE values of multiple linear regression (MLR), partial least squares (PLS) and artificial neural network (ANN) were […]; the values for SVM compared with MLR, PLS, […] were […].
3. A server design approach based on the MVC architecture is proposed, and an online prediction server built on the obtained models was established. The purpose of building bioinformatics prediction models is to provide tools for predicting unknown objects in biological data, so that the predictions can be used by others. To better achieve this purpose and make the models obtained in this research available to all researchers in related fields, building an online prediction server is an effective approach.
Keywords: bioinformatics, quantitative structure-activity relationship (QSAR), machine learning, ensemble learning, support vector machine (SVM), support vector regression (SVR), AdaBoost, Bagging, subcellular localization, 5-lipoxygenase inhibitors, online prediction server

Abstract
In the late 20th century, genomics research on humans and other living species developed rapidly, and the volume of biological information increased at a surprising speed. The information sources of bioscience were greatly enriched by bioscience techniques. This rapid expansion forces people to search for a powerful and effective tool that uses new techniques for the storage, management, analysis and study of the mass of complex biological information, and that organizes these data for better storage, processing and use.
Machine learning methods were used to analyze and process biological information data in this work. The main work of the paper contains three parts:
1. Using ensemble learning algorithms to study the prediction of protein subcellular localization. Protein subcellular localization, which tells where a protein resides in a cell, is an important characteristic of a protein and relates closely to its function. The prediction of subcellular localization plays an important role in the prediction of protein function, genome annotation and drug design.
In this work, the sequences were encoded based on their amino acid composition, and the models were built using AdaBoost and Bagging, the two most important ensemble learning algorithms. During the modeling process, four different weak classifiers were used on the training data, and the modeling parameters were optimized based on the cross-validation results of the models. As a result, AdaBoost gave its best model, with a correct rate of […]% in cross-validation, when the random forest algorithm was selected as the weak classifier. Bagging gave its best model, with a correct rate of […]% in cross-validation, when KNN was selected as the weak classifier. An independent dataset test was then used to validate the trained models; the prediction correct rates of AdaBoost and Bagging were […]% and […]%. As a comparison, SVM was used, giving a correct rate of […]% in training cross-validation and […]% in the independent dataset test.
2. Using the support vector regression (SVR) algorithm for a QSAR study of 1-phenyl[2H]-tetrahydro-triazine-3-one analogues. These analogues can be used as 5-lipoxygenase inhibitors. In this work, 12 topological indices and 17 physicochemical parameters calculated with HyperChem were used as the original molecular descriptors. The descriptors were then filtered based on SVR leave-one-out cross-validation. As a result, 8 descriptors were selected to build the prediction model. The RMSE of this model in leave-one-out cross-validation was […]. As a comparison, the RMSE values of multiple linear regression (MLR), partial least squares (PLS) and artificial neural network (ANN) were […], […] and […], respectively. The SVR, MLR, PLS and ANN models were also tested on independent datasets to demonstrate their generalization ability, and the resulting RMSE values were […], […], […] and […], respectively.
3. Building an online prediction server based on the obtained models.
The aim of building bioinformatics prediction models is to supply a tool for predicting unknowns in biological data and to make that information useful to others. Building an online prediction server is an effective way to do this, because models available online can be used by experimental researchers. In this work, a server design based on the MVC architecture is proposed, which could increase the efficiency of building a series of online prediction servers.
Keywords: bioinformatics, quantitative structure-activity relationship (QSAR), machine learning, integrated s