[Main text]
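” - Lin Chia-Chiao (林家翹), "Extension of Applied Mathematics: Illustrated by a Paper on the Development of a Kinetic Theory of the Structure and Function of Protein Molecules" (Advances in Mechanics (力學進展), 2022, No. 2)

Reflections and insights.

Automatic prediction system for prokaryotic genes

1. EDP model: characterizes the overall coding potential and similarity of ORF sequences.
   - An EDP model was developed for genomes with high GC content.
2. TIS model: characterizes the complex sequence features of the region upstream of a gene.
   - It is developed from the RBS model.
   - It defines three mechanisms of translation initiation.
   - It characterizes the complexity of translation initiation signals.
   - It takes into account the features of structural gene clusters.
   - It takes into account the sequence features of genomes of high-GC species.
3. Combining the EDP and TIS models, an unsupervised, self-learning gene prediction system, MED, was developed.

[Figure: MED flowchart]

[Figure: MED model parameters reveal how the mechanisms regulating transcription and translation in genomes evolve with the evolutionary complexity of organisms. Panels: Naneq, archaea, eukaryotes, bacteria. Labels: translation regulatory signals, transcription regulatory signals.]

Features of the MED method

• It has fewer free parameters (~10^2) than traditional HMM methods, so it depends less on the training set.
  - HMM: ~10^4 free parameters (e.g., the GeneMark system).
• It learns iteratively by itself and uses far fewer empirical and preset parameters than other methods.
• It favors genome analysis and annotation for newly sequenced species.
• Its prediction accuracy matches, and in part exceeds, that of GeneMark, Glimmer, and others.
• Its model parameters have a clear biological meaning, which helps in understanding the complex structural information in genomes.

“In fact, the great success of the Human Genome Project has shown that traditional applied mathematicians, who routinely use partial differential equations to treat problems in continuum mechanics, are not familiar with the mathematical methods used in that project.

(Zhu et al., 2022) Postprocessor for MED

Zhang Chunting (張春霆): a renowned Chinese bioinformatician at Tianjin University, academician of the Chinese Academy of Sciences and of the Third World Academy of Sciences.

RBS model of prokaryotic gene structure

Why accurate gene prediction matters:
  - It helps in studying the products of gene expression (proteins, functional RNAs).
  - It helps in understanding the mechanisms of gene transcription and translation.
Improving the accuracy of translation initiation site prediction is the key to accurate gene prediction.

Difficulties in predicting prokaryotic gene start sites

  - Data sets for training are lacking: genes with experimentally confirmed start sites are far too few.
  - The sequence features associated with translation initiation are weak: initiation mechanisms are diverse and complex, and the sequence signals are ambiguous.

Methods for predicting gene translation initiation sites (TIS)

• RBSfinder (Salzberg et al., 2022):
  - Takes an entire genomic sequence and a first-pass annotation as input to train a probabilistic model that scores candidate RBSs surrounding previously annotated start codons.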
• GSfinder (Zhang et al., 2022):
  - Introduces six recognition variables that describe, respectively, the consensus signals (i.e., the SD sequences) in the vicinity of gene starts, the coding potential of the DNA sequence near the start codon, the start codon itself, and the distance from the leftmost start codon to the candidate start codon.
  - The former four variables are derived with the Z-curve method, while the latter two are given as empirical constants or formulas.

MEDStart: Accuracy Improvement for Identifying TIS in Microbial Genomes (Zhu et al., 2022)

Protein Synthesis in Bacteria

Figure: Ribosome-binding sites on mRNA can be recovered from initiation complexes. They include the upstream Shine-Dalgarno sequence and the initiation codon. (From Gene VIII)

A four-component statistical model characterizing the TIS of prokaryotic genes:

P1: the correlation between the translation termination site and the TIS of a gene
P2: the sequence content around the start codon
P3: the sequence content of the consensus signal related to the RBS
P4: the correlation between the TIS and the upstream consensus signal

[Figure: gene schematic marking two candidate ATG start codons, the stop codon (STP), and the regions described by P1 to P4, with example sequences …CCC TCGAAGC… ATG, …AACAGGAGGATT…, and …AGGATT…]

MEDStart, a self-learning iterative system

Implementation of the MEDStart algorithm

(1). Finding candidate motifs in upstream regions of predicted coding ORFs

• Motif (l, d):
  - Motif: a subsequence that is well preserved across several sequences; its occurrences in those sequences are called instances.
  - Motifs in DNA or protein sequences may indicate functional connections, such as transcription factor binding sites in noncoding regions of genes, or the RBS in prokaryotes.
  - The term (l, d) motif refers to a consensus string of length l, without wildcards, whose instances differ from the consensus in at most d positions.
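The (l, d) definition above can be illustrated with a short sketch. This is not the MEDStart implementation; it only shows one plausible way to test whether a window is an instance of a consensus, using the Hamming distance. The function names `hamming` and `find_instances` are hypothetical, and the consensus `AGGAG` is an assumed SD-like word chosen for illustration. The example sequence is the upstream fragment shown in the slide figure.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_instances(consensus: str, sequence: str, d: int) -> List[Tuple[int, str]]:
    """Return (offset, window) for every window of len(consensus) in
    `sequence` that differs from `consensus` in at most d positions,
    i.e. every instance of the (l, d) motif with l = len(consensus)."""
    l = len(consensus)
    hits = []
    for start in range(len(sequence) - l + 1):
        window = sequence[start:start + l]
        if hamming(consensus, window) <= d:
            hits.append((start, window))
    return hits


# Upstream fragment taken from the slide figure; the consensus is an assumption.
upstream = "AACAGGAGGATT"
print(find_instances("AGGAG", upstream, 0))  # [(3, 'AGGAG')]
print(find_instances("AGGAG", upstream, 1))  # [(3, 'AGGAG'), (6, 'AGGAT')]
```

With d = 0 only the exact word is reported; relaxing to d = 1 also picks up the neighbouring window `AGGAT`, which differs from the consensus in one position.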
• Assume that the SD signal is found in the upstream region of the leftmost start codon.
  - The SD signal tends to be a preserved feature in the upstream regions of bacterial gene starts.
  - Most start codons of the longest ORFs are real gene starts.

Table: Numbers of genes whose starts are the leftmost start codon, for a set of reliable data

Reliable data set                            EcoGene dataset   Link dataset   Bsub1248
Number of genes                              854               195            1248
Number of genes with 5'-most start codons    537 (%)           133 (%)        786 (%)

• We first search for (l, d) strings within L bp upstream of the start codon of the longest ORF in the original annotation (the default values are l = 5, d = 0, L = 20).
  - To remove many false positives, the initial search is restricted to ORFs longer than 300 bp.
  - For instance, a (5, 0) string is a word of 5 letters, with zero variation, that appears in many sequences within 20 bp upstream of the start codons.
• We select the several strings with the highest frequency of occurrence as the candidate motifs.
  - In the next iteration step, the search for candidate motifs is conducted within the L bp upstream of the adjusted start sites, which need not be the start codons of the longest ORFs.
  - The training sequences, i.e., the L-bp-long upstream regions of the start sites of all training ORFs, are updated until the iteration converges.

(2). Determining hit motifs and their alignment weight matrix

• For each candidate motif, search for its (l, 1) instances.
  - They are regarded as candidates for SD-signal-like substrings.
• Calculate the distribution of the location of each occurring instance relative to the start codon; this is referred to as the spacer distribution.
•
$$\sigma_k^2 = \frac{1}{L-l}\sum_{i=l}^{L}\left(p_i^{(k)} - \bar{p}^{(k)}\right)^2$$

$p_i^{(k)}$ …