Phrase-level? Generate from meaning? Reinforcement learning? Re-ranking?

What kinds of resources are available to MT?
- Translation lexicon: bilingual dictionary
- Templates, transfer rules: grammar books
- Parallel data, comparable data
- Thesaurus, WordNet, FrameNet, …
- NLP tools: tokenizer, morphological analyzer, parser, …
- More resources exist for major languages, fewer for "minor" languages.

Major approaches
- Transfer-based
- Interlingua
- Example-based (EBMT)
- Statistical MT (SMT)
- Hybrid approaches

The MT triangle
[Figure: a triangle running from "word" at the base to "meaning" at the apex; word-based SMT/EBMT sit at the bottom, phrase-based SMT/EBMT and transfer-based MT higher up, and interlingua (meaning) at the top.]

Transfer-based MT
- Analysis, transfer, generation:
  1. Parse the source sentence.
  2. Transform the parse tree with transfer rules.
  3. Translate the source words.
  4. Read the target sentence off the tree.
- Resources required:
  – a source-language parser
  – a translation lexicon
  – a set of transfer rules
- An example: "Mary bought a book yesterday."

Transfer-based MT (cont)
- Parsing: a linguistically motivated grammar or a formal grammar?
- Transfer: context-free rules? Additional constraints on the rules? Apply at most one rule at each level? How are the rules created?
- Translating words: word-to-word translation?
- Generation: using a language model or other additional knowledge?
- How can the needed resources be created automatically?

Interlingua
- For n languages, we need n(n−1) MT systems.
- Interlingua uses a language-independent representation.
- Conceptually, interlingua is elegant: we only need n analyzers and n generators.
- Resources needed:
  – a language-independent representation
  – sophisticated analyzers
  – sophisticated generators

Interlingua (cont)
- Questions:
  – Does a language-independent meaning representation really exist? If so, what does it look like?
  – It requires deep analysis: how do we get such an analyzer (e.g., a semantic analyzer)?
  – It requires non-trivial generation: how is that done?
  – It forces disambiguation at various levels: lexical, syntactic, semantic, and discourse.
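The four-step analysis–transfer–generation pipeline above can be sketched on the slide's own example, "Mary bought a book yesterday." Everything here is a made-up toy (the hand-built parse, the SVO-to-verb-final transfer rules, and the small English-to-Japanese lexicon), not any real system's grammar:

```python
# Toy transfer-based MT: analysis -> transfer -> generation.

# Step 1 (analysis): a hand-built parse of the source sentence;
# a real system would run a source-language parser here.
parse = ("S",
         ("NP", "Mary"),
         ("VP", ("V", "bought"), ("NP", "a", "book")),
         ("ADV", "yesterday"))

# Step 2 (transfer): rules that reorder the tree, e.g. English
# SVO order into a verb-final order as in Japanese.
def transfer(tree):
    if isinstance(tree, tuple) and tree[0] == "S":
        label, np, vp, adv = tree
        return (label, transfer(np), transfer(adv), transfer(vp))  # S -> NP ADV VP
    if isinstance(tree, tuple) and tree[0] == "VP":
        label, v, np = tree
        return (label, transfer(np), transfer(v))                  # VP -> NP V
    if isinstance(tree, tuple):
        return tuple(transfer(t) for t in tree)
    return tree

# Step 3 (lexical translation): a tiny bilingual lexicon
# (romanized Japanese; "a" has no counterpart and maps to "").
lexicon = {"Mary": "Mary", "bought": "katta", "a": "",
           "book": "hon-o", "yesterday": "kinou"}

# Step 4 (generation): read the target sentence off the tree.
def generate(tree):
    if isinstance(tree, tuple):
        words = [generate(t) for t in tree[1:]]  # skip the node label
        return " ".join(w for w in words if w)
    return lexicon.get(tree, tree)

print(generate(transfer(parse)))  # -> Mary kinou hon-o katta
```

This also makes the slide's open questions concrete: each `transfer` branch is one hand-written context-free rule, and step 4 uses no language model, only the tree order.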
  – It cannot take advantage of similarities between a particular language pair.

Example-based MT
- Basic idea: translate a sentence by using the closest match in the parallel data.
- First proposed by Nagao (1981).
- Ex:
  – Training data:
    w1 w2 w3 w4 ↔ w1′ w2′ w3′ w4′
    w5 w6 w7 ↔ w5′ w6′ w7′
    w8 w9 ↔ w8′ w9′
  – Test sentence:
    w1 w2 w6 w7 w9 → w1′ w2′ w6′ w7′ w9′

EBMT (cont)
- Types of EBMT:
  – lexical (shallow)
  – morphological / POS analysis
  – parse-tree based (deep)
- Types of data required by EBMT systems:
  – parallel text
  – bilingual dictionary
  – thesaurus for computing semantic similarity
  – syntactic parser, dependency parser, etc.

EBMT (cont)
- Word alignment: using a dictionary and heuristics
- Exact match
- Generalization:
  – Clusters: dates, numbers, colors, shapes, etc.
  – Clusters can be built by hand or learned automatically.
- Ex:
  – Exact match: "12 players met in Paris last Tuesday" → "12 Spieler trafen sich letzten Dienstag in Paris"
  – Templates: "$num players met in $city $time" → "$num Spieler trafen sich $time in $city"

Statistical MT
- Basic idea: learn all the parameters from parallel data.
- Major types:
  – word-based
  – phrase-based
- Strengths:
  – easy to build; it requires no human knowledge
  – good performance when a large amount of training data is available
- Weaknesses:
  – How to express linguistic generalizations?

Comparison of resource requirements

                     Transfer-based   Interlingua      EBMT        SMT
  dictionary               +               +             +
  transfer rules           +
  parser                   +               +           + (?)
  semantic analyzer                        +
  parallel data                                          +           +
  others                              universal      thesaurus
                                      representation

Hybrid MT
- Basic idea: combine the strengths of the different approaches:
  – Syntax-based: generalization at the syntactic level
  – Interlingua: conceptually elegant
  – EBMT: memorizing translations of n-grams; generalization at various levels
  – SMT: fully automatic; optimizing some objective function
- Type
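The EBMT generalization step above (replacing cluster members with variables like $num, $city, $time) can be sketched as template matching. The clusters, the template pair, and the slot lexicon below are hand-built illustrations; a real system would induce them from parallel data:

```python
import re

# Clusters generalize an exact match: each variable matches one word class.
clusters = {
    "num":  r"\d+",
    "city": r"(?:Paris|Berlin|London)",
    "time": r"(?:last \w+|yesterday|today)",
}

# The translation template pair from the slide.
src_template = "$num players met in $city $time"
tgt_template = "$num Spieler trafen sich $time in $city"

# Translations for slot fillers that are not invariant
# (numbers and city names often pass through unchanged).
slot_lexicon = {"last Tuesday": "letzten Dienstag"}

def template_to_regex(template):
    """Turn a template into a regex with one named group per $variable."""
    parts = []
    for tok in template.split():
        if tok.startswith("$"):
            name = tok[1:]
            parts.append(f"(?P<{name}>{clusters[name]})")
        else:
            parts.append(re.escape(tok))
    return re.compile(r"\s+".join(parts) + r"$")

def translate(sentence):
    """Match the source template and fill the target template's slots."""
    m = template_to_regex(src_template).match(sentence)
    if m is None:
        return None  # no template applies; fall back to the closest example
    out = tgt_template
    for name, value in m.groupdict().items():
        out = out.replace("$" + name, slot_lexicon.get(value, value))
    return out

print(translate("12 players met in Paris last Tuesday"))
# -> 12 Spieler trafen sich letzten Dienstag in Paris
```

One template now covers every sentence whose slot fillers belong to the right clusters, which is exactly the gain over the single exact-match pair.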