corpus data
  – If the distribution of linguistic features is predetermined when the corpus is designed, there is no point in analyzing such a corpus to discover naturally occurring linguistic feature distributions

Criteria for text selection
• Time?
  – If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston 2021)
• The relevance of permanence in corpus design actually depends on how we view a corpus: as a static or a dynamic language model
  – Static model: sample corpora (nearly all existing corpora, e.g. BNC, LOB/FLOB)
  – Dynamic model: monitor corpora (e.g. Bank of English)
• Tips
  – “Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.” (Sinclair 2021)

Corpus balance
• A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration
• The proportions of different kinds of text it contains should correspond with informed and intuitive judgements
• There is no scientific measure of balance, just a best guess
• The acceptable balance is determined by the intended use, i.e. your research questions

The BNC model
• Generally accepted as being a balanced corpus
• Has been followed in the construction of a number of corpora
• 4,124 texts (including transcripts of recordings)
• ca. 100 million words: 90% written + 10% spoken
• Three criteria for Written
  – Domain: the content type (i.e. subject field)
  – Time: the period of text production
  – Medium: the type of text publication (books, periodicals, etc.)
• Two criteria for Spoken
  – Demographic: informal conversations by speakers selected by age group, sex, social class and geographical region
  – Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in 4 broad context categories

Written BNC

Spoken BNC

BNC vs. balance
• The design criteria of the BNC illustrate the notion of corpus balance very well
  – “In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution. Thus, having chosen to sample such things as popular novels, or technical writing, bestseller lists and library circulation statistics were consulted to select particular examples of them.” (Aston and Burnard 1998: 28)

Pragmatics in corpus design
• “Most general corpora of today are badly balanced because they do not have nearly enough spoken language in them; estimates of the optimal proportion of spoken language range from 50% - the neutral option - to 90%, following a guess that most people experience many times as much speech as writing” (Sinclair 2021)
• The written BNC is nine times as large as the spoken BNC
  – Is speech less frequent or less important than writing?
• Absolutely not!
• …but writing typically has a larger audience than speech
• …also, collecting spoken data costs 10 times as much as collecting written data
• …it takes 10 hours to transcribe one hour of recording
• Pragmatic considerations also mean that balance is a more important issue for a static sample corpus than for a dynamic monitor corpus
  – As a monitor corpus is frequently updated, it is usually “impossible to maintain a corpus that also includes text of many different types, as some of them are just too expensive or time consuming to collect on a regular basis.” (Hunston 2021: 30-31)

Corpus balance: Some tips
• “The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.” (Sinclair 2021)
• “It would be shortsighted indeed to wait until one can scientifically balance a corpus before s