Corpus design and types of corpora
Corpus Linguistics
Richard Xiao

Outline of the session
• Corpus design issues
  – Corpus representativeness
  – Corpus balance
  – Sampling
  – Corpus size
  – Types of corpora
• Introducing some well-known English corpora of different types

Representativeness
• A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety
• A corpus is different from a random collection of texts or an archive
• Representativeness is a defining feature of a corpus
• As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness

Some definitions …
• “generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type” (Leech 1992: 116)
• “…selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 1996)
• “A well-organized collection of data” (McEnery 2003)
• “gathered according to explicit design criteria” (Tognini-Bonelli 2001: 2)
• “built according to explicit design criteria for a specific purpose” (Atkins et al. 1992)
• texts selected and put together “in a principled way” (Johansson 1998: 3)

What is representativeness?
• “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety” (Leech 1991)
• Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993)

What is representativeness?
• Representativeness is a fluid concept closely related to your research questions
  – If you want a corpus which is representative of general English, a corpus representative of newspapers will not do
  – If you want a corpus representative of newspapers, a corpus representative of The Times will not do

Two types of representativeness
• The representativeness of general corpora and that of (domain- or genre-specific) specialized corpora are measured in different ways
  – General corpora
    • Balance: the range of genres included in a corpus and their proportions
    • Sampling: how the text chunks for each genre are selected
  – Specialized corpora
    • Degree of closure/saturation: closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point, i.e. the curve of lexical growth is flattening out

Why should we care about representativeness?
• Reader of corpus-based studies (assessment)
  – To interpret the results of corpus research with caution, considering whether the corpus data and the method used in the study were appropriate
• Corpus user (assessment)
  – Important to “know your corpus”
  – To decide whether a given corpus is appropriate for their specific research question
  – To make appropriate claims on the basis of such a corpus
• Corpus creator (assessment?)
  – To make their corpus as representative as possible of the language (variety) it is claimed to represent
  – To document design criteria explicitly and make the documentation available to corpus users

Criteria for text selection
• The criteria used to select texts for a corpus are principally external
  – The distinction between external and internal criteria corresponds to Biber's (1993: 243) situational vs. linguistic perspectives
• External criteria are defined situationally, irrespective of the distribution of linguistic features
• Internal criteria are defined linguistically, taking into account the distribution of such features
• It is circular to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of