Corpora and Language Learners
Corpora and Language Learners
Studies in Corpus Linguistics

Studies in Corpus Linguistics aims to provide insights into the way a corpus can be used, the type of findings that can be obtained, the possible applications of these findings as well as the theoretical changes that corpus work can bring into linguistics and language engineering. The main concern of SCL is to present findings based on, or related to, the cumulative effect of naturally occurring language and on the interpretation of frequency and distributional data.

General Editor
Elena Tognini-Bonelli

Consulting Editor
Wolfgang Teubert

Advisory Board
Michael Barlow (Rice University, Houston)
Graeme Kennedy (Victoria University of Wellington)
Robert de Beaugrande (Federal University of Minas Gerais)
Geoffrey Leech (University of Lancaster)
Douglas Biber (Northern Arizona University)
Anna Mauranen (University of Tampere)
Chris Butler (University of Wales, Swansea)
John Sinclair (University of Birmingham)
Sylviane Granger (University of Louvain)
Piet van Sterkenburg (Institute for Dutch Lexicology, Leiden)
M. A. K. Halliday (University of Sydney)
Michael Stubbs (University of Trier)
Stig Johansson (Oslo University)
Jan Svartvik (University of Lund)
Susan Hunston (University of Birmingham)
H-Z. Yang (Jiao Tong University, Shanghai)
Volume 17 Corpora and Language Learners Edited by Guy Aston, Silvia Bernardini and Dominic Stewart
Corpora and Language Learners

Edited by
Guy Aston
Silvia Bernardini
Dominic Stewart
University of Bologna at Forlì
John Benjamins Publishing Company Amsterdam/Philadelphia
The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ansi z39.48-1984.
Cover design: Françoise Berserik Cover illustration from original painting Random Order by Lorenzo Pezzatini, Florence, 1996.
Library of Congress Cataloging-in-Publication Data Corpora and language learners / edited by Guy Aston, Silvia Bernardini, Dominic Stewart. p. cm. (Studies in Corpus Linguistics, issn 1388–0373 ; v. 17) Includes bibliographical references and index. 1. Language and languages--Computer-assisted instruction. I. Aston, Guy. II. Bernardini, Silvia. III. Stewart, Dominic. IV. Series. P53.28.C68 2004 418’.0285-dc22 isbn 90 272 2288 6 (Eur.) / 1 58811 574 7 (US) (Hb; alk. paper)
2004057693
© 2004 – John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher. John Benjamins Publishing Co. · P.O. Box 36224 · 1020 me Amsterdam · The Netherlands John Benjamins North America · P.O. Box 27519 · Philadelphia pa 19118-0519 · usa
Contents

Introduction: Ten years of TaLC (D. Stewart, S. Bernardini, G. Aston) 1
A theory for TaLC? The textual priming of lexis (Michael Hoey) 21

Corpora by learners
Multiple comparisons of IL, L1 and TL corpora: The case of L2 acquisition of verb subcategorization patterns by Japanese learners of English (Yukio Tono) 45
New wine in old skins? A corpus investigation of L1 syntactic transfer in learner language (Lars Borin and Klas Prütz) 67
Demonstratives as anaphora markers in advanced learners' English (Agnieszka Leńko-Szymańska) 89
How learner corpus analysis can contribute to language teaching: A study of support verb constructions (Nadja Nesselhauf) 109
The problem-solution pattern in apprentice vs. professional technical writing: An application of appraisal theory (Lynne Flowerdew) 125
Using a corpus of children's writing to test a solution to the sample size problem affecting type-token ratios (N. Chipere, D. Malvern and B. Richards) 137

Corpora for learners
Comparing real and ideal language learner input: The use of an EFL textbook corpus in corpus linguistics and language teaching (Ute Römer) 151
Can the L in TaLC stand for literature? (Bernhard Kettemann and Georg Marko) 169
Speech corpora in the classroom (Anna Mauranen) 195
Lost in parallel concordances (Ana Frankenberg-Garcia) 213

Corpora with learners
Examining native speakers' and learners' investigation of the same concordance data and its implications for classroom concordancing with ELF learners (Passapong Sripicharn) 233
Some lessons students learn: Self-discovery and corpora (Pascual Pérez-Paredes and Pascual Cantos-Gómez) 245
Student use of large, annotated corpora to analyze syntactic variation (Mark Davies) 257

A future for TaLC?
Facilitating the compilation and dissemination of ad-hoc web corpora (William H. Fletcher) 271

Index 301
Bionotes 307
Introduction: Ten years of TaLC

D. Stewart, S. Bernardini, G. Aston
School for Interpreters and Translators, University of Bologna, Italy
1. Looking back over 10 years of TaLC

TaLC was born in 1994, as a result of discussions among members of ICAME (the International Computer Archive of Modern and Medieval English), who realized that there was a growing interest in the use of text corpora in the teaching of languages and linguistics. The first TaLC conference was held in Lancaster in 1994, and its established purpose was well summed up in the announcement of the second conference (Lancaster 1996), which declared: "While the use of computer text corpora in research is now well established, they are now being used increasingly for teaching purposes. This includes the use of corpus data to inform and create teaching materials; it also includes the direct exploration of corpora by students, both in the study of linguistics and of foreign languages."
The 5th TaLC conference, held in the hilltop town of Bertinoro in the summer of 2002, provided an opportunity to reflect on many of the developments that have taken place over the last decade. Perhaps the most striking development concerns the nature of the corpora investigated. Back in 1994, contributors were primarily concerned with what we might term "standard" or "reference" corpora, which were carefully designed to provide representative samples of particular language varieties. Thus there was much quotation of data from the Brown and LOB corpora, which aimed to provide representative samples of American and British written English, and a "bigger the better" enthusiasm for the growing Bank of English and the about-to-be-published British National Corpus. Comparisons of Brown and LOB had stressed the importance of geographical differences, so there was also considerable attention to the International Corpus of English project, with its
attempt to produce corpora for a large number of different varieties.1 There was also attention to domain- and genre-specific corpora, restricted to such areas as the oil industry and newspapers. Ten years later, it is clear that the distinctions now being made have become much more subtle. Geography and topic no longer seem to be the main criteria by which the type of corpora used in TaLC can be distinguished. Many of the papers in this volume, for instance, are concerned with corpora consisting of writing or speech produced by language learners, or of materials written for language learners. The question repeatedly implied is: what kinds of corpora are relevant for teaching?
2. Corpora AND learners

At TaLC 5, Henry Widdowson drew attention to the least prominent part of the TaLC acronym, namely the small "a" of "and". He reminded us that "the conjunction 'and' [is] a very common word, number 3 in most frequency lists, and like most very frequent words, it has multiple functions […]". He pointed out that it is not only "T" and "LC" that matter, but also the way they are related by this small, apparently insignificant conjunction "and". Similarly, the interaction between language corpora "and" language learners may be of different kinds. In this volume we have identified three macro-areas, which appear to lie at the core of current research and applications of corpus linguistics to language teaching. Learners may be the authors or providers of corpus materials, they may be the ultimate beneficiaries of corpus insights, e.g. through the intermediation of the teacher or materials designer, or they themselves may be the main users of a corpus. This volume has accordingly been structured around three main sections, corresponding to these three different functions of the conjunction "and".
2.1 Corpora BY learners

The first section is concerned with corpora consisting of materials produced BY learners. Following the pioneering work by Sylviane Granger and her team in developing the International Corpus of Learner English (ICLE: Granger 1998),2 there has been rapidly growing interest in producing corpora which can be used to study features of interlanguage (often in comparison with the
language produced by native speakers) and to analyse "errors" – the latter raising considerable questions as to identifying and classifying errors, and hypothesising "correct" versions corresponding to the learner's intentions. The general assumption underlying such work is that by identifying features of learner language it may be possible to focus teaching methods and contents more precisely so as to speed acquisition. This section therefore presents a series of studies of learner corpora, dealing with both lexico-syntactic and discoursal aspects of learner language, in almost all cases by means of a comparison with a TL corpus of English. Some contributions, however, add a third variable: a corpus of the learners' L1. This latter category, with which the section begins, thus examines learner language by means of a comparable corpus made up of three subcorpora: the students' L2, L1 and TL. Tono investigates the acquisition of English verb subcategorization frame (SF) patterns on the part of Japanese learners by drawing multiple comparisons between the three corpora comprising his Japanese EFL Learner (JEFFL) corpus. These are: (i) L1 Japanese, made up of newspapers and student compositions, (ii) TL English in the form of ELT textbooks at both junior and senior school level, and (iii) L2 English, i.e., his students' interlanguage (IL), consisting of compositions and picture descriptions produced by students of varying levels of proficiency. The author's aim is to study how various factors, principally the influence of verb SF patterns in Japanese, the degree of exposure to English SF patterns in the foreign language classroom, and the properties of inherent verb meanings in English, can influence the acquisition of such patterns on the part of Japanese learners. Borin and Prütz also investigate aspects of syntax, in this case the frequencies of POS sequences, using a similarly-constructed three-way comparable corpus.
As was the case with Tono, the authors' corpus consists of (i) texts in L1, in this case Swedish (the Stockholm Umeå Corpus of written Swedish), (ii) TL English in the form of the written part of the BNC sampler, and (iii) L2 English (IL), namely the Uppsala Student English Corpus. The three-way comparison favoured by both Tono and Borin and Prütz represents a move away from most other studies of learner language corpora, where the IL has been compared only with TL native-speaker production. The methodology adopted thus reflects a shift of emphasis towards considerations of L1 interference in IL. In the case in point it is claimed by Borin and Prütz that, by comparison with native-speaker English, there is significant overuse or underuse of specific
POS sequences in the IL of advanced Swedish learners of English, and that such discrepancies are due to the influence of L1, inasmuch as Swedish is characterized by POS sequences analogous to the IL. The overuse or underuse of specific elements of usage on the part of learners by comparison with native speakers is also taken up by Leńko-Szymańska, who is one of a number of contributors who prefer a two-way comparison to investigate learner language, i.e., between a learner corpus and a native-speaker corpus. The author uses the PELCRA corpus of learner English (essays written by Polish university students of varying levels of proficiency) and the BNC sampler to identify misuse of demonstratives as anaphora markers on the part of her students, and concludes that native-like use of demonstratives is unlikely to be acquired implicitly by Polish learners, who therefore need specific assistance in this area, particularly in view of the fact that this feature of language is given little emphasis in language programmes and ELT materials. On a broader level Leńko-Szymańska observes that the finer details of many interlanguage problem areas, whether L1 dependent or not, remain unexplored and often not specifically focused upon in class, and that learner corpora must be seen as a vital resource in throwing light upon such details. The methodology of comparing a learner corpus with a native-speaker corpus is also adopted by Nesselhauf as part of her survey of support verb constructions (e.g., give an answer, have a look, make an arrangement) as used by advanced German-speaking learners of English. The analysis takes its data from a subcorpus of ICLE (the International Corpus of Learner English) containing essays written by native speakers of German. The support verb constructions extracted were then judged for acceptability via consultation not only of the written part of the BNC but also of a number of monolingual English dictionaries, along with native speakers where necessary.
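The kind of over- and underuse comparison described here can be illustrated in a few lines of code: the relative frequency of each POS bigram in a learner (IL) corpus is set against its frequency in a native-speaker (TL) corpus, and the differences flagged. This is a hypothetical sketch with invented tag sequences, not the procedure or data of Borin and Prütz.

```python
# Hypothetical sketch of over-/underuse detection: compare the relative
# frequency of each POS bigram in a learner (IL) corpus against a
# native-speaker (TL) corpus. The tag sequences below are invented.
from collections import Counter

def bigram_freqs(tags):
    """Relative frequency of each adjacent POS-tag pair."""
    pairs = list(zip(tags, tags[1:]))
    total = len(pairs)
    return {bg: n / total for bg, n in Counter(pairs).items()}

il_tags = "DET NOUN VERB ADV DET NOUN VERB ADV DET NOUN".split()
tl_tags = "DET ADJ NOUN VERB DET ADJ NOUN VERB ADV ADV".split()

il, tl = bigram_freqs(il_tags), bigram_freqs(tl_tags)
for bg in sorted(set(il) | set(tl)):
    diff = il.get(bg, 0.0) - tl.get(bg, 0.0)
    label = "overuse" if diff > 0 else "underuse" if diff < 0 else "similar"
    print(" ".join(bg), f"{diff:+.2f}", label)
```

In practice such differences would of course be tested for statistical significance over corpora of realistic size before being called overuse or underuse.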
The author identifies constructions which would appear to be particularly problematic for German learners, subsequently suggesting ways in which her results could inform teaching strategies. Nesselhauf is wary, however, of drawing potentially glib conclusions from learner corpus studies. Most of these claim to have implications for language teaching, recommending that whatever is discovered to deviate significantly from native-speaker usage should be prioritized in the classroom. Yet by endorsing this view, the author argues, learner corpus researchers expose themselves to the kind of criticism that NS corpus analyses have encountered for some time now, i.e., that they rely exclusively and unimaginatively on frequency counts in order to reach their conclusions about what learners should be taught. Frequency is a crucial criterion, the author continues, but needs to be refined and elaborated within a more ample framework of associated criteria such as (i) the language variety the learners aim to acquire, (ii) text typology, (iii) the degree of disruption provoked for the recipient by inappropriate usage, and (iv) the frequency of those features of language that learners appear to find particularly useful. Flowerdew continues the series of papers which offer results stemming from comparisons between corpora of IL and TL. The author gives priority to discoursal aspects, focusing on problem-solution patterns used in technical reports by (i) apprentices and (ii) professionals. Salient lexis present in such patterns is identified and classified within the Hallidayan-based APPRAISAL framework, which is concerned on the Interpersonal level with the way language is used to evaluate and manage positionings. This contribution represents a departure from previous studies in two ways: firstly in its choice of raw materials, since APPRAISAL surveys to date have been applied mainly to media discourse, casual conversation and literature, and secondly within corpus linguistics itself, considering that problem-solution patterns have received scarce attention in corpus-based research by comparison with other areas of text linguistics. The contribution by Chipere, Malvern and Richards, which concludes the first section of the volume, also discusses a learner corpus, but with some important differences. In the first place the learners are native speakers, i.e., children writing in their first language, and secondly there is a much stronger methodological emphasis.
The principal objective of the paper is to highlight sample size problems attendant upon the use of the Type-Token Ratio measure, and in particular to discuss what the authors suggest to be flawed strategies adopted in the literature over the years in order to address such problems. The authors then propose their own solution, based upon modelling the relationship between TTR and token counts, applying this to their corpus of children's writing. It is subsequently claimed that the procedure adopted not only provides a reliable index of lexical diversity but also demonstrates that lexical diversity develops hand in hand with other linguistic skills.
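The sample size problem that the paper addresses can be shown with a few lines of code: as a text grows, repeated words accumulate faster than new ones, so the raw type-token ratio falls. This is a generic sketch of why the problem arises, not the authors' modelling solution, and the toy word list is invented for illustration.

```python
# Illustrative sketch of the type-token ratio (TTR) and its sample-size
# problem: TTR shrinks as the text gets longer, so raw TTRs computed
# from texts of different lengths are not comparable. Toy text invented.
def ttr(tokens):
    """Type-token ratio: number of distinct word forms / total tokens."""
    return len(set(tokens)) / len(tokens)

text = ("the cat sat on the mat and the dog sat on the rug " * 50).split()

# TTR drops sharply as the sample grows.
for n in (10, 100, 500):
    print(n, round(ttr(text[:n]), 3))
```

Chipere, Malvern and Richards' actual proposal models the curve relating TTR to token count rather than comparing raw ratios; the sketch only shows why such modelling is needed.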
2.2 Corpora FOR learners

The second section is concerned with corpora which are designed to benefit learning by allowing teachers and materials designers to provide better descriptions of the language to be acquired, and hence to decide what learners should learn: corpora FOR learners. This use of corpora was already well established in 1994, following the publication of the Cobuild project's frequency-based dictionaries and grammars: there seems little point in teaching learners very rare uses, or failing to teach them common ones. The argument has extended itself from general surveys to more specific ones using corpora comprising language from situations in which particular groups of learners are likely to find themselves, such as the university settings considered in the construction of the Michigan Corpus of Academic Spoken English (MICASE) corpus.3 This approach raises questions of corpus construction, with the need for teachers of specific groups to be able to rapidly compile ad-hoc corpora which can be used to assess the linguistic characteristics of particular domains and genres – an ever-easier (but also in a way more complex) task given the massive quantity of electronic texts available on the internet. The central issue here remains just what language and what texts should be proposed to learners as models. Should learners be expected to imitate native speaker language? As far as English is concerned, it is increasingly argued that they above all need to acquire English as a lingua franca (ELF), and that in consequence, what should be analysed are corpora of speech and writing involving non-native speakers. This argument can be overstated, since most learners are likely to need to understand the speech and writing of native speakers, even if not necessarily to imitate it.
But the move towards the study of ELF is an important reminder that language use is recipient-designed, to use the term of conversational analysts, and that it may not always be appropriate to take the language of native speakers for native speaker recipients as a model for learners' own production: corpora for language learners may not be the same as corpora for linguists. The selection of appropriate corpora will be determined by the teacher's and material writer's assessment of learners' needs and objectives, as a means of deciding what they should learn. The opening paper, Römer's comparison of real and ideal language learner input, has links with the paper by Tono which opened Section 1, in that it is concerned with the use of a corpus of EFL textbook texts. Römer justifies her study by pointing out that while considerable attention is being devoted in corpus studies to learner output (clearly the first section of this book is a testimony to this), relatively little interest has been shown as regards learner input, and in particular the (substantial) input from EFL textbooks. The author has accordingly constructed a "pedagogical corpus" (Hunston 2002:16) of EFL material. The texts selected by Römer are all (written) representations of spoken English in EFL texts. These were compared with the spoken part of the BNC, with particular focus on if-clauses, in order to seek insights into a topical question – of relevance not only to English language teaching but to language instruction in general – i.e., whether the input from foreign language textbooks is a fair reflection of the type of language students are likely to encounter in natural communicative situations. The section then moves from Römer's pedagogical corpus to more "classic" types, i.e., target language reference corpora and parallel corpora, although it will be seen that the different types have some common goals. Kettemann and Marko consider the use of TL reference corpora in the classroom, yet their focus is different from other papers in the volume in that they propose the analysis of corpora of literary texts (in particular the complete works of the writers examined). This move reflects the belief that approaching literary texts through corpora is a worthy pedagogical enterprise in many respects, not only in terms of foreign language acquisition but in particular in terms of awareness raising, whether this be language awareness, discourse awareness or methodological awareness. Examining corpora of classic British and American authors, Kettemann and Marko aim to raise the status of the literary corpus from its "subordinate position" in the TaLC sector, a position until now "too low-case to be assigned the capital L in the acronym".
The paper by Mauranen also describes classroom use of a TL reference corpus, though in this case the data examined are spoken rather than written. The author describes use of the MICASE Corpus within the framework of an experimental postgraduate course in English for Finnish students, using this as a springboard for reflections upon a number of topical issues in the TaLC sector. These include (i) the degree of authenticity of a spoken corpus, which is in a sense twice removed from its original context, (ii) the communicative usefulness of a speech corpus, and (iii) – an issue clearly close to the author's heart – is it fair that almost all spoken corpora consist of L1 adult data, i.e., is there a place for L2 spoken corpora, particularly as a model for international English? This goes hand in hand with the question of how necessary or relevant a highly idiomatic command of native-like English might be for most users of English
as a foreign language. The author ends by stressing that, for most teachers not specialized in corpus use, making the corpus leap can be a daunting task. She therefore calls for both more sensitive training and more user-friendly corpus materials, in order to spread the word more effectively. The section closes with Frankenberg-Garcia's paper on possible uses of a parallel corpus in second language learning, thus providing a variation upon the monolingual emphasis that has characterized the volume thus far. The issue is an interesting one because until now, as the author notes, parallel concordancing has been primarily associated with translation activities and lexicography, while it is monolingual concordancing which has prevailed in the language learning domain. Drawing upon examples from COMPARA, a parallel, bi-directional corpus of English and Portuguese, the author seeks to identify (i) which language learning situations might derive benefit from a parallel corpus, and (ii) how the corpus might best be exploited in the language classroom by teachers and learners alike. Such questions are not easily answered – and in any case lie on a different axis by comparison with monolingual corpora in pedagogy – precisely because of the dual nature of the corpus. Parallel concordancing offers contrasts not only between translational and non-translational language, but also between L1 and L2. It goes without saying that earnest reflection is required if such contrasts are to be converted to productive use in language teaching, and in this respect Frankenberg-Garcia furnishes some much-needed insights.
2.3 Corpora WITH learners

The papers in the third section testify to a different perspective, again implying distinctive criteria for corpus selection or construction. Rather than what should be learned, they focus on how learning should take place. Right from the first TaLC conference, there were papers which viewed corpora primarily as tools which learners could use to find out about the language (and the culture behind that language) for themselves, with or without the help of their teachers. The section "Corpora WITH learners" includes discussions of a number of activities designed to help learners use corpora and to acquire linguistic knowledge and skills through their use. Here, the choice of corpora will depend on their appropriacy not as descriptive, but as learning tools. Sripicharn's focus is on the processes and strategies adopted by users during concordance-based activities. He conducts an experiment to assess the performance of a group of Thai students against that of a group of English native speakers, asking them to perform a number of concordance-based tasks. The author underlines the significantly different approaches used by the respective groups, with the Thai students privileging data-driven hypothesis-testing strategies, while the English students paid less attention to the data and relied more on intuitive reactions, though both groups came up with sophisticated observations. Sripicharn, however, warns against the dangers of learners overgeneralising from the kind of restricted data attendant upon a small-scale study such as this. Pérez-Paredes and Cantos-Gómez also provide an example of how corpora can be used with learners. However in this case it is first and foremost the student, rather than the researcher/teacher, who examines the results. The authors collected samples of oral output from a group of Spanish students, then returning the findings to the group in the form of a spreadsheet complete with data on tokens, types, content words, frequency bands and other aspects of the students' performance. The members of the group were then invited to compare their own individual production with mean data for the whole group, and consequently to provide an appraisal of their strengths and weaknesses. By confronting students with their own output, the authors aim to encourage learner autonomy through a guided process of self-discovery. Davies' survey of classroom use of Spanish reference corpora occupies a shared middle ground between this and the previous section. It qualifies as corpora for learners inasmuch as it involves the use of TL reference corpora, but its chief emphasis is that of corpora with learners, since, like the contribution by Pérez-Paredes and Cantos-Gómez, it focuses upon learners' ability to assimilate and draw conclusions from the available data.
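The concordance-based activities discussed in this section revolve around keyword-in-context (KWIC) displays. The following minimal concordancer is a generic sketch of the idea, not any tool used in these studies, and the sample text is invented.

```python
# Minimal keyword-in-context (KWIC) concordancer: list every hit of a
# search word with a fixed window of co-text, as in classroom
# concordancing activities. Generic sketch; the sample text is invented.
def kwic(tokens, keyword, width=4):
    """Return one formatted concordance line per occurrence of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

sample = "are you listening to me are you staying at your mum's tonight".split()
for line in kwic(sample, "you", width=3):
    print(line)
```

Real concordancers add sorting on left or right co-text, which is what lets learners spot collocational patterns at a glance.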
Davies reports the findings of an on-line course entitled "Variations in Spanish Syntax" for graduates in Spanish from different parts of the United States. The participants were trained in the use of a number of reference corpora, including the author's own 100-million word Corpus del Español, and then assigned tasks regarding complex features of Spanish syntax where they were required to compare the corpus data with specific explanations provided in a well-known reference grammar of Spanish. Davies is especially interested in the role of the learners both as researchers, in locating useful corpus data, and as critics, in evaluating the findings of fellow learners. The survey thus has clear links with Römer's paper in Section 2, since it takes as its justification and premise the notion that learners need to move beyond the sometimes simplistic usage and
rules provided in foreign language manuals and textbooks. In brief, it is now recognized that what corpora should be used in the context of language teaching and learning depends on what they are to be used for. A variety of uses implies a variety of corpora. The papers in this volume indicate the richness of the issues raised, and the vitality of the field.
3. Looking out and around: A theory for TaLC, a future for TaLC?

Two papers escape any obvious grouping along the lines just discussed. These papers have been given very prominent positions in the volume: one opens it, and one closes it. The first is an important contribution to modern linguistic thought in the form of Michael Hoey's discussion of the pervasiveness of priming in language use, and in particular the textual priming of lexis. Hoey argues that all lexical items are primed for grammatical and collocational use, i.e., every time we encounter a word it becomes "loaded with the cumulative effects of those encounters such that it is part of our knowledge of the word that it regularly co-occurs with particular other words or with specific grammatical functions" (p. 21). The author then underlines that priming goes beyond the sentence, i.e., that a lexical item may be primed (i) to appear in particular textual positions with particular textual functions, a phenomenon heavily influenced by text domain and genre, and (ii) to participate in cohesive chains. Because as individuals our exposure to language is unique, i.e., different from everybody else's, it follows that a word is primed for the individual language user. In other words, priming belongs to the individual and is constantly in flux. Hoey concludes with a discussion of the relevance of his theory of priming for pedagogical issues: firstly, its implications for language learners, i.e., how priming could be tackled within the walls of the classroom, and secondly, what bearing it has on the production of language in terms of routine and creativity. The second paper, by William Fletcher, is concerned specifically with how to exploit the world wide web to create ad hoc corpora, i.e., how to harness more efficiently and more selectively the prodigious quantities of machine-readable data available on-line, and more generally how to prioritize quality and relevance over quantity.
Fletcher argues that there are various obstacles which hamper on-line searches and thus prevent the web from realizing its full
potential. The most persistent drawback, the author claims, is the difficulty in identifying documents which are (i) germane to the user, and (ii) reliable. As a possible remedy for such problems, the author discusses his web concordancer KwiCFinder, which automates and renders more streamlined the process of retrieving relevant documents. However, searches can be time-consuming nonetheless, and with this in mind the author sketches an outline of two rather more visionary proposals. Since the orientation of most existing search engines is towards the general public and in any case driven by commercial requirements, it would be useful for learners and language professionals to have access to (i) a selective web archive and (ii) a specialized search engine, specifically tailored to their needs. With regard to the first, Fletcher states his intention to implement the Web Corpus Archive of web documents, which will collect, disseminate and build upon users' searches, with each member of the user community benefiting from the efforts of others. As regards the second proposal, Fletcher details his plan to create a Search Engine for Applied Linguists, which would enable sophisticated queries and furnish information such as the frequency and dispersion of a given form across the web pages included in the corpus. Finally, after a review of what is currently available on the market in terms of web concordancers, web corpora, and search engines for applied linguists, the author recommends some useful web search resources for language teaching and learning. While Hoey sketches a theory of language within which the papers that follow, and pedagogic applications of corpora in general, can be situated, Fletcher gives an exhaustive account of the role the WWW is playing today, and might play in the future of TaLC.
The perspectives adopted are very different, yet both are invaluable in providing insights which set the papers that form the core of this volume against the wider pictures of linguistic theory and language technology. In different ways, they suggest that there is indeed a future for TaLC.
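As an illustration of the core functionality a web concordancer provides, the key-word-in-context (KWIC) operation can be sketched in a few lines of Python. This is a toy routine over a plain string, not KwiCFinder itself, and the sample text is invented:

```python
import re

def kwic(text, node, width=30):
    """Return a KWIC display line for every occurrence of `node` in `text`."""
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(node), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad both contexts so the node word lines up in a central column.
        lines.append("%s[%s]%s" % (left.rjust(width), m.group(0), right.ljust(width)))
    return lines

sample = ("The web offers prodigious quantities of text. A web corpus can be "
          "assembled ad hoc, and a web concordancer can then search it.")

for line in kwic(sample, "web"):
    print(line)
```

A real tool would add what Fletcher's proposals call for on top of this: relevance filtering of the retrieved documents, and frequency and dispersion counts across them.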
4. Authenticity: A common thread running through TaLC

4.1 What is authentic language?

For over a decade, authenticity has arguably been the one fundamental theoretical and methodological issue which all those with an interest in applying corpora to didactic uses have sooner or later had to confront. Several papers
in this collection tackle this central issue, which was also the main object of a joint keynote session by John Sinclair and Henry Widdowson at TaLC 2002, discussing "Corpora and language teaching tomorrow". The issue here is whether the language that foreign language learners are exposed to (from example sentences in grammar books or on blackboards to readings, videos etc.) should necessarily be "authentic". Authenticity in this sense refers to a piece of text being "attested", having occurred as part of genuine communicative (spoken or written) interactions. According to Hoey (this volume), exposure to authentic data is crucial since "only authentic data can preserve the collocations, colligations, semantic associations of the language" (p. 37). Indeed, it is this belief that motivates more and more teachers to introduce corpora into their classrooms. Römer (this volume) provides an example of the difference between "authentic" and "made-up" (or, more precisely, "made-up sounding") examples. She cites an example from her EFL textbook corpus where the following exchange is used to illustrate the present progressive in yes/no questions:

(1) MR SNOW: Hello, Wendy.
    MRS SNOW: Hello, Ron.
    MR SNOW: Where are the girls? Are they packing?
    MRS SNOW: Yes, they are.
    MR SNOW: Or are they playing?
    MRS SNOW: No, they aren't, Ron. They are packing.
On the other hand, as Römer points out, a search of the BNC spoken component retrieves utterances such as:

(2) What's happening now, does anybody know?
(3) What are we talking about, what's the subject?
(4) Are you listening to me?
(5) Are you staying at your mum's tonight?
    No. I'm staying at Christopher's.4
Competent speakers of English might consider the corpus examples to be more “natural” than the textbook examples.5 Römer goes on to claim that the corpus backs up this impression, confirming that the two verbs, “packing” and “playing”, are not at all frequent in the pattern “are they VERB-ing”. It would seem,
as claimed by Sinclair (in many places, among them at TaLC 2002), that "we cannot trust our ability to make up examples"... But corpora are great sources of serendipitous findings, as we all know. So let us stick for a moment with "are they", and look at a concordance of this string as the first element of a spoken utterance in the BNC. To start with, "are they" does not seem to colligate very often with the progressive. Out of 393 solutions, only 41 are followed by a verb in the progressive form. Of these, 17 are instances of the pattern "are they going to/gonna VERB", leaving only 24 "good" candidates. An example of these is the following (KDE):

(6) PS0M4: Alia and Aden are coming around to play with you this afternoon.
    PS0M5: Are they coming now?
    PS0M4: In a minute.
This short exchange may appear somewhat more similar to the textbook examples than the corpus examples in 2–5, and as such possibly less natural than the latter. Let us consider another short extract from the same conversation:

(7) PS0M5: Who who bought this?
    PS0M4: Mummy and daddy bought it.
    PS0M5: Where did it came from?
    PS0M4: It comes from the Gap.
If we remove the hesitation in line 1 and correct the grammar in line 3, we have a typical textbook example of WH-questions. By contrast, the following example comes across as more natural:

(8) PS04U: What's Ken and Marg having turkey at Christmas or
    PS04Y: Mm?
    PS04U: are they having turkey at Christmas or don't they, don't you know?
    PS04Y: I don't know what there'll [sic] have, you see Naomi and Mitch are vegetarian ...
And yet both 6–7 and 8 are authentic, attested corpus examples. The exchanges in 6–7, not unexpectedly perhaps, take place between a mother and her son aged 3; the one in 8, between two housewives. Could it be the case, then, that authenticity of language is to be treated not as an absolute feature, but rather as a gradient feature? Or, in other words, could it be the case that some instances of attested language use are more "prototypically" authentic than others? And that in evaluating authenticity we should take into account what words are being spoken/written, as well as to whom they are addressed, for what purpose(s) and so forth?
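The concordance sifting described above (393 "are they" solutions, of which 41 are followed by a progressive, 17 of those being going to/gonna cases) can be approximated with a simple filter over concordance lines. The sample lines below are invented stand-ins for BNC hits, and a serious analysis would need part-of-speech tagging rather than a bare -ing match (which would also catch nouns such as king); this is a sketch only:

```python
import re

# Invented stand-ins for BNC concordance solutions beginning "are they".
hits = [
    "are they coming now?",
    "are they going to win?",
    "are they gonna leave soon?",
    "are they packing?",
    "are they ready yet?",
    "are they in the kitchen?",
]

# Hits where "are they" is followed by an -ing form (or "gonna").
progressive = [h for h in hits
               if re.match(r"are they (\w+ing|gonna)\b", h)]

# The "are they going to/gonna VERB" cases, to be set aside.
going_to = [h for h in progressive
            if re.match(r"are they (going to|gonna)\b", h)]

# The remaining "good" progressive candidates.
good = [h for h in progressive if h not in going_to]

print(len(progressive), len(going_to), len(good))
```

On the real BNC data this kind of count gave 41, 17 and 24 respectively; here the toy sample yields 4, 2 and 2.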
4.2 A richer view of authenticity

Mauranen (this volume), for instance, proposes a distinction between "subjective" authenticity (as perceived by learners) and "objective" authenticity (as evaluated by a teacher or researcher). She also acknowledges that at least certain instances of spoken corpus material (e.g. dialogue) may be seen as less authentic than written corpus material. While the latter requires a reader in order to be interactively complete, the former is a record of an interactional event that is complete in itself. The learner can only interact with this type of spoken material as an external observer. And yet, she argues, observing interaction is as important as participating in interaction. By highlighting repeated patterns, spoken corpora offer a more form- and function-oriented approach to interaction than is possible in real-life situations, where observers are more likely to focus on content and the unfolding situation.

Nesselhauf (this volume) would similarly appear to endorse the richer view of authenticity described above. She suggests that, alongside frequency in native speaker usage, there are a number of other criteria on which recommendations for teaching should be based. Among these, for instance, is the "degree of disruption of an unacceptable expression for the recipient": if a mistake is a likely cause of misunderstandings, it should be insisted upon more. Similarly, we might add, learners are likely to need sophisticated repair strategies and routines which make up for their language deficiencies. Whether and to what extent these are attested in monolingual reference corpora of the target language is an open question. The debate on authenticity thus feeds into a more general debate over the most appropriate model of language for learners.
Current work on ELF (Mauranen this volume, Seidlhofer 2001) suggests that native speaker corpora of the target language might, by definition, not provide an ideal model, and that a better alternative could be "good international English spoken in academic and professional contexts" (Mauranen this volume). Corpora of such English would be contextually more appropriate, recording language spoken in situations in which learners are likely to find themselves. They would provide indications of successful (and unsuccessful) strategies that competent non-native speakers use in
interaction with each other and with native speakers. And they would be fairer to foreign learners and teachers, setting them a more achievable and more coherent target than that of an idealized community of native speakers. ELF corpora are just beginning to see the light: substantial efforts are needed to build them, evaluate their contribution to language teaching, and overcome the likely resistance of teachers and learners, who might not like the idea of doing without the useful fiction of the "native speaker" model. But the debate is open.
4.3 A decade-long controversy: What next?

As mentioned above, it is no coincidence that authenticity in language teaching/learning features so prominently in this volume. The discussion was rekindled at TaLC 2002 by the joint plenary on "teaching and language corpora tomorrow" given by John Sinclair and Henry Widdowson, who agreed to discuss their current positions with respect to this topic, a decade after two well-known articles first sparked off interest in it (Sinclair 1991, Widdowson 1991). As it happens, their positions turned out to be distinct in theory, and yet far from irreconcilable in practice.6

Building on a "syntagmatic" view of language, Sinclair suggests that at the foundation of language teaching in the future there is likely to be the lexical item, a unique form of expression that goes together with a unique meaning. Like words, lexical items are not regulated by the open-choice rules of grammar. They can undergo modifications (expansion, contraction, (ironic) exploitation etc.) which are regulated by convention, by the idiom principle. Unlike words, however, lexical items are unambiguous. Sinclair has provided several memorable examples of lexical items, e.g. those whose core constituents are the words brook (VB), budge, gamut or naked eye. In the case of gamut, for instance, he suggests (Sinclair 2003) that this lexical item consists of a verb, usually run, followed by a noun group containing an article, usually the, an optional adjective, e.g. whole or a synonym, the node word gamut and a prepositional phrase or another adjective referring to the area over which the phrase ranges. This lexical item, whose simplified base form might thus be RUN the whole gamut of ..., has the unified function of referring to a set of events, highlighting its size/complexity and the extensiveness of the coverage achieved.
A syntagmatic model of language centred around the lexical item should make learning easier, safer, and arguably more successful, Sinclair claims, since learners do not have to cope with lexical ambiguity or worry about lexicogrammatical choices below the level of the lexical item. This change of perspective implies that, while contrived examples might be acceptable in a paradigmatic approach, in a syntagmatic approach they would not be, since intuition is notoriously unreliable when it comes to identifying, exemplifying or describing lexical items. Now this is not to imply that any real example is fine, but rather that "to have occurred in communication is a necessary, but not a sufficient condition for [a piece of text to be] presented as a model of language" (Sinclair 2002).

Widdowson's approach is complementary rather than opposed to Sinclair's, shifting the perspective, in Widdowson's words, from LCat to Talc, from language theories and descriptions which have (crucial) implications for teaching, to language theories and descriptions which are subservient to teaching and a means towards learning. He does not deny "the enormous contribution that corpora have made over the years to linguistic descriptions", but suggests that, especially when time and resources are limited, as in most language courses, decisions have to be made about what to teach based not only on (frequency of) occurrence in the target community, but also on what language is the best investment for learners:

So here the question has to do with what has to be taught to provide an impetus for learning, how do you create the conditions for learning to take place beyond the end of the course ... an acceptance that some things are teachable, and some are only learnable, in the sense that we could only point learners in the right direction, developing "vectors of learning". (Widdowson 2002)
It might turn out that the most frequent lexical items attested in a general corpus of the target language, taught using corpus materials, provide just this impetus, and constitute a valid basis for a language course syllabus. The work of Tim Johns and colleagues on Data-Driven Learning (Johns and King 1991) goes in this direction. But once again, this is an open question that awaits empirical verification. The importance and value of LCat is nowadays undisputed. The syntagmatic model which owes so much to the work of Sinclair is generally perceived as a more accurate model of language for teaching purposes than the paradigmatic one. And the fact that virtually every new learner dictionary to come out is corpus-based bears eloquent testimony to this.
Evidence in favour of Talc is, on the other hand, still limited (exceptions are Cobb 1997, Gitsaki 1996, Sripicharn this volume). We do not know for sure if learners become better at using the language for their intended purposes when taught within a framework which follows the underlying principles of a syntagmatic model. After five TaLCs, and a decade of discussion, there is still much we have to learn about the effects of our teaching practices on learners: whether corpus use and corpus-inspired materials affect their learning path, and whether they do so in a positive manner. In the words of Vivian Cook (2002:268):

Memorable, interesting, invented sentences may lead to better conscious learning of language and ultimately to better unconscious language use; on the other hand the more neutral the sentence the more its language elements may be absorbed into the students' competence. [...] It may be better to teach people how to draw with idealized squares and triangles than with idiosyncratic human faces. Or it may not. The job of applied linguists is to present evidence to demonstrate the learning basis for their claims [...].
Hopefully, the search for this evidence will feature prominently in the TaLC agenda for the next decade.
Notes

1. Brown: http://helmer.aksis.uib.no/icame/brown/bcm.html (visited 17.5.2004)
LOB: http://helmer.aksis.uib.no/icame/lob/lob-dir.htm (visited 17.5.2004)
BoE: http://www.cobuild.collins.co.uk/ (visited 17.5.2004)
BNC: http://www.natcorp.ox.ac.uk/ (visited 17.5.2004)
ICE: http://www.ucl.ac.uk/english-usage/ice/ (visited 17.5.2004)

2. ICLE: http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/icle.htm (visited 17.5.2004)

3. MICASE: http://www.hti.umich.edu/m/micase/ (visited 17.5.2004)

4. The reference in this last example is of course to future time, i.e., it is a "present progressive as future", as grammar books sometimes have it.

5. We do not intend to go into the thorny question of the difference between genuineness, naturalness and authenticity, but it is clear that contextual issues are key in any such discussion. The textbook example cited might not appear at first sight to be particularly typical, but it would not be too arduous a creative task to imagine a situation in which it might actually be attested (a tense, awkward, in part sarcastic exchange between an estranged husband and wife, where the husband, who has come to pick up the kids and take them on holiday, doubts
his wife's capacities as a mother). In any case – paradoxically enough – the textbook example is now attested, and in a number of places to boot: in an EFL textbook, in an EFL corpus, and in this book (twice). Is it therefore "more" authentic?

6. References to "Sinclair 2002" and "Widdowson 2002" refer to their (unpublished) talks at TaLC 2002.
References

Cobb, T. 1997. "Is there any measurable learning from hands-on concordancing?". System 25, 3:301–315.
Cook, V. 2002. "The functions of invented sentences: A reply to Guy Cook". Applied Linguistics 23, 2:262–269.
Gitsaki, C. 1996. The development of ESL collocational knowledge. PhD thesis, The University of Queensland.
Granger, S. (ed.) 1998. Learner English on Computer. London and New York: Longman.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Johns, T. and King, P. (eds) 1991. Classroom Concordancing. ELR Journal 4. Birmingham: University of Birmingham.
Seidlhofer, B. 2001. "Closing a conceptual gap: The case for a description of English as a lingua franca". International Journal of Applied Linguistics 11, 2:133–158.
Sinclair, J.McH. 1991. "Shared knowledge" in Georgetown University Round Table on Languages and Linguistics 1991, J. Alatis (ed.). Washington, D.C.: Georgetown University Press, 489–500.
Sinclair, J.McH. 2003. Reading Concordances. Harlow: Longman.
Widdowson, H.G. 1991. "The description and prescription of language" in Georgetown University Round Table on Languages and Linguistics 1991, J. Alatis (ed.). Washington, D.C.: Georgetown University Press, 11–24.
The textual priming of lexis

Michael Hoey
University of Liverpool, UK
This paper sketches a theory of language that gives lexis and lexical priming a central role. All lexical items are primed for grammatical and collocational use, i.e., every time we encounter a lexical item it becomes loaded with the cumulative effects of those encounters, such that it is part of our knowledge of the word that it regularly co-occurs with particular other words or with specific grammatical functions. Priming also goes beyond the sentence, i.e., a lexical item may be primed (i) to appear in particular textual positions with particular textual functions, a phenomenon heavily influenced by text domain and genre, and (ii) to participate in cohesive chains. Because as individuals our exposure to language is unique, i.e., different from everybody else’s, it follows that a word is primed for the individual language user. In other words, priming belongs to the individual and is constantly in flux. The theory of priming clearly has relevance for pedagogical issues, with important implications both for language learning, i.e., the way priming is to be tackled within the walls of the classroom, and for language production, in terms of routine and creativity.
Sinclair (e.g., 1991) has argued that the study of lexis leads to results incompatible with the descriptions provided by conventional grammars. Biber et al. (1999) have argued that lexical bundles characterize text types. Farr and McCarthy (2002) argue that the function of conditionals is specific to particular types of interaction. Morley and Partington (2002) argue that syntax is an epiphenomenon of lexis. All this suggests we need a new theory. In this paper I want to put forward a theory of language that places lexis at its very centre and gives to vocabulary the pivotal status once awarded to syntax. What I have to say is only a beginning – a mixture of the self-evident and the unproven. Much of what I am going to say will seem obvious but I want to build from a shared position to positions that may seem novel or wrong, though I shall defend those positions fiercely. The theory I shall briefly outline here has links with Brazil's work on the grammar of speech (Brazil 1995), with
Construction grammar (e.g. Goldberg 1995) and with Pattern Grammar (Hunston and Francis 2000). It assumes the correctness of Halliday's interpersonal and ideational metafunctions (Halliday 1967–8) but rejects, and attempts to supersede, his account of the textual metafunction (while retaining the insights that Halliday's model provides).

The classical theory of the word is epitomized by those two central nineteenth and early twentieth century compendia of lexical scholarship – Roget's Thesaurus (1852) and the Oxford English Dictionary (Murray et al. 1884–1928). According to such texts, lexis can be described in terms of hyponymy and co-hyponymy, near synonymy and antonymy, and has meaning(s) which can be defined using the lexical relations just mentioned. Every word, furthermore, belongs to one or more grammatical categories and has pronunciation, etymology, and history. According to the theory that underpins these positions, words interact with phonology through pronunciation, with syntax through their grammatical categories and with semantics through their senses; they find their place in diachronic linguistics through etymology. In such a theory, the lexical item is reactive to other systems, particularly those of grammar and phonology, and in some versions of the classical theory the relationship between the word and the other systems has been so weak that the grammar has been generated first and the words brought in as the last stage in the process (Chomsky 1957, 1965), or the semantics have been generated first and the words seen as merely expressing the pre-existent meaning. Systemic-functional linguistics has an altogether more central place for lexis, but even in this model the systems can sometimes make it seem as if lexical choice is the last (because most delicate) choice to be made.
Even where a theory starts from the assumption that lexis is chosen first, or at least much earlier, the assumption is still that it passes through a grammatical filter which organizes and disciplines it.

I referred above to those great 19th century works of scholarship – the OED and Roget's Thesaurus. It is interesting that these works have outlived almost all the theoretical work (apart from Saussure's) of the same period. In the same way I am convinced that, when linguists look back at the 20th century, it will not be the grammatical theories that will be admired as permanent works of the highest scholarship but the corpus-backed advanced learners' dictionaries, starting with Collins COBUILD and continuing with Oxford, Longman and Macmillan, and it is these works, and of course in particular the first Collins COBUILD Dictionary, that have shown the traditional
view of lexis outlined in the previous paragraphs to be suspect. In particular, what these dictionaries and the accompanying corpus-linguistic work have established beyond doubt is the centrality and importance of collocation in any description of lexis. Collocation no longer needs support, but it demands explanation. The only explanation that makes sense of its ubiquity, and indeed its existence, is psychological in nature. Every lexical item, I want to argue, is primed for collocational use. By primed I mean that as a word is acquired through encounters with it in speech and writing, it is loaded with the cumulative effects of those encounters such that it is part of our knowledge of the word (along with its senses, its pronunciation and its relationship to other words in the same semantic set) that it regularly co-occurs with particular other words. As Sripicharn (2002 and this volume) put it in a paper at the conference to which this volume owes its existence, "years of listening to people speaking make me know which words sound right together." Each use we make of the word reinforces the priming (unless our use runs counter to the priming it has received), as does each new encounter with the word in the company of the same co-occurring other words. Each encounter with the word that does not reinforce the original priming either weakens that priming slightly or complicates it. A word may, and routinely does, accumulate a range of primings which are weighted in our minds in a variety of ways that take account of relative frequency, mode, genre and domain. Part of our knowledge of a word is that it is used in certain kinds of combination in certain kinds of text.
So I hypothesize (supported by small quantities of data) that in gardening texts during the winter and during the winter months are the appropriate collocations, but in newspaper texts or travel writing in winter and in the winter are more appropriate; the phrase that winter is associated with narratives. It follows from the processes involved in collocational priming that it is not in principle a permanent feature of the word. As new encounters alter the weighting of the primings, so they shift in the course of an individual's lifetime, and as they do so (and because they do so) words shift imperceptibly in their meaning or their use. I suspect that for many older linguists such a shift has occurred in the priming of the word collocation itself! Its collocations, post-Halliday and Hasan (1976) and pre-Sinclair (1991), were, I suspect, predominantly with the words text and sentence, rather than with corpus and word. So collocational priming is context specific and subject to change. It is also, importantly, a matter of weighting rather than requirement. So the relatively rare phrase through winter is just as much English as in winter. Priming belongs to
the individual. A word is primed for a particular language user. A corpus cannot demonstrate the existence or otherwise of a priming for any individual. It can only show that a particular combination is likely to be primed for anyone exposed to data of the kind represented in the corpus in question.

If we accept these positions, as I think we must if we are to account for the existence and prevalence of collocation, we open the way to a more general recognition of the notion of priming. To begin with, the grammatical category a word belongs to can be seen as its grammatical priming. Instead of saying "This word is a noun" or "This word is an adjective", I would argue we should say "This word is primed for use as a noun". In other words, the word is loaded with the grammatical effects of our encounters with it in the same way as it is loaded with collocational effects. If the encounters all point the same way, we assume 100% identification of the word with a particular grammatical category; this happens occasionally with collocation also (e.g. kith with kin). Nevertheless such total identification is not as common as we might imagine (Hoey 2003). Words such as estimated (V, adj), teaching (V, N, adj), human (N, adj), and real (as in real nice, get real, real world, the real and the unreal) are the norm. How, for example, might one categorize red in a red sunset, the colour red, he went red or he saw red? If we agree that words are primed for grammatical category, the question must be regarded as inappropriate. As with collocational priming, grammatical priming can change through an individual's lifetime. Anyone British and over 50 is likely to have had the word program shift in its priming from noun to verb in writing. (With the alternative spelling programme, or from an American perspective, the priming shift will have been different.) As with collocation, grammatical priming is context specific.
In the conversation of homophobes, queer is primed as adjective and noun, but in the writings of cinema theorists, queer is primed as adjective only (queer cinema, queer theory). This means that the priming must be tagged for domain, purpose and genre. Again, more controversially, the priming is a matter of weighting not requirement. Margaret Berry once wittily said that you can verb any noun. Strictly this is not true – it does not apply to nouns derived from verbs in the first place – but her observation encapsulates a real fluidity in the language. So routinely do we adjective our nouns that we see it as entirely normal and label the use as a noun modifier or classifier, rather than admitting the protean nature of language; the Oxford Dictionary of Collocations, however, treats such usage as adjectival. (After all, in a red sunset, we would traditionally treat the
word red as an adjective, despite its nominal use in the colour red.) This grammatical priming does not necessarily assume the prior existence of any grammatical category. Sinclair's masterly analysis (1991) of of as belonging to a grammatical category with just one member in it is a warning against the assumption that grammatical categories are givens in the language. It could indeed be argued that what we call grammatical categories may be post hoc generalisations derived from the myriad individual instances of lexical priming that we encounter and take on board in the course of our language development.

One last point needs to be made about both collocation and grammatical categories, and it is a point that equally applies to those categories of priming I have yet to explicate. This is that primings nest. Thus wing collocates with west, west wing collocates with the, and the west wing collocates with in. Similarly, face is in the first place primed for use as a noun or verb. Put into the phrase in your face, however, face loses the verbal priming. Once very is added, the latent ambiguity of in your face disappears, and so does the nominal priming. In its place the phrase in your face (in very in your face) is primed for adjectival use.

If we accept, at least for the sake of argument, that words are collocationally and grammatically primed, in other words if we accept that the learning of a word involves learning what it tends to occur with and what grammar it tends to have, it opens the door to the possibility of other kinds of priming. The first of these is semantic association, which in earlier papers (e.g. 1997a, 1997b) I referred to as semantic prosody (following, or rather mis-following, Louw 1993, and Stubbs 1995, 1996), and which Sinclair (1996, 1999) refers to as semantic preference.
I would use Sinclair's term were it not for the fact that I want to avoid building the term "preference" into one of the types of priming, since one of the central features of priming is that it leads to a psychological preference on the part of the language user. Also, the use of "association" is designed to pick up on the familiar "company a word keeps" metaphor used to describe collocation. The change of term does not represent a difference of opinion. Semantic association is defined as occurring when a word is associated for a language user with a semantic set or class, some members of which are also collocates for that user. The existence of the collocates in part explains the existence of, and in part is explained by, the semantic set or class in question. As an example of semantic association, consider the verb lemma train, analysed in considerable detail by Campanelli and Channell (1994) and cited by Stubbs (1996). Train (in my corpus) collocates with as a, and the resultant
combination of words has a semantic association with the notion "skilled role or occupation". The corpus has 292 instances of train* as a, of which 262 were followed by an occupation or related role. The data included the following (numbers of occurrences are given in brackets):

train* as a teacher (25)
train* as a doctor (12)
train* as a nurse (11)
train* as a lawyer (11)
train* as a painter (8)
train* as a dancer (7)
train* as a barrister (5)
train* as a chef (5)
train* as a social worker (5)
train* as a solicitor (5)
train* as a braille shorthand typist (1)
train* as a concentration camp guard (1)
train* as a kamikaze pilot (1)
train* as a boxing second (1)
train* as a cobbler (1)
train* as a train waiter (1)
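A frequency list of this kind can be produced by extracting the word following train* as a from concordance lines and counting. The lines below are invented examples (the corpus cited above has 292 real instances), and the single-word capture truncates multiword roles such as kamikaze pilot; this is a sketch only, not the procedure Campanelli and Channell used:

```python
import re
from collections import Counter

# Invented concordance lines for "train* as a".
hits = [
    "she trained as a teacher in Leeds",
    "he is training as a nurse",
    "she also trained as a teacher before the war",
    "training as a doctor takes years",
    "he trained as a kamikaze pilot",
]

# Capture the single word following "train* as a".
pattern = re.compile(r"train\w* as a (\w+)")
matches = [pattern.search(h) for h in hits]
occupations = Counter(m.group(1) for m in matches if m)

for role, n in occupations.most_common():
    print("train* as a %s (%d)" % (role, n))
```

Sorting by descending frequency, as most_common does, reproduces the presentation of the list above, with the clear collocates at the top and the one-off roles at the bottom.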
The combination train* as a has some clear collocates (teacher, nurse, doctor, lawyer, painter, dancer, barrister, chef, social worker and solicitor), as the frequency figures suggest. But it is hard to imagine evidence ever being available to support the idea that braille shorthand typist is a collocation of train as a – except, importantly, in a specialist corpus of, say, minutes of the Royal Society for the Blind. But its occurrence is still accounted for because of the generalization inherent in the notion of "semantic association". Many semantic associations such as the one just given seem to be grammatically restricted. Although there are plenty of instances of train with "skilled role or occupation" in other combinations in my data, particularly as teacher training, the relationship is in part constructed through the structure given in the column of data above. This suggests that for some lexical items there might be restrictions that are not simultaneously instances of semantic association. These can be covered under another type of priming – colligational priming. The term "colligation" was coined by Firth, who saw it as running parallel to collocation. He introduced it as follows:
The textual priming of lexis
The statement of meaning at the grammatical level is in terms of word and sentence classes or of similar categories and of the inter-relation of those categories in colligation. Grammatical relations should not be regarded as relations between words as such – between ‘watched’ and ‘him’ in ‘I watched him’ – but between a personal pronoun, first person singular nominative, the past tense of a transitive verb and the third person singular in the oblique or objective form. Firth (1957:13)
As put, it is difficult to distinguish his notion from that of traditional grammar. Interestingly, though, Firth’s student, M.A.K. Halliday, used colligation in an apparently different way, and it is to be assumed that his use followed Firth’s intention. This is how Halliday introduces colligation: The sentence that is set up must be (as a category) larger than the piece, since certain forms which are final to the piece are not final to the sentence. Of the relation between the two we may say so far that: 1, a piece ending in liau or j¦e will normally be final in the sentence; 2, a piece ending in s¦ i2, ηa, heu or sanhηgeu2 will normally be non-final in a sentence; 3, a piece ending in lai or kiu may be either final or non-final in a sentence. Halliday (1959:46; cited by Langendoen 1968, as an example of Halliday’s use of colligation)
Halliday here uses colligation to mean the relation holding between a word and a grammatical pattern, and this is how the term is currently used. For several decades it disappeared from sight with only the most occasional of references, and returned into use in papers from Sinclair (1996, 1999) and myself (1997a, 1997b). (We were not aware of each other’s work, but, as we were colleagues for many years, it is more than possible that I picked up the notion from conversations with him without realising that I had done so. In any case, the earlier of the papers in which he discusses colligation predates mine by a year, so the credit for resurrecting this valuable concept must rest with him.) One point to note is that Halliday formulates the colligational relationship in terms of sentential position. Thus colligation covers not only grammatical relations as conventionally understood but also such matters as Theme/Rheme position – and, I shall later argue, textual positioning too. If one considers the conventional grammatical statements one might make about the first two words of a clause such as

(1) The cat sat on the mat [fabricated, as if you did not know]
they include the following:
a. cat is head of the nominal group in which it appears
b. The cat is Subject
c. The cat is Theme of the sentence.
In other words, we are capable of talking about a word’s place in its group, the function that the group plays in the clause and the textual implications of its position. It should be no surprise therefore that colligations can take any of these forms. I define colligation as

a. the grammatical company a word keeps (or avoids keeping) either within its own group or at a higher rank.
b. the grammatical functions that the word’s group prefers (or avoids).
c. the place in a sequence that a word prefers (or avoids).
My claim is that every word is primed to occur in certain grammatical contexts with certain grammatical functions and in certain textual positions, and this priming is as fundamental as its priming for collocation or semantic association. I see connections between colligation as I am here describing it and the notion of “emergent grammar” referred to by Farr and McCarthy (2002). There are also clear parallels between the position here formulated and Hunston and Francis’s pattern grammar (2000). As an instance of the first type of colligation, the grammatical company a word keeps (or avoids keeping) either within its own group or at a higher rank, consider the word tea, which characteristically is strongly primed to occur as premodification to another noun, e.g.

tea chest
tea pot
tea bag
tea urn
tea break
tea party
It is also typically primed to occur as part of a postmodifying prepositional phrase, usually with of, e.g.
time for tea
a cup of tea
a pot of tea
a packet of tea
her glass of tea
nine blends of tea
Even the Guardian newspaper with which I work provides evidence of this despite the low occurrence of such items as tea pot and tea set (presumably because the mechanics of tea-making are rarely newsworthy). In my data tea occurs as premodification over a quarter of the time (29%) and as part of a postmodifying prepositional phrase just under 19% of the time. When it does not occur as premodification or as part of a prepositional phrase, it is often coordinated or part of a list, this accounting for almost 20% of such cases:

green tea and melon
tea and coffee
tea and sandwiches
tea and refreshments
tea and digestives
tea and toast
tea and scones
tea and sympathy
tea and salvation
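Proportions of this kind can be approximated mechanically. The sketch below is my own illustration, not the procedure used for the Guardian data: the three regular-expression frames (premodification, of-phrase, coordination) are crude heuristics, and the sample sentence is fabricated.

```python
import re

# Rough frames (my own heuristics): a following lower-case word that is
# not a conjunction suggests premodification ('tea pot'); 'of tea'
# suggests a postmodifying prepositional phrase; 'tea and X' suggests
# coordination.
PREMOD = re.compile(r"\btea\s+(?!and\b|or\b)[a-z]+")
OF_PP = re.compile(r"\bof\s+tea\b")
COORD = re.compile(r"\btea\s+and\s+[a-z]+")

def tea_profile(text):
    """Count the three frames in a lower-cased stretch of text."""
    text = text.lower()
    return {"premodifier": len(PREMOD.findall(text)),
            "of_pp": len(OF_PP.findall(text)),
            "coordination": len(COORD.findall(text))}

print(tea_profile("A cup of tea, a tea pot, a tea urn, tea and sympathy."))
# {'premodifier': 2, 'of_pp': 1, 'coordination': 1}
```

Dividing each count by the total number of occurrences of tea would give percentage figures comparable to the 29% and 19% quoted above, though any such surface heuristic will need hand-checking.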
It will be noted above that colligations can be negative as well as positive, and one of tea’s most obvious colligations is negative: it is typically primed to avoid co-occurrence with markers of indefiniteness (a, another, etc.). Just as with other primings, this is a tendency, not an absolute. There are 52 instances of a tea or a …tea in my data, just over 1% of instances, and one instance with another. Examples are:

(2) a lemonade Snapple and a tea, milk no sugar
(3) a tea made from the blossoms and leaves
(4) there was never a tea or a bun at Downing Street
(5) a Ceylon tea with a fine citrus flavour
(6) to enjoy a cream tea or a double brandy
(7) Oh I’ll have a tea, two sugars, thank you very much for asking
(8) Another tea and I start dealing with the day’s twaddle
Notice that four of the above examples are not interpretable as “a type of tea”. So tea can occur with indefinite markers – it is not a matter of grammatical impossibility, nor a matter of a specific type of usage – but typically it is primed to avoid them. It is worth noting, too, that this aversion to indefinite markers is not the result of its being a drink. In my data there are 390 occurrences of the word Coke, referring to the cola rather than the drug or the fuel. Of the 314 instances which refer to the drink, as opposed to the company that markets the drink, 10% occur with a. (I have no instances with another, though there are three occurrences along the lines of a rum and coke.) All of the above illustrate the first type of colligation. As an instance of the second kind of colligation, consider the following data (Table 1), where the clausal distribution of consequence is compared with that of four other abstract nouns. It will be seen that there is a clear negative colligation between consequence and the grammatical function of Object. The other nouns occur as part of Object between a sixth and a third of the time. Consequence on the other hand occurs within Object in less than one in twenty cases. To compensate, there is a positive colligation between consequence and the Complement function. Only one of the other nouns – question – comes close to the frequency found for consequence. The others occur within Complement four times less often than consequence. There is also a positive colligation between consequence and the function of Adjunct, consequence occurring here nearly half the time.

Table 1. Distribution of consequence across the four main clause functions in comparison with that of other abstract nouns
              Part of     Part of     Part of      Part of     Other
              subject     object      complement   adjunct
Consequence   24% (383)   4% (62)     24% (395)    43% (701)   5% (74)
Question      26% (79)    27% (82)    20% (60)     22% (66)    4% (13)
Preference    21% (63)    38% (113)   7% (21)      30% (90)    4% (13)
Aversion      23% (47)    38% (77)    8% (16)      22% (45)    8% (17)
Use           22% (67)    34% (103)   6% (17)      36% (107)   2% (6)
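The percentage figures follow from the bracketed raw counts. As a quick check of the consequence row (counts transcribed from Table 1, percentages rounded to whole numbers):

```python
# Raw counts for 'consequence', transcribed from Table 1.
counts = {"subject": 383, "object": 62, "complement": 395,
          "adjunct": 701, "other": 74}

total = sum(counts.values())
shares = {fn: round(100 * n / total) for fn, n in counts.items()}
print(total)   # 1615
print(shares)  # {'subject': 24, 'object': 4, 'complement': 24, 'adjunct': 43, 'other': 5}
```

The rounded shares match the published row exactly, which is a useful sanity check when transcribing scanned tables.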
The other nouns in our sample occur between around a quarter and a third of the time. I would conclude that consequence is characteristically positively primed for Complement and Adjunct functions and negatively primed for Object function. (Interestingly, this is not true for the plural form consequences, which routinely occurs as part of Object, supporting the argument of Sinclair and Renouf (1988) and Stubbs (1996) against too ready an adoption of the lemma as the locus of analysis.) Consequence also illustrates the third type of colligation, in that 49% of instances in my data occur as part of Theme. Given that one would expect, on the basis of random distribution, that around 33% of instances would occur in Theme, this suggests that consequence is typically primed for this textual position. The position we have reached is that lexis is primed for each language user, either at the word or phrase level, for collocations, grammatical categories, semantic associations and colligations. I do not however believe that there is any necessity to assume that priming stops at the sentence boundary. After all, the third kind of colligation, concerned with textual positioning, is an overt claim that priming has a textual dimension, in that choice of Theme is in part affected by the textual surround and therefore we are primed to use consequence to encapsulate the previous text, whether as Adjunct or Subject. We can take this point considerably further. In Hoey (forthcoming) I argue that words may be primed to appear in (or avoid) paragraph-initial position. So consequences, for example, is primed to begin paragraphs, but consequence is primed to avoid paragraph-initial position.
I secretly hope that you respond to this information with the feeling that I am spelling out the obvious, in that it is obvious that we might start a paragraph with mention of a multiplicity of consequences and then spend the rest of the paragraph itemising and elaborating on those consequences, and equally it is obvious that if there is only a single consequence it will be tied closely to whatever was the cause. If you were to react in this way, that would be evidence that it is part of your knowledge of the words consequence and consequences that they behave in these textual ways – in short, that you were primed to use them in particular textual positions. It may be objected that consequence and consequences are exceptional in that they have long been recognized to have special text-organising functions (e.g. Winter 1977). But the evidence suggests that textual colligation is not limited to a special class of words, nor is priming for positioning only operative at sentence and paragraph boundaries or in the written word only. As an
example of spoken priming, the definite article the has an aversion to appearing at the beginning of conversational turns (McCarthy, personal communication). As an example of textual priming at a level higher than the paragraph, take sixty which, when the group in which it appears is sentence-initial, is positively primed for text-initial position. In my newspaper data, 14% of thematized instances of sixty are text-initial. Given that the average length of texts beginning with sixty is 20 sentences, this means that sixty begins a text three times more often than would occur on the basis of random distribution. Sixty begins newspaper texts for a variety of reasons, all of which are specific to the goal of newspaper production. In the first place, sixty is a majority in terms of percentage and therefore potentially newsworthy; a number of such texts begin Sixty per cent of... Newspapers are conscious of time and their place in time; a number of articles begin Sixty years ago... If an event affects sixty people, it may be a significant event; a number of articles begin with phrases such as Sixty spectators... Examples are the following:

(9) Sixty per cent of adults support the automatic removal of organs for transplant from those killed in accidents unless the donor has registered an objection, according to a survey published yesterday.
(10) Sixty years ago Florida was the holiday home of the super-rich and the flamboyant.
(11) Sixty baffled teachers from 24 countries yesterday began learning how to speak Geordie as part of a three-week course run by the British Council on the banks of the Tyne.
The explanations I have given for sentences such as these, which are after all not lexical but discoursal in nature, might be thought at first sight to challenge the notion of priming, in that the choice of sixty would appear to be the product of external factors. This would however be to misunderstand the relationship being posited between lexical choice and discoursal purpose. In the first place, the text-initial priming of sixty does not extend to 60 (nor do many of its other primings – there is no association of 60 with vagueness, for example). So the choice of sixty over 60 is made simultaneously with one of the discoursal choices described above. Secondly, there is no externally driven obligation on a writer to place the phrase of which sixty is a part in sentence-initial position. In theory (rather than practice), news articles and stories could begin:
(9a) A clear majority of adults support the automatic removal of organs for transplant from those killed in accidents unless the donor has registered an objection, according to a survey published yesterday.
(10a) Florida was, six decades ago, the holiday home of the super-rich and the flamboyant.
(11a) Five dozen baffled teachers from 24 countries yesterday began learning how to speak Geordie as part of a three-week course run by the British Council on the banks of the Tyne.
Thirdly, and most importantly, I would argue that the text-initial priming of sixty for journalists and Guardian readers is the result of their having encountered numerous previous examples of sixty in this position. Consequently, Guardian readers do not expect, and journalists do not provide, articles that focus on the views of twenty-two per cent of a sample of interviewees, despite the fact that these views may be original or thought-provoking; readers and journalists do not concern themselves with what happened twenty-two years ago, even though time divisions are arbitrary as a way of talking about changes in the world and what happened twenty-two years ago is, from some perspectives, as interesting as what happened sixty years ago. The possible effects of primings on the way we view the world are perhaps matters for critical discourse analysts to consider. Even more than was the case with collocation, grammatical category, semantic association and colligation (of the non-textual kind), claims about textual colligation have to be domain- and genre-specific. The claim just made for sixty is palpably false for academic articles, for example; on the other hand, I would speculate that for the latter genre the word recent might be positively primed for text-initial position – Recent research has shown..., Recent papers... etc. For some purposes – lexicography, dictionaries of collocations, thesauri, comparable corpora – huge corpora representing a wide range of linguistic genres and styles are extremely useful. Resources like the Bank of English and the British National Corpus have huge value. But I believe, and what I am saying here and in the remainder of this paper provides grounds for believing, that homogenized corpora iron out and render invisible important generalisations – truths even – about the language they sample. For the purposes of identifying primings, specialized corpora are likely to be more productive.
Gledhill (2000) shows that no corpus is too specialized: a mini-corpus of the introductions to cancer research papers revealed distinct differences from that of results sections. Before we leave textual colligation, it is worth noting that it is sometimes the case that one priming only becomes operative when another is overridden. An instance of this phenomenon is the combination of in and an abstract noun, which has a strong negative priming for sentence-initial position. Once, though, the negative priming is overridden, a strong positive priming for paragraph-initial position becomes operative. Textual position is not the only supra-sentential feature for which lexis appears to be primed. I want to argue that lexical items are also primed for cohesion. Certain words (e.g., Blair, planet, gay, and genetic) tend to appear as part of readily cohesive chains, whereas others (e.g., elusive or wobble) form single ties at best, and rarely if ever participate in extensive chains. In order to test this claim, I took a text (The Invisible Influence of Planet X) that I had previously analysed with respect to its cohesion (Hoey 1995) and identified the four lexical items that contribute most to the cohesion of the text. I then selected four items that appear only once in the text. For each of these eight items, I examined 50 lines of a concordance, moving in each case into the text from which the line was drawn and analysing the text in terms of the cohesiveness of the item under investigation. The results of this investigation are presented in Table 2.
Table 2. Cohesive tendencies of eight lexical items from The Invisible Influence of Planet X

           Frequency in     No of instances participating   No of occurrences in single
           original text    in cohesive chains              cohesive links not forming
                            across 50 texts                 chains across 50 texts
planet     23               36%                             13%
Uranus     11               68%                             6%
Pluto      10               84%                             3%
planets    8                66%                             8%
week       1                32%                             12%
wobble     1                10%                             6%
wide       1                0%                              8%
weakest    1                0%                              2%
It will be seen that there is a close correlation between the cohesiveness (or otherwise) of the items in the Planet X text and their cohesiveness (or otherwise) across a range of texts. All four of the items forming strong cohesive chains in The Invisible Influence of Planet X participate strongly in cohesive chains in other texts, such that between 36% and 84% of instances in the concordance were participating in such chains. Three of the four words that were not cohesive in the Planet X text also never or rarely participated in cohesive chains. The exception is of course week, which is only slightly less cohesive than planet in the corpus; it is of course predictable from the statistics for the four highly cohesive items that their priming for cohesion is on occasion overridden, and this appears to be the case with week also. Obviously, the more cohesive an item is, the fewer the texts represented in the corpus (because an item that is repeated twenty times in a single text will generate twenty concordance lines), so we cannot simply read the results off the table without further investigation, but the correlation is strong for all that. I hypothesize that when we read or listen we bring our knowledge of cohesive priming to bear and attend to those items that are most likely to participate in the creation of the texture of the text. Furthermore, it is part of our knowledge of every lexical item that we know what type of cohesion is likely to be associated with it. So Blair, for example, tends to attract pro-forms – he, his, him etc. – and co-referents – the Prime Minister, the Labour Party leader – while planet tends to attract hyponyms – Mars, Venus, Pluto – and gay favours simple lexical repetition.
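The chain/single-link distinction can be roughly operationalised. The sketch below is my own heuristic, not the analysis behind Table 2 (which tracked genuine cohesive ties): it simply treats three or more occurrences of an item in one text as a chain and exactly two as a single link, and the sample passage is fabricated.

```python
import re

def cohesion_class(item, text):
    """Crudely classify an item's cohesive behaviour in one text by
    counting exact repetitions: >= 3 occurrences -> 'chain', exactly
    2 -> 'single link', otherwise 'none'.  Real cohesion analysis
    would also track pro-forms, co-reference and hyponyms.
    """
    n = len(re.findall(r"\b" + re.escape(item) + r"\b", text.lower()))
    if n >= 3:
        return "chain"
    if n == 2:
        return "single link"
    return "none"

doc = ("The planet wobbles. Astronomers watched the planet for years; "
       "the planet's wobble persisted, and the wobble stayed unexplained.")
print(cohesion_class("planet", doc))  # chain
print(cohesion_class("wobble", doc))  # single link
```

Applied to 50 concordance-derived texts per item, a classifier of this kind would yield column figures of the shape shown in Table 2, although only hand analysis can capture ties made through pronouns and co-reference.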
Grosz and Sidner (1986) and Emmott (1989, 1997) argue that cohesion is better treated as prospective rather than retrospective; the position presented here is in accordance with their view, in that encountering a named person such as Tony Blair in a discourse immediately creates in the reader/listener an expectation that the pronoun he and the co-referent the Prime Minister will follow (as well as simple repetitions of the name); Yule (1981) discusses the conditions under which one rather than the other might be chosen, and Sinclair (1993) discusses the mechanisms of prospection. In the terms presented here, we could say that Tony Blair is characteristically primed to create cohesive chains making use of some or all of pronouns, co-reference and simple lexical repetition (Hoey 1991). The claim that some items are characteristically primed to be cohesive and others are characteristically primed to avoid participation in cohesion is supported by Morley and Partington’s finding (2002) that the phrase at the heart
of is non-cohesive. They found 29 instances of the phrase, each one from a different text. Again, as so often in this paper, the observation seems obvious, but it is the obviousness of the observation that most supports my case. With the first kind of textual priming, we associated certain lexical items with certain textual positions (e.g. beginning of the sentence, beginning of the speaking turn, beginning of the paragraph). This was seen as a textual extension of colligation. The kind of textual priming we have just been examining – the cohesive priming of lexis – could likewise be seen as a textual extension of collocation, in that the characteristic cohesion of a word could be seen as an extension of “the company a word keeps”. Analogy suggests that there should be a third kind of textual priming of lexis, associated with semantic association, and preliminary investigation supports the suggestion. In addition to being primed for textual position and cohesion, lexical items are, I argue, primed for textual relations. What I mean by this is that the semantic relations that organize the texts we encounter are anticipated in the lexis that comprises these texts. So, for example, ago is typically primed to occur in contrast relations, occurring in such relations in my data 55% of the time, and discovered occurs with (or in) temporal clauses 86% of the time. The word hunt is associated with a shift within a Problem-Solution pattern (Winter 1977; Hoey 1983, 2001; Jordan 1984) or a Gap in Knowledge-Filling pattern (Hoey 2001) 60% of the time; it is also associated with a move from past to present in 67% of cases. I hypothesize that this aspect of textual priming accounts for the average reader/listener’s enormous competence at following and making sense of text in very little time.
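Proportions such as the 55% figure for ago could be estimated, very crudely, by checking whether the sentences in which a word occurs also contain an overt signal of the relevant relation. The sketch below is my own heuristic with fabricated sentences and an assumed, far-from-complete signal list; the figures in this chapter rest on hand analysis, not on surface signals alone.

```python
import re

# A small, assumed list of overt contrast signals.
CONTRAST_SIGNALS = ("but", "however", "yet", "although", "now")

def contrast_share(word, sentences):
    """Proportion of sentences containing `word` that also contain an
    overt contrast signal - a rough proxy for a contrast relation."""
    hits = [s.lower() for s in sentences
            if re.search(r"\b" + re.escape(word) + r"\b", s.lower())]
    signalled = [s for s in hits
                 if any(re.search(r"\b" + sig + r"\b", s)
                        for sig in CONTRAST_SIGNALS)]
    return len(signalled) / len(hits) if hits else 0.0

sentences = [
    "Ten years ago the firm was booming, but now it struggles.",
    "She left an hour ago.",
    "Long ago this was farmland; now it is all suburb.",
]
print(contrast_share("ago", sentences))  # two of the three carry a signal
```

Surface signals under-detect relations that are left implicit, so a heuristic of this kind gives at best a lower bound on the true proportion.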
Earlier work on textual signals (Winter 1977, Hoey 1979) only scratches the surface of the signalling that the average text supplies, if this hypothesis proves to be correct; it is possible that evoked and provoked appraisal (Martin 2000; on appraisal see also Flowerdew, this volume) and its textual reflex (Hoey 2001) are also accounted for by this feature of priming. It will be noticed that I talk of items being “typically” or “characteristically” primed. This is of course partly because priming belongs to the individual, not to the language, and so no blanket claim can be made about any word. It is also because, as noted earlier, all claims about priming are domain- and genre-specific. A claim that a particular lexical item is primed to occur text-initially or form cohesive relations is only valid for a particular narrowly-defined situation. Since my corpus overwhelmingly comprises Guardian newspaper text, the claims made above about lexical priming are true of those (kinds of) data, but carry no weight, until verified, in any other situations. While Biber et al.’s notion of
the “bundle” may be over-simplified, he and his colleagues are certainly right in saying that they occur in, and are true of, text types. I want to claim that the types of textual colligation I have been describing occur in all kinds of texts but the actualisation of these colligations varies from text type to text type. If we accept the notion that lexical items are primed for collocation, semantic association and colligation (textual or otherwise), there are two possible implications. The first is that this priming accounts for our ready ability to distinguish polysemous uses of a word. Where it can be shown that a common sense of a polysemous word favours certain collocations, semantic associations and/or colligations, the rarer sense of that word will, I would argue, avoid those collocations, semantic associations and colligations (see Hoey in press). The second implication is that in continuous text the primings of lexical items may combine. Thus the words that make up the phrase Sixty years ago today, which begins the text we considered earlier with regard to cohesion, have the primings summarized below (amongst others) in newspaper data. What we have here is colligational prosody, where the primings reinforce each other (or not), the naturalness of the phrase in large part deriving from the non-conflictual nature of the separate primings when combined. I would want to suggest that some of the work currently undertaken by grammar might be absorbed into, or superseded by, colligational prosody. Two questions naturally arise from the preceding discussion. The first is practical in nature: what are the implications of all this for the language learner? The other is theoretical: what place is left in this theory for creativity? To tackle the practical question first, if the notion of priming is correct, the role of the FL classroom is to ensure that the learner encounters the lexis in such a way that it is properly and correctly primed.
This can only be a gradual matter; nevertheless, there are grave dangers in teachers or teaching materials incorrectly priming the lexis such that the learner is blocked, sometimes permanently, from correctly priming the lexical items. Furthermore, certain learning practices must be inappropriate, such as the learning of vocabulary in lists (i.e. stripped of all its primings), while others (e.g. exposure to authentic data) are apparently endorsed. Authentic data, however, are usually inauthentically encountered in the classroom, in that they are read or heard for reasons remote from those that gave rise to the data in the first place. On the other hand, only authentic data can preserve the collocations, colligations and semantic associations of the language, and only complete texts and conversations can preserve the textual associations and colligations.
[Table: the primings of Sixty, years and ago include collocation (Sixty with years, years with ago), semantic association with NUMBER and with UNIT OF TIME, and strong colligation with Theme; today shows only a weak colligation with Theme; all four words carry a positive colligation with paragraph-initial position, when thematized, and with text-initial position, when paragraph-initial.]
To turn now to the theoretical question, there is of course ample room for the production of original utterances through semantic association, but semantic association will not by itself account either for the ability of the ordinary speaker to utter something s/he has never heard before or for the ability of the more self-conscious creative writer to produce sentences that are recognisably English but have never been encountered before. I would argue that, when speakers go along with the primings of the lexis they use, they produce utterances that seem idiomatic. This is the norm in conversation and writing. If they choose to override those primings, they produce acceptable sentences of the language that might strike one with their freshness or with their oddness but will not seem idiomatic. Crucially, though, even these sentences will conform to more primings than they override. So when Dylan Thomas, a poet famous for his highly creative (and sometimes obscure) use of language, begins one of his poems with A grief ago, he rejects the collocations and semantic associations
of sixty and years but conforms to the primings of ago, such that the phrase functions textually in similar ways to sixty years ago.
[Table: the primings of A grief ago. A: strong colligation with Theme; positive colligation with paragraph-initial position, when thematized; positive colligation with text-initial position, when paragraph-initial. grief: none of the primings of sixty or years. ago: semantic association with NUMBER; strong colligation with Theme; positive colligation with paragraph-initial position, when thematized; positive colligation with text-initial position, when paragraph-initial.]
Priming is therefore something that may be partly overridden but not completely overridden. Complete overriding would result in instances of non-language. Thus the task that Chomsky set himself of accounting for all and only the acceptable sentences of the language requires priming as (part of) its answer. Indeed, what we think of as grammar may be better regarded as a generalisation out of the multitude of primings of the vocabulary of the language; it may alternatively be usefully seen as an account of the primings of the commonest words of the language (such as the, of and is). Either way, I hope I have done enough to demonstrate that a new theory of language might need to place priming at the heart of it.¹

Note

1. Note that this sentence conforms to the priming for non-cohesion of at the heart of discussed earlier, and in that respect is idiomatic. In so far, however, as this endnote draws attention to the possibility of cohesion, it has created it and thereby demonstrated my ability to override one priming of the phrase while conforming to its other primings – an essential feature of a theory of priming.
References

Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. 1999. Longman Grammar of Spoken and Written English. Harlow: Longman.
Brazil, D. 1995. The Grammar of Speech. Oxford: Oxford University Press.
Campanelli, P. and Channell, J. M. 1994. Training: An Exploration of the Word and the Concept with an Analysis of the Implications for Survey Design. London: Employment Department.
Chomsky, N. 1957. Syntactic Structures. The Hague: Mouton.
Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge (MA): MIT Press.
Emmott, C. 1989. Reading between the lines: Building a comprehensive model of participant reference in real narrative. Ph.D. thesis, University of Birmingham.
Emmott, C. 1997. Narrative Comprehension: A Discourse Perspective. Oxford: Clarendon Press.
Farr, F. and McCarthy, M. 2002. “Expressing hypothetical meaning in context: Theory versus practice in spoken interaction”. Paper presented at the TaLC 5 Conference, Bertinoro, 26–31 July 2002.
Firth, J. R. 1957. “A synopsis of linguistic theory, 1930–1955” in Studies in Linguistic Analysis, 1–32; reprinted in Selected Papers of J. R. Firth 1952–59, F. Palmer (ed.), 168–205. London: Longman.
Gledhill, C. J. 2000. Collocations in Science Writing. Tübingen: Gunter Narr Verlag.
Goldberg, A. E. 1995. Constructions: A Construction Grammar Approach to Argument Structure. Chicago: The University of Chicago Press.
Grosz, B. J. and Sidner, C. L. 1986. “Attention, intentions, and the structure of discourse”. Computational Linguistics 12(3): 175–204.
Halliday, M. A. K. 1959. The Language of the Chinese ‘Secret History of the Mongols’. Oxford: Blackwell [Publication 17 of the Philological Society].
Halliday, M. A. K. 1967–8. “Notes on transitivity and theme in English” (Parts 1, 2 and 3). Journal of Linguistics 3.1, 3.2 and 4.2.
Halliday, M. A. K. and Hasan, R. 1976. Cohesion in English. London: Longman.
Hoey, M. 1979. Signalling in Discourse. Birmingham: English Language Research, University of Birmingham.
Hoey, M. 1983. On the Surface of Discourse. London: Allen & Unwin.
Hoey, M. 1991. Patterns of Lexis in Text. Oxford: Oxford University Press.
Hoey, M. 1995. “The lexical nature of intertextuality: A preliminary study” in Organization in Discourse: Proceedings from the Turku Conference, B. Wårvik, S.-K. Tanskanen and R. Hiltunen (eds), 73–94. Turku: University of Turku [Anglicana Turkuensia 14].
Hoey, M. 1997a. “Lexical problems for the language learner (and the hint of a textual solution)” in Proceedings of the 5th Latin American ESP Colloquium, Merida, Venezuela.
Hoey, M. 1997b. “From concordance to text structure: New uses for computer corpora” in PALC ’97: Proceedings of Practical Applications in Language Corpora Conference, B. Lewandowska-Tomaszczyk and P. J. Melia (eds), 2–23. Łódź: University of Łódź.
Hoey, M. 2001. Textual Interaction: An Introduction to Written Discourse Analysis. London: Routledge.
Hoey, M. 2003. “Why grammar is beyond belief” in Beyond: New Perspectives in Language, Literature and ELT. Special issue of Belgian Journal of English Language and Literatures, J.-P. van Noppen, C. den Tandt and I. Tudor (eds). Ghent: Academia Press.
Hoey, M. in press. Lexical Priming: A New Theory of Words and Language. London: Routledge.
Hoey, M. forthcoming. “Textual colligation – A special kind of lexical priming” in Proceedings of ICAME 2002, Göteborg, K. Aijmer and B. Altenberg (eds). Amsterdam: Rodopi.
Hunston, S. and Francis, G. 2000. Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins.
Jordan, M. P. 1984. Rhetoric of Everyday English Texts. London: Allen & Unwin.
Langendoen, T. 1968. The London School of Linguistics: A Study of the Linguistic Contributions of B. Malinowski and J. R. Firth. Cambridge (MA): MIT Press.
Louw, B. 1993. “Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies” in Text and Technology, M. Baker, G. Francis and E. Tognini-Bonelli (eds), 157–76. Amsterdam: John Benjamins.
Martin, J. R. 2000. “Beyond exchange: APPRAISAL systems in English” in Evaluation in Text: Authorial Stance and the Construction of Discourse, S. Hunston and G. Thompson (eds), 142–75. Oxford: Oxford University Press.
Morley, J. and Partington, A. 2002. “From frequency to ideology: Comparing word and cluster frequencies in political debate”. Paper presented at the TaLC 5 Conference, Bertinoro, 26–31 July 2002.
Murray, J. A. H. et al. (eds) 1884–1928. A New English Dictionary on Historical Principles (reprinted with supplement, 1933, as Oxford English Dictionary). Oxford: Oxford University Press.
Roget, P. M. 1852. Thesaurus of English Words and Phrases. Harlow: Longman.
Sinclair, J. McH. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, J. McH. 1993. “Written discourse structure” in Techniques of Description, J. McH. Sinclair, M. Hoey and G. Fox (eds), 6–31. London: Routledge.
Sinclair, J. McH. 1996. “The search for units of meaning”. Textus 9: 75–106.
Sinclair, J. McH. 1999. “The lexical item” in Contrastive Lexical Semantics, E. Weigand (ed.), 1–24. Amsterdam: John Benjamins.
Sinclair, J. McH. and Renouf, A. 1988. “Lexical syllabus for language learning” in Vocabulary and Language Teaching, R. Carter and M. McCarthy (eds), 197–206. Harlow: Longman.
Sripicharn, P. 2002. “Examining native speakers’ and learners’ investigation of the same concordance data: A proposed method of assessing the learners’ performance on concordance-based tasks”. Paper presented at the TaLC 5 Conference, Bertinoro, 26–31 July 2002.
Stubbs, M. 1995. “Corpus evidence for norms of lexical collocation” in Principle and Practice in Applied Linguistics, G. Cook and B. Seidlhofer (eds), 245–256. Oxford: Oxford University Press.
Stubbs, M. 1996. Text and Corpus Analysis. Oxford: Blackwell.
Winter, E. O. 1977. “A clause-relational approach to English texts”. Instructional Science 6: 1–92.
Yule, G. 1981. “New, current and displaced entity reference”. Lingua 55: 42–52.
Corpora by learners
Multiple comparisons of IL, L1 and TL corpora: The case of L2 acquisition of verb subcategorization patterns by Japanese learners of English

Yukio Tono
Meikai University, Japan
This study investigates the acquisition of verb subcategorization frame (SF) patterns by Japanese-speaking learners of English by examining the relative influence of factors such as the effect of first language knowledge, the amount of exposure to second language input, and the properties of inherent verb semantics on the use and misuse of verb SF patterns. To do this, three types of corpora were compiled: (a) a corpus of students’ writing, (b) a corpus of L1 Japanese, and (c) a corpus of English textbooks (i.e., one of the primary sources of input in the classroom). Ten high-frequency verbs were examined for the learners’ use of SF patterns. Log-linear analysis revealed that the overall frequency of verb SF patterns was influenced by the amount of exposure to the patterns in the textbooks, whereas error frequency was not highly correlated with it. There were strong interaction effects between error frequency and L1-related and L2-inherent factors, such as the differences in verb patterns and frequencies between English and Japanese, and verb semantics for each verb type. Multiple comparisons of IL, L1 and TL (textbook) corpora were found to be quite useful in identifying the complex nature of interlanguage development in the classroom context.
1. Introduction

Each individual language has its own way of realizing elements following a verb. Every verb is accompanied by a number of obligatory participants, usually from one to three, which express the core meaning of the event. Participants which are core elements in the meaning of an event are known as arguments. Other constituents, which are optional, are known as adjuncts. What
core elements follow a verb is accounted for by subcategorization. Different subcategories of verbs make different demands on which of their arguments must be expressed (cf. (1a)–(1c)), which can optionally be expressed (cf. (1d)), and how the expressed arguments are encoded grammatically – that is, as subjects, objects or oblique objects (objects of prepositions or oblique cases). For example, as in (1a), the verb dine is an intransitive verb and takes only one argument (i.e., a subject), while verbs such as eat or put can take two or three arguments respectively (see (1b) and (1c)).

(1) a) Mary dined. / *Mary dined the hamburger. [1 ARG]
    b) Mary ate. / Mary ate the hamburger. [2 ARG]
    c) *Mary put. / *Mary put something. / Mary put something somewhere. [3 ARG]
    d) Tom buttered the toast with a fish-knife. [optional]
In this paper, I will present a study which investigates the acquisition of verb subcategorization frame (SF) patterns by Japanese-speaking learners of English. For this study, I compiled three different types of corpora: Interlanguage (IL), L1 and Target Language (TL). For IL corpora, students’ free compositions were used, whilst newspaper texts and EFL textbooks were assembled for L1 and TL corpora respectively. I will discuss the rationale of using textbooks as TL corpora in more detail below. By conducting multiple comparisons of the three corpora, I examined how different factors such as the effect of L1 knowledge, the amount of exposure to L2 input, and the properties of inherent verb meanings in L2 affect the acquisition of verb SF patterns.

The acquisition of SF patterns is often associated with the broader issue of the acquisition of argument structure (Pinker 1984, 1987, 1989). The development of argument structure can be influenced by several factors. Four main factors (verb semantics, learning stage, L1 knowledge, and L2 input) were selected and the relationship of these factors to the use/misuse of argument structure was investigated. An L1 corpus was used to define the influence of verb SF patterns in L1, while ELT textbook corpora were used for determining the degree of exposure to certain SF patterns in the classroom. Based on the data from these corpora, I compared the SF patterns of a group of high-frequency verbs in the Japanese EFL Learner (JEFLL) Corpus.
2. Factors affecting the acquisition of SF patterns

2.1 Views from L1 acquisition research

There are competing theories seeking to explain the acquisition of argument structure in L1 acquisition. The major issue is how to explain children’s initial acquisition of argument structure. Do they learn the argument structure patterns from the meaning of verbs they initially acquire, or do they acquire the structure first and then move on to the acquisition of verb meanings? The two bootstrapping hypotheses, semantic and syntactic, claim that the acquisition of argument structure is bootstrapped by first acquiring either semantic or syntactic properties of the verbs. Pinker (1987) is keen to identify what happens at the very first stage of syntax acquisition, while Gleitman (1990) states the hypothesis in such a way that it applies not only to the initial stage but to the entire process of acquisition. As Grimshaw (1994) argues, however, these two hypotheses could complement each other, once the initial state issue is solved. Despite the difference in the view of how the acquisition of argument structure starts, Pinker and Gleitman both agree that knowledge of the relationship between a verb’s semantics and its morpho-syntax is guided in part by Universal Grammar (UG) (cf. Chomsky 1986), because adult grammars go beyond the input available. According to Goldberg (1999), on the other hand, it is the construction itself which carries the meaning. Although verbs and associated argument structures are initially learned on an item-by-item basis, increased vocabulary leads to categorization and generalization. “Light” verbs, due to the fact that they are introduced at a very early stage and are highly frequent, act as a centre of gravity, forming the prototype of the semantic category associated with the formal pattern. The perspective which Goldberg and other construction grammarians have taken on children’s grammar learning is fundamentally that of “general” nativism.
They reject the claim of “special” nativism in its particular guise of UG, but they still assume other, innate, aspects of human cognitive functioning accounting for language acquisition. As a matter of fact, this position is increasingly widely supported nowadays within more general cognitive approaches, including so-called emergentism (Elman et al. 1996; MacWhinney 1999), cognitive linguistics (Langacker 1987, 1991; Ungerer and Schmid 1996) and constructivist child language research (Slobin 1997; Tomasello 1992).
One of the purposes of this study is to determine the relative effect of L1 knowledge, classroom input, developmental factors and inherent verb semantics on the use/misuse and overuse/underuse of SF patterns by Japanese learners of English. It should be noted that the study does not need to call on a specific acquisition theory at this stage. Rather, this corpus-based study should shed light on the nature of IL development by weighting the factors which are possibly relevant to the acquisition of argument structure. This will help to evaluate the validity and plausibility of the claims made in L1 acquisition research in the light of SLA theory construction. For instance, if the study shows a strong effect of the frequencies of verbs used in the ELT textbooks on the use of particular SF patterns, then the results may indicate that L2 acquisition can be better explained by a theory that attaches more importance to the frequency of the items to be acquired in the input. From this viewpoint, Goldberg’s theory is more plausible. On the other hand, if the effect of verb semantics is highly significant, one may be inclined to agree with the theory that emphasises the semantic properties of verbs as the driving force for the acquisition of argument structure. Hence one would be more likely to adopt the theoretical framework of semantic bootstrapping theory proposed by Pinker (see 1 above). This study has the potential, therefore, to tease out possible factors affecting L2 acquisition in the light of L1 acquisition theories, making observations on L1, TL, and IL corpus data while controlling all those selected factors, and finally giving each factor a weighting according to the results of the corpus analysis. This weighting of the factors relevant to L2 acquisition will then contribute to decision-making about which L1 acquisition theory is more plausible.
2.2 Views from L2 acquisition research

Whilst a vast literature exists on the L1 acquisition of semantics-syntax correspondences, second language acquisition of verb semantics and morpho-syntax only really attracted detailed attention in the 1990s. The major issues in L2 acquisition of argument structure are: (1) whether or not L1 effects are strong in this area, (2) whether there is any evidence of universal patterns of development, and (3) the role of input in the acquisition of argument structure. In previous SLA studies, L1 effects appear strong in the acquisition of argument structure; SFs in particular are a case in point. Recently, there
has been much investigation of the proposal that the SF requirements of a lexical item might be predictable from its meaning (Levin 1993:12). The issue here is whether such lexical knowledge in L1 or in UG will affect L2 acquisition. This is usually investigated through the study of the acquisition of diathesis alternations1 – alternations in the expression of arguments, sometimes accompanied by changes of meaning – that verbs may participate in. In the case of dative alternations (White 1987, 1991; Bley-Vroman and Yoshinaga 1992; Sawyer 1996; Inagaki 1997; Montrul 1998), most evidence seems to indicate that the initial hypothesis regarding syntactic frames is based on the L1. Studies on the locative alternations (Juffs 1996; Thepsura 1998, cited in Juffs 2000) indicate that there is a difference in the way a hypothesis is formed by learners at different proficiency levels. While beginning learners start off with a wider grammar for non-alternating locative verbs, very advanced learners end up with a narrower grammar (Juffs 1996). There are several studies (Zobl 1989; Hirakawa 1995; Oshita 1997) that indicate an L1 transfer effect on transitivity alternations and the unergative/unaccusative distinction. To recapitulate, L1 effects appear strong in this area of grammar. Based on their L1, learners transfer and overgeneralize in the dative and the locative alternations. They also show a preference for morphology for inchoatives. Consequently, learners are helped if their L1 has certain features which are also in the L2. Advanced learners, however, seem able to recover from overgeneralization errors in some instances by acquiring narrow conflation classes which are not in their L1. Thus there seems to be an interaction effect between L1 influence and proficiency levels. In spite of studies showing L1 effects, there is some evidence of universal patterns of development.
Learners from a variety of backgrounds seem to use passive morphology for NP movement in English L2 with pure unaccusatives (Yip 1994; Oshita 1997). English-speaking learners of Spanish seem to use se selectively for the same purpose even when it is not required with unaccusative verbs (Toth 1997). Montrul (1998) found evidence which indicates that L2 learners have an initial hypothesis that all verbs can have a default transitive template, allowing an SVO structure in English even with pure unaccusatives and unergatives. Hence, learners seem to overgeneralize causativity in root morphemes much as children acquiring their first language do. There are not many studies on the role of input in the acquisition of verb meaning and the way such knowledge relates to syntax. Inagaki (1997) argues that the fact that the double-object datives containing “tell”-class verbs were
more frequent in the input than those containing “throw”-class verbs explains why the Japanese learners distinguished the tell verbs more clearly than the throw verbs. The fact that the English native speakers made a stronger distinction between the tell/whisper verbs than between the throw/push verbs is also consistent with the assumption that the double-object datives containing the tell verbs were more frequent in the input than those containing the throw verbs (ibid.:660). Unfortunately, measuring the frequency in L2 input is difficult since so few analyses of input corpora for L2 learners exist (Juffs 2000:202).
3. JEFLL Corpus and the multiple comparison approach

The JEFLL Corpus project aims to compile a corpus of Japanese EFL text produced by learners from Year 7 to university levels. The strength of the JEFLL Corpus is that it contains L1 and TL corpora as an integral part of its design. As was shown in the last section, very few studies have made use of both attested L2 learner data and L1/TL data to identify features of interlanguage development, let alone a corpus-based analysis of these data. Most learner corpus studies to date have made use of NS corpora because the studies are typically focused on learning English, and many native English corpora are readily available as a standard reference, whereas very few studies (except for JEFLL and PELCRA, see Leńko-Szymańska, this volume) collect parallel L1 corpora for comparison. Table 1 shows the overall structure of the JEFLL Corpus. The total size of the L2 corpus is approximately 500,000 running words of written texts and 50,000 words of orthographically transcribed spoken data. The L1 corpus consists of a corpus of Japanese newspaper texts (approximately 11 million words) plus a corpus of student compositions written in Japanese. The L1 Japanese-language essays were written on the same topics as the ones used for the L2 English composition classes. The third part of the JEFLL Corpus comprises the TL corpus. It is a corpus of EFL textbooks covering both junior and senior high school textbooks. The junior high school textbooks are the ones used officially at every junior high school in Japan. There are seven competing publishers producing such textbooks. Irrespective of which publisher one chooses, each publishes three books corresponding to the three recognized proficiency grades for years 7–9.
Table 1. The JEFLL Corpus project: Overall structure

Part 1: L2 learner corpora
– Written corpus (composition): ~500,000 words
– Spoken corpus (picture description): ~50,000 words

Part 2: L1 corpora
– Japanese written corpus (composition): ~50,000 words, same tasks as in relevant L2 corpus
– Japanese newspaper corpus: ~11,000,000 words

Part 3: TL corpus
– EFL textbook corpus: ~650,000 running words (Y7–9: 150,000; Y10–12: 500,000)
Senior high school textbooks are more diversified, and more than 50 titles have been published. This corpus contains mainly the textbooks for English I and II (general English). I would argue that textbook English is a useful target corpus to use in the study of learner language. As this claim runs counter to that of other researchers (e.g., Ljung 1990; Mindt 1997), it is important to examine the basis for this claim in some detail. Firstly, the target language which learners are measured by should reflect the learning environment of learners. It is not always appropriate to use a general corpus such as the BNC or the Bank of English to make comparisons with non-native-speaker corpora. The differences you will find between L2 corpora and such general corpora will be those between learner English and the English produced by professional native-speaker writers. Such a comparison may be meaningful in the case of highly advanced learners of English or professional non-native translators. The output of such highly advanced learners, however, is something which the vast majority of L2 learners in Japan never aspire to. We have to consider very seriously what the target norm should be for the learners we have in mind. In the present case, it is certainly not the language of the BNC that the Japanese learners of English are aiming at, but, rather, a modified English which represents what they are more exposed to in EFL settings in Japan. I am fully aware of the fact that the type of language used in ELT textbooks may be unnatural in comparison to actual native speaker usage (see, for instance, Ljung 1990, 1991 and Römer, this volume). Pedagogically, however, beginning- or intermediate-level texts are designed to contain a level and form of English which can facilitate learning.
In spite of all their peculiarities in comparison with L1 corpora, these textbooks represent the primary source of input for L2 learners in Japan, and as such their use in explaining and assessing L2 attainment is surely crucial.
The ELT textbook is the primary source of English language input for learners in Japan. Inside the classroom, some teachers will use classroom English, and others will not use English at all as a medium of instruction. Even if they do use English in the classroom, they usually limit their expressions to the structures and vocabulary that have previously appeared in the textbook. Outside the classroom, those who go to “cram” schools – private schools where students study after school to prepare for high school or university entrance examinations – will receive extra input, but this input is comprised of questions borrowed from past entrance exams, or questions based on the contents of the textbooks (Rohlen 1983). Hence, it is fair to say that the English used in ELT textbooks is the target for most learners of English in Japan. If we exclude textbooks from our investigation, explaining the differences between TL and IL usage may be impossible. However, where textbooks are included in an exploration of L2 learning, they can explain differences between NS and NNS usage (McEnery and Kifle 1998). While the above argument presents the basis for the inclusion of textbooks in my model for the study of learner language, more evidence is required to substantiate this claim. This will be provided below, as part of the description of some of my research results, where the textbook corpus will be called upon to provide an explanation for differences between IL and TL. For the moment I will take the argument presented so far as sufficient evidence to warrant the inclusion of textbook material in my learner corpus exploitation model. My proposal, therefore, is that standard reference (e.g., the BNC), textbook and learner corpora all have roles to play in a fuller and proper exploration of learner language, a method which we may refer to as the “multimethod comparison” approach. Figure 1 illustrates this point diagrammatically.
Figure 1. Multiple comparison of L1, TL and IL corpora

“IL1 ↔ ILx” in Figure 1 refers to the different subcorpora into which L2 learner texts may be divided according to academic year. These IL-IL comparisons can be of several different types, depending on the learner variables. For instance, if the independent variable (i.e., the variable that you manipulate) is age or the academic year of the learners, with all other variables constant, one can make a comparison of different IL corpora from different age groups. In ICLE (International Corpus of Learner English, Granger et al. 2002), on the other hand, the age (or proficiency level) factor is held constant, and research using ICLE centres around the IL characteristics of different L1 groups. A comparison between L2 corpora and TL corpora can also be made (see (B) in Figure 1). One can use either a general standard corpus such as the British National Corpus to look at differences in, for example, lexicogrammar between native speakers and L2 learners, or use a more comparable corpus of native-speaker texts, e.g., LOCNESS (Louvain Corpus of Native English Essays)2 in ICLE, to compare like with like. We can refer to this type of comparison as IL-TL comparison. TL corpora may be compared with L1 corpora (TL-L1 comparison, cf. (C) in Figure 1) in order to describe the target adult grammar system and identify potential causes of L1 transfer. This analysis should be combined with L2 corpus analysis. TL-L1 comparison could provide significant information on the influence of the source language on the acquisition of the target language. A fourth type of comparison is that between IL corpora and L1 mother tongue corpora (L1-IL comparison, cf. (D) in Figure 1). L1 corpora can provide information on features of the L2 learners’ native language, which can help us understand potential sources of L1-related errors or overuse/underuse phenomena. Despite the sophistication of recent error taxonomies, it is rather difficult to distinguish interlingual errors from intralingual ones, unless some empirical data are available on the pattern of a particular linguistic feature in both languages. L1-IL comparisons provide fundamental data in this area. Table 2 summarises each comparison type.
Table 2. Multiple comparison approach

IL-IL comparison: Comparisons between different stages of ILs, or ILs by learners with different L1 backgrounds.
IL-TL comparison: Comparisons between learner corpora and target language corpora (i.e., ELT textbook corpora in the present study, or general native corpora).
TL-L1 comparison: Comparisons between target language corpora and L1 mother tongue corpora (to identify potential causes of L1 transfer).
L1-IL comparison: Comparisons between L1 corpora and learner corpora (to identify L1-related errors or overuse/underuse phenomena).
IL-L1-TL comparison: Combination of the above comparisons (to identify the complex relationship between IL, L1 and TL corpora on L2 learners’ error patterns or overuse/underuse phenomena).
4. The relationship between factors and corpora used

Table 3 shows the factors to be examined in this study and how corpus data can supply the relevant information. It is only through multiple comparisons of L1, TL, and IL corpora that such issues can be fully addressed. Note that the primary purpose of this study is not to identify the role of specific UG constraints in L2 acquisition. Rather, the study aims to capture the cause-effect relationships among those variables and to identify their relative effects on the acquisition of argument structure in L2 English.
Table 3. The relationship between the factors in this study and types of information from different corpora

The L1 effects: Frequency of similar/different argument structure properties in L1 corpus
The L2 input: Frequency of subcategorization patterns in ELT textbook corpus
Developmental stages: Frequency of use/misuse of subcategorization patterns from the developmental IL corpus
The L2 internal effects: Frequency of different verb classes and alternations from the IL corpus
5. Research design

5.1 Research questions

This study has the following research questions:

1. Which of the following variables affect L2 acquisition of argument structure (most)?
   • The L1 effects
   • The L2 input effects
   • The L2 internal effects
   • The developmental effects
2. Are there any interaction effects between the variables? If so, what are they?

Clarifying these questions will contribute to current SLA research, especially in terms of the possible role of L1 knowledge, L2 classroom input, and verb semantics-syntax correspondences in the acquisition of argument structure.
5.2 Variables and operational definitions

Each variable is operationally defined as follows:

1. L1 effects: L1 effects were examined with respect to two aspects of the degree of similarity in SF patterns between English and Japanese: (a) the degree of SF matching, and (b) the frequencies of similar SF patterns in the L1 Japanese corpus and the COMLEX Lexicon (TL).
2. L2 input effects: L2 input effects were defined in terms of the frequencies of the given SF patterns in the L2 textbook corpus.
3. L2 internal effects: These characteristics pertain to the English verb system. For differences in verb classes and alternation types I follow Levin’s (1993) classification.
4. Developmental effects: Developmental effects were simply measured in relation to the three groups of subjects categorized by their school years (Years 7–8; 9–10; 11–12).
5.3 Extraction of SF patterns

For this study, I parsed the learner and textbook corpora using the Apple Pie Parser (APP), a statistical parser developed by Satoshi Sekine at New York University (see Sekine 1998 for details). The accuracy rate of the APP is approximately 70%, hence it was not very efficient to extract SF patterns automatically using the APP alone. Consequently, after running the parser over the corpus, I exported concordance lines of verbs with the automatically assigned syntactic information into a spreadsheet program and then categorized them into SFs using pattern matching. This proved to be an efficient means of studying verb SFs. The Comlex Lexicon (Macleod et al. 1996; Grishman et al. 1994) was also referred to for frequency information relating to some subcategorization frames in the TL corpus. The Comlex Lexicon itself does not provide complete frequency data for all SF patterns. However, it has frequency information for the subcategorization frames of the first 100 verbs appearing in the Brown Corpus. I calculated the percentages of each SF pattern in the Comlex database and used the information to supplement the data from the textbook corpus. For the L1 corpus, a Japanese morphological analyser, ChaSen (Matsumoto et al. 2000), was used for tokenization and morphological analysis, and the frequencies of SF patterns were detected by using pattern matching. SF extraction was done after extracting all the instances of a particular verb under study, and thus manual post-editing was also possible.
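The pattern-matching step can be sketched as follows. This is a minimal illustration only: the simplified tag strings, the SF labels and the rules below are hypothetical, and the study itself worked over Apple Pie Parser output exported to a spreadsheet.

```python
# Sketch: mapping a verb's parsed complement sequence to an SF label.
# Tag notation and labels are hypothetical simplifications for illustration.
import re

# Ordered (pattern, label) rules; more specific frames are tried first
SF_RULES = [
    (re.compile(r"^V NP NP$"), "NP-NP"),    # e.g. "give him a book"
    (re.compile(r"^V NP PP$"), "NP-PP"),    # e.g. "put it on the table"
    (re.compile(r"^V NP$"), "NP"),          # e.g. "eat the hamburger"
    (re.compile(r"^V$"), "INTRANS"),        # e.g. "Mary dined"
]

def classify(tag_sequence):
    """Return the SF label for a verb's parsed complement sequence."""
    for pattern, label in SF_RULES:
        if pattern.match(tag_sequence):
            return label
    return "OTHER"  # left for manual post-editing

print(classify("V NP PP"))  # NP-PP
```

Because APP accuracy was only around 70%, such rules leave an "OTHER" residue that has to be checked by hand, as described above.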
5.4 Categorization of verb classes

The verb classification in Levin (1993) was used to categorize verbs into groups with similar meanings. Levin classifies verbs into two major categories: (a) those which undergo diathesis alternations and (b) those which form semantically coherent verb classes. While Levin’s classification is very important for the study of lexical knowledge in the human mind, it should also be noted that her study is not concerned with the actual usage of those verb classes. Out of the 49 verb classes Levin created, only 22 classes were found in the top 40 most frequent verbs in the BNC. An important fact to note, therefore, is that a small number of categories which meet essential communication needs (e.g., “communication”, “motion”, and “change of possession”) predominate in actual verb usage. The input thus consists of only a handful of highly
frequent verb classes, with the rest of the classes being rather infrequent. The information on Japanese SFs was obtained from the IPAL Electronic Dictionary Project.3 After making a matching database of corresponding verbs in English and Japanese, the frequency information of English SFs was extracted from the Comlex Lexicon. SFs were also extracted from the ELT textbook corpus for TL (English) and from the Japanese corpus I made for L1 Japanese. The next step in the study involved a statistical analysis of these data, taking the various influences into account. Log-linear analysis was the method employed, and the next section gives a summary of the procedure.
5.5 Log-linear analysis

The objective of log-linear analysis is to find the model that gives the most parsimonious description of the data. For each of the different models, the expected cell frequencies are compared to the observed frequencies. A chi-square test can then be used to determine whether the difference between expected and observed cell frequencies is acceptable under an assumption of independence of the various factors. The least economical model, the one that contains the maximal number of effects, is the saturated model; it will by definition yield a “perfect” fit between the expected and observed frequencies. The associated χ2 is zero. In this study, the procedure called backward deletion was employed. This begins with the saturated model; effects are then successively left out of the model, and it is checked whether the value of χ2 of the more parsimonious model passes the critical level. When this happens, the effect that was left out last is deemed essential to the model and should be included. Several statistical packages contain procedures for carrying out a log-linear analysis on contingency tables, e.g., SPSS, STATISTICA, SAS. In this study, STATISTICA was the main program used for model testing.
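The comparison of expected and observed cell frequencies that underlies this model testing can be illustrated with a small worked example. The Python function is a sketch and the counts are invented for illustration; the analyses reported here were run in STATISTICA.

```python
# Sketch of the core computation behind the chi-square test: expected cell
# frequencies under an independence model are compared with observed counts.
# The table below is invented for illustration only.

def chi_square_independence(table):
    """Pearson chi-square statistic and df for a two-way contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: hypothetical textbook-input band (HIGH/MID/LOW);
# columns: erroneous vs. correct SF use
observed = [[12, 88], [25, 75], [40, 60]]
chi2, df = chi_square_independence(observed)
print(f"chi2 = {chi2:.2f}, df = {df}")
```

A large χ2 relative to the critical value for the given df would mean the independence model fits poorly, so the corresponding interaction term would be retained in the log-linear model.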
5.6 Subcategorization frame database

For each high-frequency verb, the following information was gathered and put into the database format:

• Parsed example sentences containing the target verb
• School year categories (year 7–8; 9–10; 11–12)
• Verb name
• Verb class
• Verb meaning
• Alternation type
• SF for each example
• Frequency of SF in COMLEX Lexicon
• TL frequency of the given SF (i.e., textbook corpora)
• Learner errors
• Parsing errors
• Japanese verb equivalents
• L1 frequency of the equivalent SF (i.e., Japanese corpus)
These data were collected for each of the high-frequency verbs and exported to the statistical software used for further analysis. In order to process the data by log-linear analysis, the frequencies of TL and L1 were converted into categorical data ([HIGH]/ [MID]/ [LOW]). In order to study the acquisition of argument structure, ten verbs were selected for the analysis (bring, buy, eat, get, go, like, make, take, think, and want). While it would have been desirable to cover as many verbs as possible from different verb classes, it should be noted that the frequencies of SF patterns become extremely small if low-frequency verbs are included. Only the ten most frequent verbs in the data were therefore selected for investigation, since these allowed a sufficient number of observations to be made for each verb. Even though they are frequent, be and have were excluded from the analysis because their status as lexical verbs is very different from that of other verbs. Due to limitations of space, I cannot go into the details of the SF patterns, but interested readers may consult Tono (2002).
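The conversion of the TL and L1 frequencies into the categorical values [HIGH]/[MID]/[LOW] can be sketched as follows. The chapter does not specify the exact cut-off procedure, so the tertile-based split and the frequencies used here are assumptions for illustration only:

```python
# Hedged sketch: mapping raw SF frequencies onto the three categories
# LOW / MID / HIGH. Tertile cut-offs and the sample frequencies are
# hypothetical, not taken from the study.
def to_category(freq, low_cut, high_cut):
    """Map a raw frequency to LOW / MID / HIGH given two cut-offs."""
    if freq < low_cut:
        return "LOW"
    if freq < high_cut:
        return "MID"
    return "HIGH"

freqs = [3, 12, 48, 7, 91, 25]
sorted_f = sorted(freqs)
low_cut = sorted_f[len(freqs) // 3]        # boundary of the lowest third
high_cut = sorted_f[2 * len(freqs) // 3]   # boundary of the highest third
labels = [to_category(f, low_cut, high_cut) for f in freqs]
print(labels)
```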
6. Results

6.1 The results of log-linear analysis for individual verbs

Using log-linear analysis, I tested various models using combinations of the six factors in Table 4. The results of the log-linear analysis of each individual verb revealed quite an interesting picture of the relationship between learner errors and the chosen
Table 4. Factors investigated in the study

– L2 learners' developmental factor (Factor 1): 3 levels: Year 7–8/ Year 9–10/ Year 11–12
– Subcategorization matching between L1 and L2 (Factor 2): 2 levels: Matched/ Unmatched
– Subcategorization frequencies of each SF pattern in COMLEX (Factor 3): 3 levels: High/ Mid/ Low
– Subcategorization frequencies of each SF pattern in L1 Japanese Corpus (Factor 4): 3 levels: High/ Mid/ Low
– Subcategorization frequencies of each SF pattern in Textbook Corpus (Factor 5): 3 levels: High/ Mid/ Low
– L2 learner errors (Factor 6): 2 levels: Error/ Non-error
factors. Here let me summarise the results by putting all the best-fitting models together in a table (see Table 5) and examining which factor exerts the most influence on learner performance across the ten verbs. In order to analyse the interactions, graphical interpretations of higher-dimensional log-linear models are sometimes used (e.g., McEnery 1995; Kennedy 1992). However, as I am dealing with six-dimensional models here, attempting to interpret them using graphical models would be extremely complicated. Also, my primary aim is not to interpret individual cases but to capture the overall picture of how factors are related across different verbs. Consequently, I will not interpret the models visually, but simply provide an outline of the main results.
6.1.1 Distinctive effects of the school year

Table 5 shows that the school year factor (YEAR) has a very strong effect across all of the verbs. For five out of the ten verbs (buy, get, go, make, and think), the main effect of YEAR was observed. The YEAR effect also has two-way interactions with the factor of text frequency (TEXTFRQ) for four verbs (bring, like, take, want) and with the learner error/non-error factor (LERR) for the verb get. This shows that the number of years of schooling influences the way L2 learners use the verbs. It involves both the use/misuse and the overuse/underuse of verbs.
60
Yukio Tono
Table 5. Summary of log-linear analysis

Rows: the ten verbs (bring, buy, eat, get, go, like, make, take, think, want)
Columns: Factor 1 (YEAR), Factor 2 (SUBMATCH), Factor 3 (COMLEX), Factor 4 (L1FRQ), Factor 5 (TEXTFRQ), Factor 6 (LERR)

[The individual cell entries of Table 5 are not recoverable from this copy.]

Note: The numbers correspond to the factors described in Table 4. A single underlined number (e.g. 1) is used for the main effect, two (e.g. 51) for the two-way interaction effect, and three (e.g. 642, 532) for the three-way effects.
6.1.2 Strong effects of the SF frequencies in the textbook corpus

We can also see from the summary table that there are strong two-way effects between YEAR and TEXTFRQ. Note that there is only one case (652 for the verb like) of interaction of the textbook frequency factor (Factor 5) with the learner error factor (Factor 6). This implies that SF frequencies in the textbooks mainly affect the overuse/underuse of the verbs, not the use/misuse.
6.1.3 SF similarities and frequencies in L1 and TL

Factors such as the degree of similarity in SF patterns between English and Japanese (SUBMATCH: Factor 2), the frequency in the COMLEX lexicon (Factor 3), and the frequency of SF patterns in L1 Japanese (L1FRQ: Factor 4) appear many times with the learner error factor (LERR: Factor 6). These factors are different from the school year and textbook frequency factors, as they represent more inherent linguistic features of the verbs and L1 effects. None of these effects is very strong on its own, however, since none of them survived backward deletion as a one-way or two-way effect. It seems that only the interactions of these factors affect learners' use/misuse of the verbs.
6.2 The effects of verb classes and alternation types

In order to analyse the relationship between verb classes/alternation types and the results of the above log-linear analysis, I used correspondence analysis (for more details, see Tono 2002). Instead of looking at each verb, I labelled each verb with its verb semantic classes and alternation types. I then gave scores to each factor according to the significance of its effects as shown in Table 5; for instance, if a certain factor has a main (one-way) effect, which is the strongest, I gave it 10 points; if it is involved in a two-way interaction, I gave 5 points to each of the factors involved. Only 1 point was given for each of the factors involved in three-way effects. In this way, I quantified each of the effects in the best model for each verb in Table 5 and used correspondence analysis to see the relationship between the six factors and verb classes and alternation types.

Figure 2 shows the results of the re-classification of the effects found by log-linear analysis for each verb according to verb alternation types. Correspondence analysis plots the variables based on the total chi-square values (i.e., inertia): the more the variables cluster together, the stronger the relationship. Dimension 1 explains 71% of the inertia, so we should mainly consider Dimension 1 as the primary source of interpretation. The figure shows clearly that there are three major groups of effects: the factor of SF patterns in the textbook corpus (TEXTFRQ) in the left corner, three effects (SF frequencies in the L1 corpus, the degree of matching between English and Japanese SFs, and the SF frequencies in COMLEX) in the centre, and the learner error effect and the school year effect toward the right side. As was discussed above, the school year represents the developmental aspect of verb learning, while the three factors in the middle represent linguistic features of each verb, and the textbook frequency represents L2 input effects.
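The scoring scheme described above can be sketched as follows. The effect codes mirror the notation of Table 5 (each digit names one factor from Table 4), but the particular effects passed in are invented for illustration:

```python
# Hedged sketch of the scoring scheme: 10 points for a main (one-way)
# effect, 5 points to each factor in a two-way interaction, 1 point to
# each factor in a three-way effect. The effect codes below ("1", "51",
# "642") are illustrative, not the actual best-fitting models.
from collections import Counter

def score_effects(effects):
    """effects: list of digit strings, one per retained effect."""
    scores = Counter()
    points = {1: 10, 2: 5, 3: 1}   # points per factor, by effect order
    for effect in effects:
        for factor in effect:       # each digit names one factor
            scores[factor] += points[len(effect)]
    return dict(scores)

print(score_effects(["1", "51", "642"]))
```

Summing these per-factor scores over all ten verbs yields the quantified effect profile that is fed into the correspondence analysis.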
There is a tendency for verbs involving benefactive alternations (buy, get, make, and take), sum of money alternations (buy, get, and make), and there insertions (go) to cluster around the school year factor and the error factor.
Figure 2. Correspondence analysis (alternations x effects)
Thus these verb alternation classes seem to be sensitive to the developmental factor of acquisition. Dative (bring, make, take, think, and want), locative (take, go) and as alternations (make, take and think) cluster around inherent linguistic factors such as the degree of SF matching and SF frequencies in L1 and TL. The verbs involving resultative alternations (bring and take) cluster around the textbook SF frequencies factor. Post-attributive and blame alternations are both features of the verbs like and want. These two alternation types also cluster together close to the textbook frequency effect. These are the verbs showing a strong relationship with L2 input effects. There is only one alternation type that did not cluster with any other group: ingestion (eat). The verb eat was very frequent in the learner data and was thus included in the analysis. However, it turned out that there were neither very many errors nor many varieties of alternations for this verb. The results for eat thus look very different from those for the other nine verbs.
7. Implications and conclusions

In this paper, I have discussed some initial findings concerning the developmental effect of schooling, L1 effects, L2 input effects and L2 internal effects
(i.e., verb classes and alternations) on the overall use of a small number of very frequent verbs. I hope to have given an idea of the potential of a multiple comparison approach using IL, L1 and TL corpora for the study of classroom SLA. This study shows that it is valuable to compile corpora which represent the different types of texts L2 learners are exposed to or produce, and to compare them in different ways to identify the relative strength of the factors involved in classroom SLA. In particular, the method of comparing interlanguage corpora assembled according to developmental stage, together with the subjects' L1 corpus and a TL textbook corpus, seems quite promising for identifying the complex nature of interlanguage development in L2 classroom settings. As regards L2 acquisition of verb SF patterns, the results show that the learners' correct use of verb SF patterns seemed to have little to do with the time spent on learning. Learners used more often those verbs which they encountered more often in the textbooks, which is rather unsurprising. What is surprising is the fact that there was no significant relationship between learners' correct use of those verbs and the frequency of those verbs in the textbooks. In other words, they continue to make errors related to the SF patterns of certain verbs even though the frequencies of those verbs are relatively high in the textbooks. The study also reveals that the misuse of those verb patterns is mainly caused by factors which are inherent in L2 verb meanings and their similarities to and differences from L1 counterparts. There is a tendency for certain alternation types to be more closely related to certain effects. For instance, benefactive alternations are linked more strongly to the developmental factor, while dative and locative alternations are related more closely to L1 effects.
Given that most SLA studies so far have only provided very fragmented pictures of different alternation types, it is beyond the scope of this study to determine the reason for such associations. To date, no SLA research has been conducted to identify the relative difficulties of different verb classes and alternations. This study does so. However, the theoretical implications arising from this study remain a moot point until further research in this area is undertaken. Future studies of SLA will also require a large and varied body of L2 learner corpora. As we work together with researchers in natural language processing (NLP), there is the possibility that we will be able to develop a computational model of L2 acquisition. Machine learning techniques will facilitate the testing of prototypical acquisition models and the collection of probabilistic information on IL using corpora. Computational analyses of IL data will shed light on the process of IL development in a way we never thought possible. For this to happen, well-balanced, representative corpora of L2 learner output, along with appropriate TL and L1 corpora, are indispensable.
Notes

1. Here by alternation I mean “argument-structure” alternation, such as the dative alternation (e.g. John gave a book to Mary/John gave Mary a book) and the causative/inchoative alternation (He opened the door/The door opened), among others.

2. http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/locness1.htm (visited 10.5.2004)

3. IPAL is a machine-readable Japanese dictionary. For more details see http://www.ipa.go.jp/STC/NIHONGO/IPAL/ipal.html (visited 1.3.2004).
References

Bley-Vroman, R. and Yoshinaga, N. 1992. “Broad and narrow constraints on the English dative alternation: Some fundamental differences between native speakers and foreign language learners”. University of Hawai’i Working Papers in ESL 11:157–199. University of Hawaii at Manoa.
Chomsky, N. 1986. Knowledge of Language: Its Nature, Origin and Use. New York: Praeger.
Elman, J.L., E. Bates, M. Johnson, A. Karmiloff-Smith, D. Parisi and K. Plunkett 1996. Rethinking Innateness: A Connectionist Perspective on Development. Cambridge, MA: A Bradford Book.
Gleitman, L. 1990. “The structural sources of verb meaning”. Language Acquisition 1:3–55.
Goldberg, A. 1999. “The emergence of the semantics of argument structure constructions”. In B. MacWhinney (ed.), 197–212.
Granger, S., Dagneaux, E. and Meunier, F. (eds). 2002. The International Corpus of Learner English. Handbook and CD-ROM. Version 1.1. Louvain-la-Neuve: Presses Universitaires de Louvain.
Grimshaw, J. 1994. “Lexical reconciliation”. In The Acquisition of the Lexicon, L. Gleitman and B. Landau (eds), 411–430. Cambridge, MA: MIT Press.
Grishman, R., C. Macleod and A. Meyers 1994. “Comlex syntax: Building a computational lexicon”. Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, August 1994.
Hirakawa, M. 1995. “L2 acquisition of English unaccusative constructions”. In Proceedings
of the 19th Boston University Conference on Language Development 1, D. MacClaughlin and S. McEwen (eds), 291–302. Somerville, MA: Cascadilla Press.
Inagaki, S. 1997. “Japanese and Chinese learners’ acquisition of the narrow-range rules for the dative alternation in English”. Language Learning 47:637–669.
Juffs, A. 1996. Learnability and the Lexicon: Theories and Second Language Acquisition Research. Amsterdam: John Benjamins.
Juffs, A. 2000. “An overview of the second language acquisition of links between verb semantics and morpho-syntax”. In Second Language Acquisition and Linguistic Theory, J. Archibald (ed.), 187–227. Oxford: Blackwell.
Kennedy, J. 1992. Analyzing Qualitative Data: Log-linear Analysis for Behavioural Research. New York: Praeger.
Langacker, R.W. 1987. Foundations of Cognitive Grammar. Vol. 1: Theoretical Prerequisites. Stanford, CA: Stanford University Press.
Langacker, R.W. 1991. Foundations of Cognitive Grammar. Vol. 2: Descriptive Application. Stanford, CA: Stanford University Press.
Levin, B. 1993. English Verb Classes and Alternations. Chicago: The University of Chicago Press.
Ljung, M. 1990. A Study of TEFL Vocabulary. [Stockholm Studies in English 78.] Stockholm: Almqvist & Wiksell.
Ljung, M. 1991. “Swedish TEFL meets reality”. In English Computer Corpora, S. Johansson and A.-B. Stenström (eds), 245–256. Berlin: Mouton de Gruyter.
Macleod, C., A. Meyers and R. Grishman 1996. “The influence of tagging on the classification of lexical complements”. Proceedings of the 16th International Conference on Computational Linguistics (COLING 96). University of Copenhagen.
MacWhinney, B. (ed.) 1999. The Emergence of Language. Mahwah, NJ: Lawrence Erlbaum Associates.
McEnery, T. 1995. Computational pragmatics: Probability, deeming and uncertain references. Unpublished PhD thesis. Lancaster University.
McEnery, T. and Kifle, N. 1998. “Non-native speaker and native speaker argumentative compositions – A corpus-based study”. In Proceedings of the First International Symposium on Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, S. Granger and J. Hung (eds). Chinese University of Hong Kong.
Matsumoto, Y., A. Kitauchi, T. Yamashita, Y. Hirano, H. Matsuda, K. Takaoka and M. Asahara 2000. Japanese Morphological Analysis System ChaSen version 2.2.1. Online manual (http://chasen.aist-nara.ac.jp/chasen/doc/chasen-2.2.1.pdf).
Mindt, D. 1997. “Corpora and the teaching of English in Germany”. In Teaching and Language Corpora, G. Knowles, T. McEnery, S. Fligelstone and A. Wichman (eds), 40–50. London: Longman.
Montrul, S.A. 1998. “The L2 acquisition of dative experiencer subjects”. Second Language Research 14 (1):27–61.
Oshita, H. 1997. “The unaccusative trap”: L2 acquisition of English intransitive verbs. Unpublished PhD thesis. University of Southern California.
Pinker, S. 1984. Language Learnability and Language Development. Cambridge, MA: Harvard University Press.
Pinker, S. 1987. “The bootstrapping problem in language acquisition”. In Mechanisms of Language Acquisition, B. MacWhinney (ed.), 399–441. Hillsdale, NJ: Erlbaum.
Pinker, S. 1989. Learnability and Cognition: The Acquisition of Argument Structure. Cambridge, MA: MIT Press.
Rohlen, T.P. 1983. Japan’s High School. Berkeley: University of California Press.
Sawyer, M. 1996. “L1 and L2 sensitivity to semantic constraints on argument structure”. In Proceedings of the 20th Annual Boston University Conference on Language Development 2, A. Stringfellow, D. Cahana-Amitay, E. Hughes and A. Zukowski (eds), 646–657. Somerville, MA: Cascadilla Press.
Sekine, S. 1998. Corpus based parsing and sublanguage studies. Unpublished PhD thesis. New York University.
Slobin, D. 1997. The Crosslinguistic Study of Language Acquisition. Vol.: Expanding the Contexts. Mahwah, NJ: Lawrence Erlbaum.
Tomasello, M. 1992. First Verbs: A Case Study of Early Grammatical Development. Cambridge: Cambridge University Press.
Tono, Y. 2002. The role of learner corpora in SLA research and foreign language teaching: The multiple comparison approach. Unpublished PhD thesis. Lancaster University.
Toth, P.D. 1997. Linguistic and pedagogical perspectives on acquiring second language morpho-syntax: A look at Spanish se. Unpublished PhD thesis. University of Pittsburgh.
Ungerer, F. and H.J. Schmid 1996. An Introduction to Cognitive Linguistics. Harlow, Essex: Addison Wesley Longman.
White, L. 1987. “Markedness and second language acquisition: The question of transfer”. Studies in Second Language Acquisition 9:261–286.
White, L. 1991. “Argument structure in second language acquisition”. Journal of French Language Studies 1:189–207.
Yip, V. 1994. “Grammatical consciousness-raising and learnability”. In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 123–138. Cambridge: Cambridge University Press.
Zobl, H. 1989. “Canonical typological structures and ergativity in English L2 acquisition”. In Linguistic Perspectives on Second Language Acquisition, S. Gass and J. Schachter (eds), 203–221. Cambridge: Cambridge University Press.
New wine in old skins? A corpus investigation of L1 syntactic transfer in learner language

Lars Borin* and Klas Prütz**

* Natural Language Processing Section, Department of Swedish, Göteborg University, Sweden
** Centre for Language and Communication Research, Cardiff University, UK
This article reports on the findings of an investigation of the syntax of Swedish university students’ written English as it appears in a learner corpus. We compare part-of-speech (POS) tag sequences (being a rough approximation of surface syntactic structure) in three text corpora: (1) the Uppsala Student English corpus (USE); (2) the written part of the British National Corpus Sampler (BNCS); (3) the Stockholm Umeå Corpus of written Swedish (SUC). In distinction to most other studies of learner corpora, where only the target language (L2) as produced by native speakers has been compared to the learners’ interlanguage (IL), we add a comparison with the learners’ native language (L1) as produced by native speakers. Thus, we investigate differences in the frequencies of POS n-grams between the BNCS (representing native L2) on the one hand, and the USE (representing IL) and SUC (representing native L1) corpora on the other hand, the hypothesis being that significant common differences would reflect L1 interference in the IL, in the form of underuse or overuse of L2 constructions. This makes our study not only one of learner language, or IL in general, but of specific L1 interference in IL. We compare the results of our study to methodologically similar learner corpus research by Aarts and Granger, as well as to our own earlier investigation of English translated from Swedish.
1. Introduction

An important strand of inquiry in second language acquisition (SLA) research is that devoted to the investigation of language learners’ successive approximations of the target language, referred to as interlanguage (IL) in the SLA literature. Similarly to the practice in other kinds of linguistic investigation, SLA researchers are concerned with empirical description of various kinds of interlanguage, with discovering correlations between traits in interlanguage and features of the language learning situation, with explaining those correlations, and finally with the practical application of the knowledge thus acquired to language pedagogy. The features of language learning situations which have at one time or another been claimed to influence the shape and development of IL are the following (based on Ellis 1985: 16f):

1. Situational factors (explicit instruction or not; foreign vs. second language, etc.)
2. Linguistic input
3. Learner differences, including the learner’s L1
4. Learner processes

In this paper, we will be concerned mainly with factor (3), and more specifically with the influence of the learner’s L1 on her IL. The phenomenon whereby features of the learner’s native language are “borrowed” into her version of the target language – the IL – is referred to as transfer in the SLA literature. Transfer could in principle speed up language learning, if L1 and L2 are similar in many respects, but the kind of transfer which understandably has been most investigated is that where the learner transfers traits which are not part of the L2 system (negative transfer or interference). Interference and other features of IL have long been studied by so-called error analysis (EA), where language learners’ erroneous linguistic output is collected.
Traditional EA suffers from a number of limitations:

– Limitation 1: EA is based on heterogeneous learner data;
– Limitation 2: EA categories are fuzzy;
– Limitation 3: EA cannot cater for phenomena such as avoidance;
– Limitation 4: EA is restricted to what the learner cannot do;
– Limitation 5: EA gives a static picture of L2 learning. (Dagneaux et al. 1998: 164)
The use of learner corpora is often seen as one possible way to avoid the worst limitations of traditional EA.
1.1 Studying interlanguage with learner corpora

Learner corpora are a fairly new arrival on the corpus linguistic scene, but have quickly become one of the most important resources for studying interlanguage. Like other corpora, a learner corpus is “a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration” (McEnery and Wilson 2001:32). A learner corpus is a collection of texts – written texts or transcribed spoken language – produced by language learners, and sampled so as to be representative of one or more combinations of situational and learner factors. This addresses the first limitation of EA mentioned in the preceding section; by design, learner corpus data is homogeneous. The whole gamut of corpus linguistics methods and tools is applicable to learner corpora, too. Available for immediate application are such tools as concordancers and word (form) listing, sorting and searching utilities, as well as statistical processing on the word form level. Even with these fairly simple tools a lot can be accomplished, especially with “morphologically naïve” languages like English. For deeper linguistic analysis, learner corpora can be lemmatized, annotated for part-of-speech (POS) – or POS-tagged – and/or parsed to various degrees of complexity. Learner corpora can also be annotated for the errors found in them, which raises the intricate question of how errors are to be classified and corrected (Dagneaux et al. 1998). Utilizing methods from parallel corpus linguistics (Borin 2002a; Kilgarriff 2001),1 learner corpora can be compared to each other or to corpora of texts produced by native speakers of the learners’ target language (L2) or their native language(s) (L1). Figure 1 illustrates some of the possibilities in this area.
In Figure 1, case (i) [the double dotted line] is the ‘classical’ mode of learner corpus use (and of traditional error analysis) – interlanguage analysis (IA).2 Here, the interlanguage (IL), represented by the learner corpus, is compared to a representative native-speaker L2 corpus. Case (ii) [the dotted triangle] is an extension of (i), where different kinds of IL are contrasted with each other and with the L2 (called CIA – contrastive interlanguage analysis – by Granger 1996). The different ILs could be produced by learners with different native languages (as in most investigations based on ICLE; see Granger 1998 and 4.1 below) or
Figure 1. Learner corpora and SLA research.
by learners with different degrees of proficiency, or, finally, by the same learners at different times during their language learning process, i.e. a longitudinal comparison (Hammarberg 1999), which goes some way towards dealing with limitation 5 of EA (see above). Case (iii) [the faint double dashed line] represents a methodological tool which at times has been important in SLA research, but not much pursued in the context of learner corpora, namely contrastive analysis (CA), where native-speaker L1 and L2 are compared in order to find potential sources of interference.3 Cases (i), (ii) and (iii) are quite general, and are meant to cover investigations on all linguistic levels. For pragmatic reasons, most such investigations have confined themselves to the level of lexis and such syntactic phenomena as are easily investigated through lexis. However, there is an increasing amount of work on (automatically) POS-tagged learner corpora (e.g., Aarts and Granger 1998; see 4.1 below), and even some investigations of parsed learner corpora (see Meunier 1998; Staerner 2001). The present paper addresses case (iv), the double solid lines, which to the best of our knowledge has not been investigated earlier using learner corpora.4 In the future, we hope to be able to also look into case (v) [the double+single solid lines], the extension of case (iv) to more than one kind of IL.
2. Investigating syntactic interference in learner language

In distinction to most other studies of learner language corpora, where the IL has been compared only to native L2 production, in our own investigation we
add a comparison with the learners’ L1. Arguably, this makes our study not only one of interlanguage in general, but of specific L1 interference as evidenced in IL, which is relevant, e.g., for the development of intelligent CALL applications incorporating natural language processing components – our particular area of expertise – e.g. learner language grammars and learner models. We investigated differences in the frequencies of POS sequences (or POS n-grams) between a corpus of native English on the one hand, and two corpora – one of Swedish advanced learner English and one of native Swedish – on the other. The hypothesis is that significant common differences would reflect L1 interference in the IL on the syntactic level, since the POS sequences arguably serve as a rough approximation of surface syntactic structure, at least in the case of languages where syntactic relations are largely signalled by constituent order (both English and Swedish are such languages). The differences found were of two kinds, reflecting overuse or underuse of particular POS sequences, common to Swedish advanced learner English and Swedish, as compared to native English. In what follows, we will refer to those IL traits that we focus on in our investigation as “IL+L1”.5
2.1 The corpora and tagsets

For our investigation, we used the following three sets of corpus materials.

1. The learner corpus, the Uppsala Student English corpus (USE; Axelsson 2000; Axelsson and Berglund 2002), contains about 400,000 tokens (about 350,000 words);
2. The native English corpus was made up of the written language portion of the British National Corpus Sampler (BNCS; Burnard 1999), containing about 1.2 million tokens (roughly 1 million words);
3. The native Swedish corpus, the Stockholm Umeå Corpus (SUC; Ejerhed and Källgren 1997), contains roughly 1.2 million tokens (about 1 million words).

The BNCS and SUC corpora come in POS-tagged, manually corrected versions, which we have used without modification. The USE corpus was tagged by us with a Brill tagger trained on the BNC sampler, giving an estimated accuracy of 96.7%. For the purposes of this investigation, both tagsets were reduced, the English set to 30 tags (from 148) and the Swedish to 37 tags (from 156). The reduced tagsets are listed and compared in the Appendix. The tagsets
were reduced for two reasons: first, earlier work has indicated that training and tagging with a large tagset, and then reducing it, not only improves tagging performance, but also gives better results than training and tagging only with the reduced set. Prütz’s (2002) experiment with a Swedish Brill tagger and the same full and reduced tagsets as those used here gave an increased accuracy across the board of about two percentage points from tagging with the large tagset and then reducing it, compared to tagging with the full set. Tagging directly with the reduced set resulted in lower accuracy, by a half to one percentage point, depending on the lexicon used. Second, coarse-grained tagsets are more easily comparable than fine-grained ones even for such closely related languages as Swedish and English (Borin 2000, 2002b).
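The reduction step itself is a simple many-to-one mapping from fine-grained to coarse-grained tags, applied after tagging. The mapping fragment below is hypothetical, given in the style of the BNC's CLAWS-like tags; the actual mappings are those listed in the Appendix:

```python
# Hypothetical illustration of tagset reduction: fine-grained tags are
# collapsed onto coarser categories after tagging. The tag names and
# the mapping are invented for illustration; see the paper's Appendix
# for the real reduced tagsets.
REDUCTION = {
    "NN1": "NN", "NN2": "NN",   # singular/plural common noun -> noun
    "VVB": "VB", "VVD": "VB",   # base/past lexical verb -> verb
    "AJ0": "AJ", "AJC": "AJ",   # positive/comparative adjective -> adjective
}

def reduce_tags(tags):
    # Tags with no entry in the mapping are kept unchanged.
    return [REDUCTION.get(t, t) for t in tags]

print(reduce_tags(["NN2", "VVD", "AJ0", "AT0"]))
```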
2.2 Experiment setup

In Figure 2, the setup of the experiment is shown in overview. We used a similar procedure to that of our earlier investigation of translationese (Borin and Prütz 2001):6

1. First, we extracted all POS n-gram types (for n = 1 ... 4) and their frequencies from the three POS-tagged corpora;
2. From the n-gram lists we removed certain sequences, namely (a) those containing the tag NC (proper noun; we hypothesize that a higher or lower relative incidence of proper nouns is not a distinguishing trait in learner language), (b) those with punctuation tags, except for those containing exactly one full-stop tag, in the first or the last position,7 and (c) those not appearing in all three corpora, either by necessity (because of differences between the English and Swedish tagsets) or by chance;
3. For each n-gram length, the incidence of the n-gram types in BNCS (representing native English) and USE (representing learner English) was compared, using the Mann-Whitney (or U) statistic (see Kilgarriff 2001 for a description and justification of the test for this kind of investigation), and instances of significant (p ≤ 0.05, two-tailed) differences (overuse and underuse) were collected (“n-gram ∆ analysis” in Figure 2);
4. BNCS and SUC (representing the learners’ native language, i.e. Swedish) were compared in exactly the same way;
5. Finally, the n-gram types which showed significant overuse or significant underuse in both comparisons were extracted, symbolized by the “&” (logical AND) process in Figure 2.
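Steps 1 and 2 of this procedure can be sketched as follows. Only the NC and full-stop tags are taken from the description above; the other tag names in the sample sequence are illustrative, and the Mann-Whitney comparison of step 3 is omitted since it operates on per-text frequency samples not shown here:

```python
# Sketch of steps 1-2: extract POS n-gram types with frequencies from a
# tagged corpus, then filter out unwanted sequences. Only the full-stop
# case of filter (b) is modelled; other punctuation tags are not shown.
from collections import Counter

def pos_ngrams(tags, n):
    """All POS n-grams (as tag tuples) in a tag sequence, with counts."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def keep(ngram):
    """Filter step 2: drop proper nouns and ill-placed full stops."""
    if "NC" in ngram:                                # (a) proper-noun tag
        return False
    stops = [i for i, t in enumerate(ngram) if t == "."]
    # (b) at most one full-stop tag, and only in first or last position
    return len(stops) <= 1 and all(i in (0, len(ngram) - 1) for i in stops)

tags = ["AT", "NN", "VB", "AT", "NN", ".", "NC", "VB"]
trigrams = {g: c for g, c in pos_ngrams(tags, 3).items() if keep(g)}
print(trigrams)
```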
New wine in old skins?
73
Figure 2. Experiment setup
3. Results by the numbers

In this section, we give a general overview of our results, but defer discussion to section 4, where we compare our findings with those of other similar investigations. In Table 1, you will find the numbers, i.e. how many n-gram types of each length occurred in each corpus. We give both the actual and the theoretically expected figures. For unigrams, the expected figure is the cardinality of the tagset, of course, while the figure for the other n-grams is the actually occurring number of unigrams in the corpus in question raised to the corresponding power; thus, 29³ (i.e. 24389) is the expected number of trigrams in the USE corpus. This simply illustrates the well-known fact that language has syntax, and is not in general freely combinatorial. The longer the sequence, the smaller the fraction becomes that is actually used of all possible combinations. This is what makes it possible to let POS n-grams stand in for real syntactic analyses.

In Table 2, underuse and overuse are shown, found by the experimental procedure described in the previous section. The percentage figures shown in the table are calculated by dividing the underuse/overuse figures by the POS n-gram figures for the USE corpus, i.e., the percentage of significantly different (underused and overused) trigrams is calculated as (42+155)/6526 (≈ 0.03019, i.e. 3.0%). An interesting fact reflected by the figures in Table 2 is that there turned out to be more instances of overuse than of underuse for all n-gram lengths.
74
Lars Borin and Klas Prütz
Table 1. Actually occurring and expected n-gram types in the corpora

             USE                     BNCS                    SUC
             occurring (expected)    occurring (expected)    occurring (expected)
unigrams     29 (30)                 30 (30)                 34 (37)
bigrams      663 (841)               807 (900)               1035 (1156)
trigrams     6526 (24389)            10800 (27000)           13616 (39304)
4-grams      31761 (707281)          60645 (810000)          72770 (1336336)
Table 2. Underuse and overuse per n-gram length

             underuse      overuse       total
unigrams     1 (3.4%)      3 (10.3%)     13.7%
bigrams      11 (1.6%)     36 (5.4%)     7.0%
trigrams     42 (0.6%)     155 (2.4%)    3.0%
4-grams      91 (0.3%)     171 (0.5%)    0.8%
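The expected counts in Table 1 and the percentages in Table 2 can be rechecked directly from the reported figures; this is just arithmetic, here written out in Python:

```python
# Expected trigram types in USE: occurring unigram types cubed.
use_unigrams = 29
expected_trigrams = use_unigrams ** 3          # 24389

# Fraction of the possible trigrams actually attested in USE.
occurring_trigrams = 6526
fraction_attested = occurring_trigrams / expected_trigrams

# Share of significantly different (under- + overused) trigrams,
# relative to the trigram types occurring in USE (Table 2).
underused, overused = 42, 155
share = (underused + overused) / occurring_trigrams
```

The attested fraction (about 27% of the possible trigrams) illustrates the point made above that language is not freely combinatorial, and `share` reproduces the 3.0% figure for trigrams.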
In section 3.1, we discuss some representative cases of each n-gram type.
3.1 Distinctive IL+L1 n-grams

3.1.1 Unigrams

Among the unigrams, there was one instance of underuse, “K2” (past participle), while there were three overused parts-of-speech: “V” (finite verb), “R” (adverb), and “C” (conjunction). Possibly, this indicates a less complex sentence-level syntax in the IL+L1 than in native English, with more finite clauses joined by conjunctions, rather than non-finite subordinate clauses.8 The adverbs could be a sign of a more lively, narrative style, and may possibly have nothing at all to do with the fact that these particular narratives happen to be in interlanguage (but see section 4.2).

3.1.2 Bigrams

Just as adverbs by themselves are overused in the USE IL+L1, so are a number of bigrams containing adverbs, e.g. “R C” (adverb–conjunction), “R R”
(adverb–adverb), “R NN” (adverb–common noun), “R V” (adverb–finite verb), “. R” (sentence initial adverb). Sentence initial common nouns (“. NN”) are also overused, perhaps strengthening the impression that sentence syntax is simpler in IL+L1 than in native L2. By way of illustration, we show some examples of the bigram “R R” from the USE corpus (the full tagset is used in this and in the other examples which follow below): (1) I/PPIS1 also/RR recantly/RR descovered/VVN that/CST my/APPGE spelling/NN1 was/VBDZ rather/RG poor/JJ so_that/CS is/VBZ someting/PN1 I/PPIS1 have/VH0 to/TO work/VVI on/RP ./YSTP (2) He’s/NP1 far/RR away/RP ./YSTP (3) So/RG naturally/RR ,/YCOM they/PPHS2 were/VBDR shocked/JJ to/TO find/VVI complete/JJ wilderness/NN1 and/CC a/AT1 nature/NN1 so/RR unlike/II the/AT English/NN1 ./YSTP
Additionally, examples 4–6 in section 3.1.3 below also contain “R R”. All the most consistently underused bigrams have in common the POS tag “K2” (past participle): “K2 I” (past participle–preposition), “K2 R” (past participle–adverb), “NN K2” (common noun–past participle), “V K2” (finite verb–past participle). We give some examples of the “K2 R” bigram in section 3.1.4 below (examples 13–18), from which we see that the adverb is usually the second component (the verb particle) of a phrasal (or particle) verb. Hence, the IL+L1 shows an underuse of either periphrastic tenses or non-finite clauses, or both, with phrasal verbs.9
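Finding tokens of a given reduced-tag bigram in word/TAG formatted text like the examples above can be sketched as follows. The prefix test (any full CLAWS tag beginning with “R” counting as reduced-tag “R”) is a simplifying assumption that happens to work for the adverb tags in these examples; the real full-to-reduced mapping is the one given in the Appendix.

```python
def tagged_tokens(sentence):
    """Split a word/TAG annotated sentence into (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in sentence.split()]

def find_pattern(sentence, pattern):
    """Return the word sequences whose full tags start with the given
    reduced-tag prefixes, e.g. ("R", "R") matches RR RR, RG RR, ..."""
    toks = tagged_tokens(sentence)
    hits = []
    for i in range(len(toks) - len(pattern) + 1):
        window = toks[i:i + len(pattern)]
        if all(tag.startswith(p) for (_, tag), p in zip(window, pattern)):
            hits.append(" ".join(word for word, _ in window))
    return hits
```

Applied to example (3) above, `find_pattern` with the pattern `("R", "R")` picks out the token “So/RG naturally/RR”.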
3.1.3 Trigrams

Many of the overused trigrams contain adverbs: “. R R” (sentence-initial adverb–adverb; example 3), “R R NN” (adverb–adverb–common noun; examples 4–6). Other examples of overused trigrams are “VI I NN” (infinitive verb–preposition–common noun; examples 10–12), “V I NN” (finite verb–preposition–common noun). (4) When/CS I/PPIS1 write/VV0 ,/YCOM I/PPIS1 can/VM spend/VVI as/RG much/RR time/NNT1 as/CSA I/PPIS1 want/VV0 to/TO make/VVI changes/NN2 and/CC corrections/NN2 ./YSTP
(5) They/PPHS2 are/VBR trying/VVG to/TO imitate/VVI their/APPGE action/NN1 heroes/NN2 and/CC not/XX very/RG seldom/RR accidents/NN2 occur/VV0 ./YSTP (6) That_is/REX however/RR far_from/RG reality/NN1 ./YSTP
Among the underused trigrams we find many which contain adjectives: “A A NN” (adjective–adjective–common noun), “A NN K1” (adjective–common noun–present participle), “A NN K2” (adjective–common noun–past participle), “A NN NN” (adjective–common noun–common noun). Past participles appear among underused trigrams as well. Thus, we find “NN K2 R” (common noun–past participle–adverb) in addition to the already mentioned “A NN K2”.
3.1.4 4-grams

Among overused 4-grams, there are a number involving conjunctions and prepositions, e.g.: “. C NN V” (sentence-initial conjunction–common noun–finite verb; examples 7–9), “C NN R V” (conjunction–common noun–adverb–finite verb), “VI I NN .” (sentence-final infinitive verb–preposition–common noun; examples 10–12), “V I NN .” (sentence-final finite verb–preposition–common noun). (7) When/CS people/NN grew/VVD old/JJ they/PPHS2 were/VBDR depending_on/II their/APPGE relatives’/JJ goodness/NN1 ./YSTP (8) When/CS children/NN2 reach/VV0 a/AT1 certain/JJ age/NN1 ,/YCOM they/PPHS2 tend/VV0 to/TO find/VVI these/DD2 violent/JJ films/NN2 very/RG cool/JJ and/CC exciting/JJ ./YSTP (9) Because/CS fact/NN1 is/VBZ that/CST New/JJ Lanark/NP1 was/VBDZ a/AT1 success/NN1 ,/YCOM a/AT1 large/JJ one/PN1 ./YSTP (10) I/PPIS1 have/VH0 always/RR found/VVN it/PPH1 amusing/JJ to/TO write/VVI in/II English/NN1 ./YSTP (11) We/PPIS2 need/VV0 to/TO teach/VVI them/PPHO2 how/RRQ to/TO defend/VVI themselves/PPX2 in/II today’s/NN2 society/NN1 and/CC to/TO turn/VVI away_from/II violence/NN1 ./YSTP
(12) Another/DD1 great/JJ fear/NN1 was/VBDZ that/CST wilderness/NN1 would/VM force/VVI civilised/JJ men/NN2 to/TO act/VVI like/II savages/NN2 ./YSTP
In the set of underused 4-grams, there are quite a few containing past participles, e.g.: “K2 R I A” (past participle–adverb–preposition–adjective), “K2 R I NN” (past participle–adverb–preposition–common noun; examples 13–15), “K2 R I P” (past participle–adverb–preposition–pronoun; examples 16–18), “NN V K2 R” (common noun–finite verb–past participle–adverb). (13) Why/RRQ does/VDZ anyone/PN1 want/VVI to/TO see/VVI a/AT1 man/NN1 get/VV0 his/APPGE head/NN1 chopped/VVN off/RP on/II television/NN1 ?/YQUE (14) Tom/NP1 is/VBZ blown/VVN up/RP with/IW dynamite/NN1 but/CCB is/VBZ still/RR alive/JJ ./YSTP (15) You/PPY can/VM be/VBI swept/VVN away/RP with/IW money/NN1 ,/YCOM towards/II materialistic/JJ values/NN2 ,/YCOM without/IW even/RR realizing/VVG it/PPH1 ./YSTP (16) It/PPH1 is/VBZ essential/JJ to/II all/DB infant/NN1 mammals/NN2 to/TO be/VBI taken/VVN care/NN1 of/IO ,/YCOM and/CC to/TO be/VBI brought/VVN up/RP by/II someone/PN1 who/PNQS knows/VVZ the/AT difficulties/NN2 of/IO life/NN1 ./YSTP (17) However/RR ,/YCOM the/AT Chief’s/NN2 images/NN2 of/IO machines/NN2 are/VBR not/XX only/RR similes/VVZ ,/YCOM he/PPHS1 also/RR suffers/VVZ delusions/NN2 which/DDQ make/VV0 him/PPHO1 think/VVI that/CST there/EX are/VBR actual/JJ machines/NN2 installed/VVN everywhere/RL around/II him/PPHO1 ,/YCOM controlling/VVG him/PPHO1 ./YSTP (18) I/PPIS1 know/VV0 that/CST woman/NN1 is/VBZ naturally/RR and/CC necessarily/RR weak/JJ in_comparison_with/II man/NN1 ;/YSCOL and/CC that/CST her/APPGE lot/NN1 has/VHZ been/VBN appointed/VVN thus/RR by/II Him/PPHO1 who/PNQS alone/JJ knows/VVZ what/DDQ is/VBZ best/JJT for/IF us/PPIO2 ./YSTP
4. Comparisons with similar previous work

In this section we compare our results in more detail to other relevant work. The only similar investigation of learner language that we know of is that made by Aarts and Granger (1998). Their work is methodologically similar to our approach, and therefore a fairly detailed comparison of our findings with theirs seems warranted. Section 4.1 is devoted to such a comparison. Further, it seems reasonable to assume that there should be common traits in translated language (translationese; Gellerstam 1985, 1996) and (advanced) learner language, and in section 4.2, we compare our results here to those obtained in our earlier investigation of translationese.
4.1 Aarts and Granger 1998

Aarts and Granger (1998; henceforth A&G) compared POS trigram frequencies in three learner corpora, the Dutch, Finnish and French components of ICLE, with comparable material produced by native speakers of English, i.e. the LOCNESS (LOuvain Corpus of Native English eSSays) corpus. Their investigation was thus an instance of corpus-based CIA (see above), and did not involve the native languages of the learners, other than indirectly, through the comparison between the three corpora. A&G produced POS trigram frequency lists from all four corpus materials (each about 150,000 words in length). As in our investigation, they worked with a reduced version of the tagset used for tagging the corpora (the TOSCA-ICE tagset with 270 tags, which were reduced to 19). They then investigate their trigram lists in a number of ways:

1. They calculate significant differences (underuse and overuse in relation to LOCNESS) in the rank orderings of the lists, using the χ² test;
2. They investigate the differences common to the three ICLE components in relation to LOCNESS (the “cross-linguistic invariants”; about 7% of the trigrams),
3. and differences unique to one learner variety (“L1-specific patterns”; about 20–25% of the trigrams, depending on the L1), where only the French variety is discussed in any detail by A&G (see above).

We now proceed to a more detailed comparison between the findings of A&G and our own results (B&P in what follows). We should keep some things in
mind, though. First of all, A&G actually make a different investigation. They investigate over- and underuse of POS trigrams in a learner corpus, compared to a native speaker corpus. Our investigation started out in the same way, but additionally, we remove all POS n-grams which do not differ in the same way between the native L2 corpus and a corpus of native L1, i.e. the native language of the learners. Thus, the POS n-grams that remain in our case should exclude A&G’s “cross-linguistic invariants”, if indeed their “L1-specific patterns” reflect transfer from the learners’ native language. A&G use a smaller tagset (which reflects a partly different linguistic classification) than we do. Also, we have used a different statistical test for significance testing. These circumstances conspire to make comparisons between our investigations not entirely straightforward, and could easily account for the differences in the numbers that the two investigations arrive at (we come nowhere near the at least 20% L1-specific trigrams found by A&G; see Table 2, above). If our respective studies really investigate the same thing, we would make the following two predictions.

1. There could – but need not – be partial overlap between the “L1-specific patterns” A&G found and those that we have uncovered. The overlap should in that case be larger, the closer the L1 in question is to Swedish, i.e. A&G’s Dutch ICLE material should show most overlap with our results. Unfortunately, A&G present concrete results only for French L1-specific patterns, which show practically no overlap with our patterns, as expected;
2. We would also predict that those POS trigrams that A&G found to be over- or underused in all the three subcorpora they investigated – the “cross-linguistic invariants” – should not appear in our material. By and large, this prediction holds, i.e.
most of the patterns that A&G find to be significantly different in the same way in all the three L1-specific subcorpora are indeed not present in our set of significantly differently distributed POS n-grams. The only possible exceptions to this are the sentence-initial patterns shown in Table 3, where the picture is not as clear as in other cases. Although there are so few n-grams that no firm conclusions can be drawn from them, it still seems that there is a difference between those patterns where A&G found overuse and the ones that are underused according to their results. A&G tags should be fairly self-explanatory (except perhaps “#”, sentence break), and B&P tags are explained in the Appendix. Differences are noted using “+” (overuse), “–” (underuse), and “h” (no significant difference).

Table 3. Comparison with language-invariant sentence-initial patterns found by A&G (based on Section 4.2, Aarts and Granger 1998: 137)

             A&G POS sequence    ≈ B&P POS sequence    A&G    B&P
overused     # # CONNEC          . C                   +      +
             # # ADV             . R                   +      +
             # # PRON            . P                   +      h
underused    # # N               . NN                  –      +
             # CONJ N            . C NN                –      h
             # PREP Ving         . I K1                –      h
4.2 Borin and Prütz 2001

Intuitively, translated language (translationese; see above) and IL ought to have features in common: “Both are situated somewhere between L1 and L2 and are likely to contain examples of transfer.” (Granger 1996: 48). Thus, it is of value to compare the results of the present investigation to an earlier similar investigation of translationese (Borin and Prütz 2001), where we looked at news text translated from Swedish to English, using an almost identical experimental procedure to the one presented here. The differences were as follows.

1. Different corpora were used, of course: (a) the English translation and (b) Swedish original versions of a Swedish news periodical for immigrants, and the “press, reportage” parts of the (c) FLOB and (d) Frown English corpora;
2. In addition to the 1- to 4-grams investigated in IL+L1, we also investigated 5-grams in our translationese study;
3. The initial selection of distinct n-grams was different, and based on an absolute difference in rank in the corpora, rather than on a statistical test. The same set of n-grams as in the present investigation were then removed from consideration (i.e., those containing proper nouns and certain kinds
of punctuation, and those not occurring in all the compared corpora; see above);
4. The statistical test was applied only to the results of the initial selection, resulting in the removal of a number of n-grams. However, we do not know if the initial selection has excluded some n-grams which would have been singled out as significantly different by the statistical test.

If we take as our hypothesis that there should be a fair amount of overlap between the two sets of distinct n-grams, or perhaps even that the n-grams found to be characteristic of translationese should be a subset of those characteristic of learner language, we have to admit that the hypothesis was soundly falsified. What we found was that there were a considerably larger number of significant differences characteristic of learner language than of translationese (506 2- to 4-grams in IL+L1 vs. 41 in translationese), except in the case of unigrams, where IL+L1 had 4, against 6 in translationese. On the other hand, there is almost no overlap – let alone inclusion – between the two sets of n-grams. There are two shared bigrams (“. R” and “C VI”, both overused), one shared trigram (“. I P”, overused), and no shared unigrams or 4-grams.10 The one similarity that we did find was a somewhat similar situation with regard to overuse and underuse. There are more overused than underused bigrams and trigrams both in IL+L1 and translationese, while they differ with respect to 4-grams, where translationese displayed more underuse than overuse.

In conclusion: while our results perhaps do not invalidate the intuition that IL and translationese “are situated somewhere between L1 and L2 and are likely to contain examples of transfer” (see above), it certainly seems that they are situated in quite different locations in the region between L1 and L2 (but see the next section). More research is clearly needed here.
5. Discussion and conclusion

In this section, we would like to discuss some general issues which bear on the interpretation of our results and on the comparisons we have made of these results with the findings of other similar investigations:

1. Representativeness of the English “standard”. We have used the written part of BNCS as the L2 standard. Perhaps we should instead have used a native students’ essay corpus such as LOCNESS (like Aarts and Granger 1998), or perhaps even a corpus of spoken English, acknowledging the fact that the written English of Swedish learners is held to be influenced by colloquial spoken English (see Hägglund 2001);
2. Representativeness of the Swedish “standard”. In the same way, we could question whether SUC really faithfully represents the learners’ “point of departure”, the form of Swedish most likely to influence their IL English. Perhaps here, too, a corpus of spoken Swedish would serve better (see Allwood 1999), or possibly a corpus of Swedish student compositions;
3. What do the “L1-specific” trigrams found by Aarts and Granger (1998) reflect? Our hypothesis – which informed the way we set up our experiment, described in section 2 above – was that they represent transfer, i.e., that underuse and overuse of an n-gram type in IL reflect a relatively lower and higher incidence, respectively, of the same n-gram type in the L1. Only if this hypothesis holds are our results comparable with those of Aarts and Granger. If underuse or overuse in IL is due to something else, then obviously we cannot compare our results. For instance, underuse in the IL could be due to avoidance of an L1 structure, in which case it should be correlated to a higher incidence in the L1 or no significant difference;
4. There is an estimated tagging error rate of slightly more than 3% in the USE corpus (see section 2.1). If the errors made by the tagger are not random, there will be a bias in the results of our investigation;
5. POS tag sequences are of course not syntactic units; they merely give better clues to syntax than word-level investigations are able to provide. The picture we get of learner (and native speaker) language syntax is therefore likely to be distorted and to need careful interpretation to be usable.
In conclusion, we think that our investigation confirms the observation made by Aarts and Granger (1998) and Borin and Prütz (2001) that a contrastive investigation of POS-tagged corpora can yield valuable linguistic insights about the differences (and similarities) among the investigated language varieties. At the same time, much remains to be done regarding matters of methodology; among other things, the issues mentioned above need to be addressed. In the future, we would like to look into the issue of L1 and L2 corpus representativeness. We would also like to extend and refine our investigation of L1
interference in learner language syntax in various ways, notably by the use of robust parsing (Abney 1996), which would enable us to look at syntax directly, to investigate e.g. which syntactic constituents and functions are most indicative of learner language.
Acknowledgements

We would like to thank the volume editors for their careful reading and commenting on (the previous version of) this article. The research presented here was funded by the following sources: an Uppsala University, Faculty of Languages reservfonden grant; Vinnova, through the CrossCheck project; and the Knut and Alice Wallenberg Foundation, through the Digital Resources in the Humanities project, part of the Wallenberg Global Learning Network initiative.
Notes

1. We use the term “parallel corpus linguistics” to subsume both work with parallel corpora – i.e., original texts in one language and their translations into another language or other languages – and work with comparable corpora, i.e., original texts in two or more languages which are similar as to genre, topic, style, etc. At least in the language technology-oriented research tradition, there are interesting commonalities between the two kinds of work (see Borin 2002a), e.g. in the use of distributional regularities for automatically discovering translation equivalents in both kinds of corpora. Work such as that presented here, dealing with comparisons among learner IL corpora and original L1 and L2 corpora, is most similar to work on comparable corpora, of course.

2. But using a learner corpus and (computational) corpus linguistics tools, we can do much more than in traditional EA. Perhaps the major advantage is that we can investigate patterns of deviant usage – i.e., instances of overuse and underuse – rather than just instances of clear errors. Even in the latter case, we can generalize over the normal linguistic contexts (on many linguistic levels, to boot) of particular errors fairly easily using corpus linguistics tools, something which in general was not feasible in traditional EA. This takes care of limitations 3 and 4 of EA mentioned above.

3. In corpus linguistics – at least if we are talking about the more interesting case, namely the development of automatic methods for making linguistically relevant comparisons between texts – the closest thing to CA is the work on parallel and comparable corpora aimed mainly at extracting translation equivalents for machine translation or cross-language information retrieval systems (see, e.g., Borin 2002a). These methods, although at present used almost
exclusively for language technology purposes, could in principle be used for a more traditionally linguistically-oriented “contrastive corpus linguistics” as well, as has been argued elsewhere (e.g. in Borin 2001; cf. Granger 1996), complementing the largely manual modes of investigation used in present-day corpus-based contrastive linguistic research.

4. At least not in the way that we propose to do it. Although it shares some traits with Granger’s (1996: 46ff) proposed “integrated CA/CIA contrastive model [which] involves constant to-ing and fro-ing between CA and CIA”, we believe that our method provides for a tighter coupling between all the involved language varieties; there is no difference (indeed, there should be no difference) between CA and IA with our way of doing things.

5. Note that our method of investigation is by design unsuited for finding errors, since we count as instances of overuse only such items that actually appear in the native L2 corpus, i.e., if a construction appears in the L1 and IL corpora but not in the L2 corpus, it is not counted as an instance of overuse, even though the difference in itself may be statistically significant. Concretely, this is achieved by taking the L2 corpus – i.e., the British National Corpus Sampler in our case – as the basis for all comparisons; see further 2.2.

6. There were some small differences, which we will return to below, when we compare the results of the two investigations.

7. The motivation for this is possibly less well-founded than in the case of proper nouns, but let us simply say that we wish to limit ourselves, at least for the time being, to looking at clause-internal syntax imperfectly mirrored in the POS tag sequences found in a text. Of course, at the same time we eliminate e.g. commas functioning as coordination conjunctions, i.e., clause-internally. We also do not wish to claim that rules of orthography, such as the use of punctuation, cannot be subject to interference.
We are simply more interested in syntax more narrowly construed. The reason for keeping leading and trailing full stops is that a full stop is an unambiguous sentence (and clause) boundary marker, thus permitting us to look at POS distribution at sentence (and some clause) boundaries.

8. English has more possibilities for non-finite clausal subordination than Swedish, which may be relevant here. It seemed that the results of our earlier translationese investigation reflected this circumstance (Borin and Prütz 2001: 36). Granger (1997) finds a similar underuse of non-finite subordinate clauses in non-native written academic English as compared to that of native writers.

9. Here, it would be good to compare our results with Hägglund’s (2001) lexical investigation of phrasal verbs in the Swedish component of ICLE, compared to LOCNESS. For the time being, this will have to remain a matter for future investigation, however.

10. Although it is an intriguing fact that our translationese study found significantly more adverbs in Swedish than in all the English materials, and that the English translated from Swedish had more – but not significantly more – than either of the other two sets of English materials (see section 3.1.1).
References Aarts, J. and Granger, S. 1998. “Tag sequences in learner corpora: A key to interlanguage grammar and discourse”. In Learner English on Computer, S. Granger (ed.), 132–141. London: Longman. Abney, S. 1996. “Part-of-speech tagging and partial parsing”. In Corpus-Based Methods in Language and Speech, K. Church, S. Young and G. Bloothooft (eds). Dordrecht: Kluwer. Allwood, J. 1999. “The Swedish spoken language corpus at Göteborg University”. In Fonetik 99: Proceedings from the 12th Swedish Phonetics Conference. [Gothenburg papers in theoretical linguistics 81]. Department of Linguistics, Göteborg University. Axelsson, M. W. 2000. “USE – the Uppsala Student English Corpus: An instrument for needs analysis”. ICAME Journal 24: 155–157. Axelsson, M. W. and Berglund, Y. 2002. “The Uppsala Student English Corpus (USE): A multi-faceted resource for research and course development”. In Parallel Corpora, Parallel Worlds, L. Borin (ed.), 79–90. Amsterdam: Rodopi. Borin, L. 2000. “Something borrowed, something blue: Rule-based combination of POS taggers”. Second International Conference on Language Resources and Evaluation. Proceedings, Volume I, 21–26. Athens: ELRA. Borin, L. 2001. “Att undersöka språkmöten med datorn”. In Språkets gränser och gränslöshet. Då tankar, tal och traditioner möts. Humanistdagarna vid Uppsala universitet 2001, A. Saxena (ed.), 45–56. Uppsala: Uppsala University. Borin, L. 2002a. “… and never the twain shall meet?”. In Parallel Corpora, Parallel Worlds, L. Borin (ed.), 1–43. Amsterdam: Rodopi. Borin, L. 2002b. “Alignment and tagging”. In Parallel Corpora, Parallel Worlds, L. Borin (ed.), 207–218. Amsterdam: Rodopi. Borin, L. and Prütz, K. 2001. “Through a glass darkly: Part of speech distribution in original and translated text”. In Computational Linguistics in the Netherlands 2000, W. Daelemans, K. Sima’an, J. Veenstra and J. Zavrel (eds), 30–44. Amsterdam: Rodopi. Burnard, L. (ed.). 1999. “Users reference guide for the BNC sampler”. 
Published for the British National Corpus Consortium by the Humanities Computing Unit at Oxford University Computing Services, February 1999. [Available on the BNC Sampler CD]. Dagneaux, E., Denness, S. and Granger, S. 1998. “Computer-aided error analysis”. System 26: 163–174. Ejerhed, E. and Källgren, G. 1997. “Stockholm Umeå Corpus (SUC) version 1.0”. Department of Linguistics, Umeå University. Ellis, R. 1985. Understanding Second Language Acquisition. Oxford: Oxford University Press. Gellerstam, M. 1985. “Translationese in Swedish novels translated from English”. Translation Studies in Scandinavia. Proceedings from the Scandinavian Symposium on Translation Theory (SSOTT) II, Lund 14–15 June, 1985, L. Wollin and H. Lindquist (eds), 88–95. Lund: Lund University Press. Gellerstam, M. 1996. “Translations as a source for cross-linguistic studies”. In Languages
in Contrast. Papers from a Symposium on Text-Based Cross-Linguistic Studies. Lund 4–5 March 1994, K. Aijmer, B. Altenberg and M. Johansson (eds), 53–62. Lund: Lund University Press. Granger, S. 1996. “From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora”. In Languages in Contrast. Papers from a Symposium on Text-Based Cross-Linguistic Studies. Lund 4–5 March 1994, K. Aijmer, B. Altenberg and M. Johansson (eds), 37–51. Lund: Lund University Press. Granger, S. (ed.). 1998. Learner English on Computer. London: Longman. Hägglund, M. 2001. “Do Swedish advanced learners use spoken language when they write in English?”. Moderna språk 95 (1): 2–8. Hammarberg, B. 1999. “Manual of the ASU Corpus – A longitudinal text corpus of adult learner Swedish with a corresponding part from native Swedes”. Stockholm University, Department of Linguistics. Kilgarriff, A. 2001. “Comparing corpora”. International Journal of Corpus Linguistics 6 (1): 1–37. McEnery, T. and Wilson, A. 2001. Corpus Linguistics. 2nd edition. Edinburgh: Edinburgh University Press. Meunier, F. 1998. “Computer tools for the analysis of learner corpora”. In Learner English on Computer, S. Granger (ed.), 19–37. London: Longman. Prütz, K. 2002. “Part-of-speech tagging for Swedish”. In Parallel Corpora, Parallel Worlds, L. Borin (ed.), 201–206. Amsterdam: Rodopi. Staerner, A. 2001. Datorstödd språkgranskning som ett stöd för andraspråksinlärning. [Computerized language checking as support for second language learning]. MA Thesis in Computational Linguistics, Department of Linguistics, Uppsala University. Online: http://stp.ling.uu.se/~matsd/thesis/arch/2001–007.pdf (visited: 16.04.2004).
Appendix. Reduced Swedish and English tagsets

Table A1. Reduced Swedish (SV-R) and English (EN-R) tagsets

SV-R   EN-R   description                                examples
–      –      dash                                       –
!      !      exclamation mark                           !
"      "      quotes                                     ”
(      (      left bracket                               (
)      )      right bracket                              )
,      ,      comma                                      ,
.      .      full-stop                                  .
       ...    ellipsis                                   ...
:      :      colon                                      :
;      ;      semicolon                                  ;
?      ?      question mark                              ?
       $      genitive clitic                            ’s
A      A      adjective                                  röd, red
C      C      conjunction                                och, that
E      E      infinitive mark                            att, to
F             numeric expression                         16
G             abbreviation                               d.v.s.
I      I      preposition                                på, on
K1     K1     present participle                         seende, eating
K2     K2     past participle                            sedd, eaten
L             compound part                              hög
M      M      numeral                                    två, two
NC     NC     proper noun                                Eva, Evelyn
NC$           proper noun, genitive                      Åsas
NN     NN     noun                                       häst, goat
NN$           noun, genitive                             tjuvs
O      O      interjection                               bu, um
P      P      pronoun                                    vi, we
P$     P$     pronoun, poss. or gen.                     vår, our
Q             pronoun, relative                          som
R      R      adverb                                     fort, fast
S      S      symbol or letter                           G
T      T      determiner                                 en, the
V      V      verb, finite                               såg, ate
VI     VI     verb, infinitive                           se, eat
VK            verb, subjunctive                          såge
VS            verb, supine                               sett
X      X      unknown or foreign word
              (tagged at all only in SUC)
Demonstratives as anaphora markers in advanced learners’ English

Agnieszka Leńko-Szymańska
University of Łódź, Poland
The aim of this study is to confirm teachers’ informal observations and to identify the specific patterns of misuse of the demonstratives as anaphora markers in Polish advanced learners’ English. The misuse is treated here in terms of underuse or overuse of the particular categories of the demonstrative anaphors in students’ essays: the proximal versus the distal demonstratives and the demonstrative determiners versus the demonstrative pronouns. The specific questions addressed in this study are: (1) do Polish learners of English at higher and lower proficiency levels show different patterns of use of demonstrative anaphors? and (2) to what extent do these patterns differ from native speaker use? The data was drawn from two corpora: the PELCRA corpus of learner English and the BNC Sampler. Three stages of analysis were performed on the data. First, the frequencies of occurrence of the demonstratives in the three samples were compared. Next, the proportions of proximal and distal demonstratives were analysed across the samples. Lastly, the proportions of determiner and pronoun uses for the distal plural demonstrative those were assessed. The log likelihood chi-square and the regular chi-square tests were performed to estimate the statistical significance of the results. The results showed that Polish advanced learners of English overuse demonstratives in argumentative writing and this overuse is particularly robust with distal demonstratives. Moreover, learners show a preference for the selection of distal (as opposed to proximal) demonstratives when compared with the native norm. They also show statistically significant overuse of those as a determiner and underuse of those as a pronoun (results for other demonstratives not available). Finally, the patterns of learners’ misuse do not change significantly with years of exposure and learning. Thus, the results indicate that native-like use of the demonstratives is not acquired implicitly by Polish learners. 
The finding has important pedagogical implications, since this feature of language use has not been addressed explicitly in syllabi and ELT materials so far.
1. Introduction

In my experience of reading various types of argumentative and academic essays written by Polish advanced learners of English, it has come to my attention that students (and, to be frank, occasionally myself) have problems in using demonstratives. When sharing this intuitive finding with colleagues I learned that they had made a similar observation. The identified problems rarely involve explicit errors, but are frequently related to non-native patterns of use. Two areas of difficulty are the frequency of occurrence and the choice between proximal (this and these) and distal (that and those) demonstratives. The fragment of a student's essay in (1) below illustrates the type of dilemma Polish learners of English encounter in their writing.

(1) The fact is that there are as many approaches to achieving a success as there are people aiming at it. The same goes to what they perceive to be a success. For ____1____ with a superiority complex, ____2____ will be ruling a kingdom.

1. a) these  b) those
2. a) this  b) that  c) it
Demonstratives in English are classified as belonging to two different part-of-speech categories: they can be determiners, when they premodify the head of a noun phrase, or pronouns, when they themselves function as the head of a noun phrase. Their two major areas of use are situational and time reference (deixis) and anaphoric reference (Quirk et al. 1985, Biber et al. 1999). Teachers' intuitions indicate that the deictic function of the demonstratives seems to be handled by Polish learners fairly well; moreover, deixis rarely surfaces in argumentative and academic writing. The type of use that is believed to be troublesome for Polish students is anaphoric reference, where the choice of the proximal or distal demonstrative does not relate to physical or temporal distance. Teachers' observations concerning problems in the use of the demonstratives in Polish advanced learners' writing do not go beyond awareness of the problem, and say very little about the exact nature of the difficulty. An aim of this study is to confirm these intuitions and to identify the specific patterns of misuse of the demonstratives as anaphora markers. The misuse will be treated here in terms of underuse or overuse of the particular categories of the demonstrative anaphors in students' essays: the proximal versus the distal
demonstratives and the demonstrative determiners versus the demonstrative pronouns, rather than in terms of the number of errors. The choice of methodology is motivated by the fact that in the majority of contexts the selection of a proximal/distal demonstrative is not determined (as it is in gap 1 in (1) above) and depends solely on the writer's intended meaning (cf. gap 2 in (1) above). Thus, learners' problems with demonstrative anaphora rarely involve errors and are instead connected with unnatural tendencies.

Before investigating the problem, it is worthwhile to explore how the usage of the demonstratives is presented and explained to learners. A survey of ELT materials has revealed that this grammatical point is never taught explicitly. The coursebooks most widely used in Poland contain notes on the usage of demonstratives only in their deictic function, as a rule in the first units at the elementary level, and never return to this problem at more advanced stages. Even in books designed for students preparing for the Cambridge exams there are no sections devoted to the use of the demonstratives for anaphoric reference. Nor do ELT grammars offer much help in this area. For example, Swan's Practical English Usage (1980) illustrates how singular and plural demonstratives can be used anaphorically and lists many examples of their use, but does not explain the difference between the use of the proximals and the distals. Descriptive grammars of English (cf. Quirk et al. 1985, Biber et al. 1999) also concentrate on the singular/plural distinction, and in the little space devoted to the proximal/distal dichotomy they present conflicting information.
One reason for the lack of adequate explanations on the usage of the proximal and distal demonstratives for anaphoric reference may be that this depends mainly on subtleties of meaning which are very difficult to pinpoint in terms of rules: The conditions which govern the selection of this and that with reference to events immediately preceding and immediately following the utterance, or the part of the utterance in which this or that occur, are quite complex. They include a number of subjective factors (such as the speaker’s dissociation of himself from the event he is referring to), which are intuitively relatable to the deictic notion of proximity/non-proximity, but are difficult to specify precisely. (Lyons, 1977:668)
The selection of a demonstrative (as opposed to, for example, a pronoun or the definite article) and the choice between the proximal and the distal anaphoric markers are mainly considered dependent on the writer's/speaker's perception and intuition. In the process of learning English as a foreign language, students
are very much left to their own devices to acquire these. Thus, a second aim of this study is to investigate whether such acquisition really takes place. If it does, the patterns of use of the demonstrative anaphors displayed by learners at higher proficiency levels should be closer to (if not identical with) native speaker patterns than those displayed by learners at lower proficiency levels.
1.1 Research questions

The questions addressed in this study can be summarised in the following way:

1. Do Polish learners of English at higher and lower proficiency levels show different patterns of use of demonstrative anaphors?
2. To what extent do these patterns differ from native speaker use?
2. Study

2.1 Data

The data used in the study was drawn from two corpora: the PELCRA corpus of learner English (compiled at the University of Łódź), which is a collection of essays written by Polish university learners at different proficiency levels, and the British National Corpus Sampler. Three samples of these corpora were analysed:

– 105 essays (57,431 tokens) written by second-year students of English (Comp2)
– 69 essays (48,414 tokens) written by fourth-year students of English (Comp4)
– 23 texts (313,347 tokens) from the domain of World Affairs of the BNC Sampler (BNCS-WA)
The PELCRA corpus consists of essays written for the end-of-year exams by students at the Institute of English Studies, University of Łódź. The data is available at four proficiency levels, from Year I to Year IV. In order to ensure the robustness of the proficiency effect (if it exists) in learners' use of the demonstrative anaphors, the decision was made to select essays at the extreme ends of the proficiency scale. The first-year compositions could not be used because they represented a different genre of writing from the other three groups of essays, and as such could be richer in deictic uses of demonstratives.
A standard reference corpus, the BNC Sampler, was selected as a benchmark for comparison. Such a choice may have its drawbacks as the observed differences may not be a result of native/non-native use but rather the effect of discrepancies in authors’ age or experience in writing. While such a possibility has to be borne in mind when interpreting results, it has been proven elsewhere (Leńko-Szymańska 2003) that comparing databanks of equivalent native and non-native students’ essays is also not free of this problem, and since the target standard of writing for non-native students is native professional rather than native apprentice production, the BNC Sampler seems a suitable base for comparison. ‘World Affairs’ was chosen among other BNC Sampler written domains because the topics and genres covered in this domain compare best with the learners’ essays. Such topics and genres include reports and discussions of current events taken from British dailies and excerpts from books on topics ranging from geography to European integration.
2.2 Tools and procedures

The frequencies of occurrence of the four demonstratives in the samples were calculated using the WordSmith Tools package (Scott 1999). In the case of this, these and those, raw texts were used. However, for the calculation of the occurrence of that, the learner corpus was first tagged with CLAWS (a part-of-speech tagger developed at Lancaster University), which tags that as a singular demonstrative or a complementizer. Since CLAWS does not handle the task accurately, the results for the three samples were verified manually. Finally, the concordance lines for those were further sorted into two groups: the lines containing those as a determiner and those containing those as a pronoun. Since the sorting was performed manually, it generated unexpected and interesting observations concerning the post-modification patterns of the pronoun those, which were also quantified.

Three stages of analysis were performed on the data. First, the frequencies of occurrence of the demonstratives in the three samples were compared in order to identify patterns of overuse or underuse in the learners' essays. Next, the proportions of proximal and distal demonstratives were analysed across the samples with the aim of diagnosing learners' potential preferences for one or the other category. Lastly, the proportions of determiner and pronoun uses for the distal plural demonstrative those were assessed in order to explore further the patterns of use of this anaphora marker. The log-likelihood chi-square and
the regular chi-square tests were performed to estimate the statistical significance of the results.
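The log-likelihood statistic used in this analysis can be sketched in a few lines of Python. The function below implements the standard two-corpus log-likelihood formula (Dunning-style, computed from observed and expected frequencies); assuming this is the variant applied in the study, the demonstrative counts from Table 1 reproduce the LL values reported in Table 2:

```python
import math

def log_likelihood(freq1, size1, freq2, size2):
    """Two-corpus log-likelihood (the standard Dunning-style formula).

    freq1/freq2 are the observed frequencies of an item in each corpus;
    size1/size2 are the corpus sizes in tokens.
    """
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed:                     # treat 0 * ln(0) as 0
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Demonstrative counts and corpus sizes from Table 1
print(round(log_likelihood(529, 57431, 2182, 313347), 2))  # Comp2 vs BNCS-WA
print(round(log_likelihood(488, 48414, 2182, 313347), 2))  # Comp4 vs BNCS-WA
print(round(log_likelihood(529, 57431, 488, 48414), 2))    # Comp2 vs Comp4
```

With one degree of freedom, values above 3.84 correspond to p < 0.05, which is how the significance judgements in Table 2 can be read.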
2.3 Results

The first step in the analysis involved a comparison of the overall frequencies of demonstratives in the three samples. Table 1 presents the observed frequencies and Table 2 contains the results of the log-likelihood chi-square tests assessing the differences between the samples. The tests show that both groups of learners overuse demonstratives in comparison to native speakers. There is no statistical difference between the two groups of learners, indicating that overuse does not significantly diminish with years of exposure and learning. The frequencies of occurrence of individual demonstratives are presented in Table 3, and Figure 1 displays the graphic representation of the results.

Table 1. Overall frequencies of demonstratives

                     Comp2      Comp4      BNCS-WA
tokens               57,431     48,414     313,347
demonstratives       529        488        2182
Table 2. Results of the log-likelihood chi-square tests comparing the three samples

                     % (first)   % (second)   LL       p
Comp2 / BNCS-WA      0.92        0.70         31.44    p < 0.05
Comp4 / BNCS-WA      1.01        0.70         50.37    p < 0.05
Comp2 / Comp4        0.92        1.01         2.06     p > 0.05
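The manual sorting of those into determiner and pronoun uses (Section 2.2) was done on concordance lines by hand, but it can be approximated once the text is POS-tagged. The sketch below is a hypothetical heuristic, not the study's actual procedure; the tag names and the example sentence are invented for illustration. A demonstrative followed, possibly across premodifying adjectives, by a noun is counted as a determiner; anything else is counted as a pronoun.

```python
def classify_those(tagged_tokens, index):
    """Rough determiner/pronoun classification for a demonstrative.

    A demonstrative that premodifies a following noun-phrase head -- a
    noun, possibly preceded by adjectives -- counts as a determiner;
    otherwise it counts as a pronoun.  tagged_tokens is a list of
    (word, pos) pairs; index points at the demonstrative.
    """
    for word, pos in tagged_tokens[index + 1:]:
        if pos == "ADJ":                 # skip premodifying adjectives
            continue
        return "determiner" if pos == "NOUN" else "pronoun"
    return "pronoun"                     # sentence-final: head of its own NP

sent = [("those", "DET"), ("young", "ADJ"), ("people", "NOUN"),
        ("and", "CONJ"), ("those", "DET"), ("who", "PRON"),
        ("disagree", "VERB")]
print(classify_those(sent, 0))   # determiner ("those young people")
print(classify_those(sent, 4))   # pronoun ("those who disagree")
```

A heuristic like this would still need the manual verification described above, since post-modified pronoun uses (those who ..., those of ...) and ellipsis are easy to misclassify.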
References

Alderson, C. and Urquhart, H. 1986. Reading in a Foreign Language. London: Longman.
Aston, G. 1997. "Small and large corpora in language learning". In PALC'97: Practical Applications in Language Corpora, B. Lewandowska-Tomaszczyk and P.J. Melia (eds), 51–62. Łódź: Łódź University Press.
Bernardini, S. 2000. "Systematizing serendipity: Proposals for concordancing large corpora with learners". In Rethinking Language Pedagogy from a Corpus Perspective, L. Burnard and T. McEnery (eds), 225–234. Frankfurt am Main: Peter Lang.
DeKeyser, R.M. 1998. "Beyond focus on form: Cognitive perspectives on learning and practicing second language grammar". In Focus on Form in Classroom Second Language Acquisition, C. Doughty and J. Williams (eds), 42–63. Cambridge: Cambridge University Press.
Leech, G. 1997. "Teaching and language corpora: A convergence". In Teaching and Language Corpora, A. Wichmann, S. Fligelstone, A. McEnery and G. Knowles (eds), 1–23. Harlow: Longman.
Nassaji, H. 2000. "Towards integrating form-focused instruction and communicative interaction in the second language classroom: Some pedagogical possibilities". The Modern Language Journal 84(2):241–250.
Norris, J.M. and Ortega, L. 2000. "Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis". Language Learning 50(3):417–528.
Nunan, D. 1995. "Closing the gap between learning and instruction". TESOL Quarterly 29(1):133–158.
Nunan, D. 2000. "Autonomy in language learning". Plenary presentation, ASOCOPI 2000, Cartagena, Colombia, October 2000.
254 Pascual Pérez-Paredes and Pascual Cantos-Gómez
Pérez-Paredes, P. 2003. "Integrating networked learner oral corpora into foreign language instruction". In Extending the Scope of Corpus-based Research: New Applications, New Challenges, S. Granger and S. Petch-Tyson (eds), 249–261. Amsterdam: Rodopi.
Robinson, P. 1995. "Attention, memory, and the noticing hypothesis". Language Learning 45:283–331.
Savignon, S.J. 1983. Communicative Competence: Theory and Classroom Practice. Reading, MA: Addison-Wesley.
Schmidt, R. 1990. "The role of consciousness in second language learning". Applied Linguistics 11(2):129–158.
Schmidt, R. 1993. "Awareness and second language acquisition". Annual Review of Applied Linguistics 13:206–226.
Skehan, P. 2001. "The role of a focus on form during task-based instruction". In Trabajos en lingüística aplicada, C. Muñoz, M.L. Celaya, M. Fernández-Villanueva, T. Navés, O. Strunk and E. Tragant (eds), 11–24. Barcelona: Univerbook SL.
Stevens, V. 1990. "Text manipulation: What's wrong with it, anyway?". CAELL Journal 1(2):5–8.
Williams, J. 2001. "The effectiveness of spontaneous attention to form". System 29:325–340.
Self-discovery and corpora 255
Appendix 1. Protocol

Part I

                                          Personal   Group     Diff.
Total tokens used                                    159.84
Total types used                                     68.76
Total content words used                             38.12
Token-type ratio                                     2.3135
Token-content word ratio                             4.2022
Types with frequency > 10 used                       1.52
Types with frequency 5–10 used                       6.72
Types with frequency 2–4 used                        20.40
Types with frequency = 1 used                        40.12
Top-type 1 ('the') used                              12.88
Top-type 2 ('and') used                              7.44
Top-type 3 ('she') used                              6.96
Top-type 4 ('is') used                               6.84
Top-type 5 ('picture') used                          5.12
Top-type 6 ('in') used                               4.76
Top-type 7 ('a') used                                4.60
Top-type 8 ('to') used                               4.56
Top-type 9 ('her') used                              3.44
Top-type 10 ('I') used                               3.12
Total top-types used                                 59.72
Top-content word 1 ('picture') used                  5.12
Top-content word 2 ('woman') used                    2.96
Top-content word 3 ('like') used                     2.12
Top-content word 4 ('think') used                    1.76
Top-content word 5 ('painter') used                  1.12
Top-content word 6 ('seems') used                    0.88
Top-content word 7 ('friends') used                  0.84
Top-content word 8 ('man') used                      0.76
Top-content word 9 ('painting') used                 0.80
Top-content word 10 ('portrait') used                0.76
Total top-content words used                         17.12
Text difficulty (Fog-Index)                          13.86
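The Part I measures in the protocol above can be computed mechanically from a transcript. The Python sketch below illustrates this; the FUNCTION_WORDS set is a made-up stand-in for whatever stop list the protocol uses to separate content words from function words, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for the function-word list
# used to isolate content words; illustration only.
FUNCTION_WORDS = {"the", "and", "she", "is", "in", "a", "to", "her", "i",
                  "of", "it", "that", "he", "they"}

def profile(text):
    """Compute token, type and frequency-band measures for one transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    bands = Counter()
    for _, freq in counts.items():
        if freq > 10:
            bands["> 10"] += 1
        elif freq >= 5:
            bands["5-10"] += 1
        elif freq >= 2:
            bands["2-4"] += 1
        else:
            bands["= 1"] += 1
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "content words": len(content),
        "token-type ratio": len(tokens) / len(counts),
        "token-content word ratio": len(tokens) / len(content),
        "types per frequency band": dict(bands),
    }

print(profile("I think the woman in the picture is a painter and "
              "she seems to like the picture"))
```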
Part II. Comment on your oral production

– Note the points you scored above class-average
– Note the points you scored below class-average
– Which features of your oral production are particularly strong?
– Which features of your oral production are less strong?
– Overall, do you think your speaking is above or below class-average? Why?
– Do you think your speaking could improve/benefit if you take some kind of remedial work/exercises? In which way?
Part III

– Using the reference corpus, take your top-five content words and find instances (concordance lines) where these content words have been used with the same, similar and different meanings.
  Same meaning:
  Similar meaning:
  Different meaning:
– Which top-types have you not used? Using the corpus, find three instances of each of these non-used types.
– Which top-content words have you not used? Using the corpus, find three instances of each of these non-used content words.
Appendix II Descriptive data (group scores)
                                    N    Range   Min.   Max.   Mean     S.D.
TOKENS                              25   170     95     265    159.84   47.27
TYPES                               25   66      46     112    68.76    16.65
content words                       25   44      24     68     38.12    10.89
ratio token-type                    25   1.30    1.70   3.00   2.31     0.35
ratio token-content word            25   2.63    3.23   5.87   4.20     0.56
token band 1 (freq. > 10)           25   5       0      5      1.52     1.29
token band 2 (freq. 5–10)           25   13      2      15     6.72     2.98
token band 3 (freq. 2–4)            25   26      12     38     20.40    6.00
token band 4 (freq. 1)              25   38      25     63     40.12    9.93
top type 1 (the)                    25   18      3      21     12.88    5.06
top type 2 (and)                    25   16      1      17     7.44     3.61
top type 3 (she)                    25   19      1      20     6.96     4.42
top type 4 (is)                     25   17      0      17     6.84     4.21
top type 5 (picture)                25   10      0      10     5.12     3.37
top type 6 (in)                     25   13      0      13     4.76     3.59
top type 7 (a)                      25   8       1      9      4.60     2.14
top type 8 (to)                     25   15      0      15     4.56     3.44
top type 9 (her)                    25   9       0      9      3.44     2.35
top type 10 (I)                     25   9       0      9      3.12     2.11
total top types                     25   71      27     98     59.72    18.92
top content word 1 (picture)        25   10      0      10     5.12     3.37
top content word 2 (woman)          25   6       0      6      2.96     1.72
top content word 3 (like)           25   8       0      8      2.12     1.83
top content word 4 (think)          25   6       0      6      1.76     1.61
top content word 5 (painter)        25   4       0      4      1.12     1.13
top content word 6 (seems)          25   4       0      4      0.88     1.20
top content word 7 (friends)        25   2       0      2      0.84     0.55
top content word 8 (man)            25   3       0      3      0.76     0.97
top content word 9 (painting)       25   5       0      5      0.80     1.22
top content word 10 (portrait)      25   4       0      4      0.76     1.13
total top content words             25   19      8      27     17.12    5.99
Student use of large, annotated corpora to analyze syntactic variation Mark Davies Brigham Young University, USA
This study discusses the way in which advanced language learners have used several large corpora of Spanish to investigate a wide range of phenomena involving syntactic variation in Spanish. The course under discussion is taught via the Internet, and is designed around a textbook that contains both descriptive and prescriptive rules of Spanish syntax. The students carry out studies for those syntactic phenomena in which there is supposedly variation – either between registers, between dialects, or where there is currently a historical change underway. The three main corpora used by the students are the 100 million word Corpus del Español (which I have created), the CREA corpus from the Real Academia Española, and searches of the Web via Google. The students have found that each of the three large corpora has its own weaknesses, and that the most effective strategy is to leverage the strengths of each corpus to find the desired data. The students also learn strategies for comparing results across different corpora, and even within components of the same corpus – such as the frequency of occurrences on the Web from different countries.
1. Introduction

A goal of many language learners is to more fully understand the range of syntactic variation in the second language and thus move beyond the simplistic rules that are presented in many textbooks. This effort can be aided by large corpora of the second language, which allow users to easily and quickly extract hundreds or thousands of examples of competing syntactic constructions from millions of words of text in different dialects and registers. Using this data, teachers and students can then have a more realistic picture of how the constructions in question vary from one country to another, whether they are
more common in formal or informal speech, and whether their use is increasing or decreasing over time. This paper provides an overview of the way in which students in a recent online course used three very large sets of corpora – involving hundreds of millions of words of text – to study "Variation in Spanish Syntax". We will examine the goals of the course, how the students carried out their research using several corpora, the way in which they analyzed competing structures from the corpora, and how they were ultimately successful in describing syntactic variation in Spanish.

The issue of how students can be trained as corpus researchers to learn about a foreign language has been the focus of a number of recent articles, including Bernardini (2000, 2002), Davies (2000), Osborne (2000), Kennedy and Miceli (2001, 2002), Kirk (2002), and Kübler and Foucou (2002). Although the issue of "student as researcher" is the focus of our study as well, it differs from many of the previous studies in several ways – the corpus used in this course is much larger (hundreds of millions of words of text), the students are more advanced (graduate students, with many of them language teachers themselves), the focus is on variation rather than norms, and the linguistic phenomena studied in the course deal with complex syntactic constructions, rather than simple collocations. Yet the additional data from our course should help to provide a more complete picture of the different ways in which students can use large corpora to study and analyze the grammar of the second language.

In terms of the organization of the course, the "Variation in Spanish Syntax" class that we will be discussing was offered for the first time in 2002, and was taught online to twenty language teachers from throughout the United States (see http://davies.linguistics.byu.edu/sintaxis).
Each week the students would examine two to four chapters in A New Reference Grammar of Modern Spanish (Butt and Benjamin, 2000), which is an extremely complete reference grammar of Spanish. Table 1 below lists the primary topics for each week's assignments. After reading the assigned chapters from the reference grammar, each student would identify a particular syntactic construction from among the topics for that week, for which Butt and Benjamin indicated there was some type of variation – between geographical dialects, speech registers, or an overall increase or decrease in the use of the construction. Once the students had identified their topic of study, they would then spend the week using three different sets of corpora or web-based search engines to search for data on the
Table 1. Topics in the "Variation in Spanish Syntax" course

Week  Topic                                      Week  Topic
1     Morphology: gender and plurals             8     Progressive, gerund, participles
2     Articles, adjectives, numbers              9     Subjunctive, imperative, conditionals
3     Demonstratives, lo, possessives            10    Infinitives, auxiliaries
4     Prepositions, conjunctions                 11    Ser/estar, existential sentences
5     Pronouns: subject and objects              12    Negation, adverbs, time clauses
6     Pronominal verbs, passives, impersonals    13    Questions, relative pronouns and clauses
7     Indicative verb tenses                     14    Cleft sentences, word order
constructions in question. The corpora were the Corpus del Español (www.corpusdelespanol.org), the CREA and CORDE corpora from the Real Academia Española (www.rae.es), and the web through the Google and Google Groups search engines (www.google.com, groups.google.com) – all of which will be discussed below. Once they had extracted sufficient data for the construction, they would then write a short summary for that week's project. In this summary they would point out the four most important findings from the data, summarize how their findings confirmed or contradicted the claims of Butt and Benjamin regarding variation, and then briefly discuss some of the possible motivations for this syntactic variation. The final step for each project, which was completed the following week, was then to review the projects of three other students. By the end of the semester, each student had carried out fairly in-depth research on fifteen different syntactic constructions involving variation in Modern Spanish, and in addition had reviewed another forty-five studies by other students. Based on the quality of their projects, it seems clear that these corpus-based activities were extremely valuable in helping the students to move beyond simplistic textbook descriptions of Spanish grammar, and to acquire a much better sense of the actual variation in contemporary Spanish syntax.
2. The corpora

The corpora were the foundation for the entire course, and an understanding of the composition and features of each corpus is therefore fundamental to understanding how the students carried out their research.
2.1 The Corpus del Español

The primary corpus for the course was the 100 million word Corpus del Español that I have created, and which was placed online shortly before the start of the semester. The Corpus del Español has a powerful search engine and unique database architecture that allow the wide range of queries shown in Table 2. These include pattern matching (1), collocations (2), lemma and part of speech for nearly 200,000 separate word forms (3), synonyms and antonyms for more than 30,000 different lemmas (4), more complex searches using combinations of the preceding types of searches (5), queries based on the frequency of a construction in different historical periods and registers of Modern Spanish (6), and queries involving customized, user-defined lists (7). Note also that it would take only about 1–2 seconds to run any of these queries against the complete 100 million word corpus. In short, the Corpus del Español is richly annotated and allows searches for many types of linguistic phenomena, which made it extremely useful for the wide range of constructions studied in the "Variation in Spanish Syntax" course.
2.2 CREA/CORDE and Google (Groups)

In addition to the 100 million word Corpus del Español, the students used two other sets of Spanish corpora. The first set is the CREA (Modern Spanish) and CORDE (Historical Spanish) corpora from the Real Academia Española, which contain a combined total of about 200 million words of text. The second set is the Google and Google Groups search engines. While these search engines are of course not limited just to web pages in Spanish, the main Google index covers more than 100 million words of text in Spanish-language web pages, while the Google Groups search engine contains millions or tens of millions of words of text in messages to Spanish newsgroups.
Table 2. Range of searches possible with the “Corpus del Español” 1 est_b* *ndo
estaba cantando, estábamos diciendo
2 lo * possible
"as * as possible" lo mejor posible, lo antes posible, lo máximo posible
3 poder.* *.v_inf
forms of poder (“to be able”) + infinitive puede tener, pudiera escapar
4 !dificil.*
all forms of all synonyms of difícil “difficult” imposible, duros, compleja, complicadas, . . .
5 estar.* !cansado.* *.prep *.v_inf
any form of estar + any form of any synonym of cansado (“tired”) + preposition + infinitive estoy harto de vivir, estaba cansada de escuchar
6 *.adv {19misc>5 19oral=0}
all adverbs that occur more than five times in newspapers or encyclopedias from the 1900s, but not in spoken texts from the 1900s inversamente, clínicamente
7 le/les [Bill.Jones:causative] *rse
le or les + a customized list of [causative] verbs created by [Bill.Jones] + words ending in [-rse] le mandaban ponerse, les hace sentirse
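The wildcard queries of type (1) in Table 2 can be emulated over plain text with ordinary regular expressions. The sketch below is only an illustration of the pattern-matching query type, assuming '*' matches any run of letters and '_' exactly one, as in the examples above; it is not the corpus's actual search engine, which runs against an annotated relational database.

```python
import re

def wildcard_to_regex(query):
    """Compile a simple wildcard phrase query into a regex.

    '*' matches any run of word characters, '_' exactly one,
    mirroring the pattern-matching examples in Table 2; all other
    characters are taken literally.
    """
    word_patterns = []
    for word in query.split():
        pattern = ""
        for ch in word:
            if ch == "*":
                pattern += r"\w*"
            elif ch == "_":
                pattern += r"\w"
            else:
                pattern += re.escape(ch)
        word_patterns.append(pattern)
    return re.compile(r"\b" + r"\s+".join(word_patterns) + r"\b")

text = ("estaba cantando en la calle ; estábamos diciendo la verdad ; "
        "estoy cantando ahora")
print(wildcard_to_regex("est_b* *ndo").findall(text))
# ['estaba cantando', 'estábamos diciendo']
```

Python's Unicode-aware `\w` matches accented letters such as á, which is essential for Spanish word forms like estábamos.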
The main advantage that CREA/CORDE and Google (Groups) have over the Corpus del Español is their ability to limit searches to specific countries. As we will see in Section 3.3, this is useful for students who want to look at the relative frequency of constructions in different geographical dialects. In addition, CREA and CORDE allow users to compare the relative frequency of constructions in several different registers, beyond just the three divisions of the Corpus del Español (literature, spoken, newspapers/encyclopaedias). The main disadvantage of CREA/CORDE and Google (Groups) is that they are not annotated in any way, which makes it impossible to search for syntactic constructions using lemma or part of speech. Even the wildcard features of both sets of search engines are rather limited, which means that it is also impossible to look for morphologically similar parts of speech or lemmas. Both CREA/CORDE and Google (Groups) are really only useful in searching for exact phrases. However, as we will see in Section 3.3, when they are used in
conjunction with the highly annotated Corpus del Español, the overall collection of corpora does permit students to carry out detailed searches on syntactic variation for a very wide range of constructions in Spanish.
3. Student outcomes: Examining syntactic variation through corpus use

In this section, we will consider some of the challenges that the students faced in using the corpora effectively, how they overcame these obstacles, and how they were ultimately successful in carrying out more advanced research in Spanish, thus moving beyond the simplistic rules of many introductory and intermediate level textbooks. In the sections that follow, we will use as an example the question of clitic placement, in which the clitic can either be pre-posed or post-posed (lo quiero hacer vs. quiero hacerlo; "I want to do it"), and which exhibits variation that is dependent on dialect, register, and several functional factors (cf. Davies 1995, 1998).
3.1 Learning to use the corpora

The first challenge facing the students was simply learning to use the different corpora successfully, in order to limit searches and extract the desired information. To help the students, during the first week of class they spent several hours completing two rather lengthy "scavenger hunt" quizzes using the Corpus del Español, the CREA and CORDE corpora, and Google (Groups). Each question would outline the type of data that they would look for from a particular corpus. For example, one of the questions dealing with the Corpus del Español asked them to look for the most common phrases involving any object pronoun followed by any form of any synonym of querer ("to want") followed by an infinitive. The students would be responsible for looking at the help file to see how to search for lemma and parts of speech, and would then hopefully use the correct search syntax to query the corpus, in this case [ *.pn_obj !querer.* *.v_inf ]. As a hint to make sure that the students were on the right track in their search, they were told what the first two entries in the results set would be (e.g., te quiero decir, le quiero decir), and they could use this to check their results. They were then responsible for providing the third entry from the list (in this
case, me quiere decir). By the time they had answered sixty such questions for the different corpora during the first week, they were ready to use the corpora to extract data on syntactic constructions of their choosing. By examining Table 1 we can see that the topics in the course were arranged so that the students would start with constructions that were relatively easy to find, moving to ones that became gradually more abstract and difficult as the semester progressed. For example, at the beginning of the semester students started at the word-internal level (morphological variation for gender and number), and then moved to constructions involving adjacent words (e.g., demonstratives and possessives), then semantically more complex localized constructions (e.g., pronominal verbs), then even less well-defined structures (e.g., subjunctives), ending up with fairly abstract and less localized constructions (e.g., cleft sentences and word order).
3.2 Formulating the research question

One of the hardest parts of carrying out linguistic research is knowing how to frame the question and set up the actual search of the corpora. To help the students, during the first four weeks of the course I required that they send me a Plan de Trabajo (Work Plan) before they started the research in earnest. In a short paragraph they would first briefly outline what type of variation was described in the reference grammar. They would then indicate which corpora would be most useful for examining the variation, and show exactly what type of queries would be run against these corpora to extract the data. Sometimes there were problems with the general research question – for example, the topic was much too wide or too narrow. Returning to the clitic placement construction, for example, they might propose to look at clitic placement with all main verbs (too wide) or with just three or four exact phrases (too narrow). Sometimes they intended to use a corpus that was not the best one for the question at hand, or else they had the wrong search syntax. In all such cases, I would help them frame the search correctly before they started. Once they had received this feedback, they were then ready to start the search itself. This procedure seemed to prevent a lot of wasted time and frustration on the part of the students as they were learning to use the corpora. By the end of the fourth week of using the corpora, however, most of the students had sufficient experience in framing research questions and setting up queries, and it then became optional to submit a work plan before consulting the corpora.
266 Mark Davies
3.3 Extracting data from multiple corpora
The students soon learned that the most productive research incorporated searches of all three sets of corpora, using each corpus for the purposes for which it was most useful. Typically, the students would start searching with the Corpus del Español, because it is the only one of the three that is annotated. For example, if they were examining variation in clitic placement, they could search for all cases of pre-positioning ("clitic climbing") (1a) or postpositioning (1b) with all of the synonyms of a particular verb:

(1a) [ *.pn_obj !querer.* *.v_inf ]
(1b) [ !querer.* *.v_inf_cl ]
lo quiero hacer, me preferían hablar / quiero hacerlo, preferían hablarme
"I want to do it"; "they preferred to talk to me"
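The annotated queries in (1a) and (1b) can be emulated over any POS-tagged text. The sketch below uses an invented word_TAG notation and ordinary regular expressions purely for illustration; it does not reproduce the Corpus del Español's actual query engine or tagset.

```python
import re

# Tiny POS-tagged sample in word_TAG form; the tags are invented
# for illustration and are not the Corpus del Espanol tagset.
tagged = ("lo_PN_OBJ quiero_VQUERER hacer_V_INF "
          "quiero_VQUERER hacerlo_V_INF_CL")

# (1a) clitic climbing: object pronoun + form of querer + bare infinitive
climb = re.findall(r"(\w+)_PN_OBJ (\w+)_VQUERER (\w+)_V_INF\b(?!_)", tagged)

# (1b) postposition: form of querer + infinitive with attached clitic
postpose = re.findall(r"(\w+)_VQUERER (\w+)_V_INF_CL", tagged)

print(climb)     # [('lo', 'quiero', 'hacer')]
print(postpose)  # [('quiero', 'hacerlo')]
```

The same pattern-over-tags idea underlies any annotated-corpus query: the tags, not the word forms, define the construction being matched.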
The students would see all of the matching phrases for both constructions and could easily compare the relative frequency across historical periods – to see whether one construction or the other is increasing over time – and they could also compare the relative frequency in the three general registers of literature, spoken, and newspapers/encyclopaedias. In order to carry out even more detailed investigations of register or dialectal variation, however, the students often turned to the CREA/CORDE or Google (Groups) corpora. Because these corpora only allow searches of exact words and phrases, however, the student would need to select individual phrases from the lists generated in the Corpus del Español (e.g., lo quiero hacer vs. quiero hacerlo; "I want to do it"), and then search for these individual phrases one by one. Although somewhat cumbersome, this would allow students to compare the relative frequency of specific phrases in more than twenty Spanish-speaking countries and (in the case of CREA/CORDE), in a much wider range of register subdivisions than in the Corpus del Español. The two-step process – starting with the Corpus del Español and then using its data (when necessary) to search for individual phrases in CREA/CORDE and Google (Groups) – consistently yielded the best results for the students.
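The two-step process lends itself to a simple script. The sketch below is hypothetical throughout: both query functions are stand-ins for the corpora's real web interfaces, and the embedded counts are invented for demonstration.

```python
def phrases_from_annotated_corpus(pattern):
    """Step 1 (placeholder): an annotated-corpus query such as
    [ *.pn_obj !querer.* *.v_inf ] would return matching phrases."""
    return ["lo quiero hacer", "la quiero ver", "me quiere decir"]

def exact_phrase_count(corpus, phrase, country):
    """Step 2 (placeholder): an exact-phrase lookup in CREA/CORDE or
    Google Groups, restricted to one country's texts. The demo table
    below holds invented counts."""
    demo = {("CREA", "lo quiero hacer", "Spain"): 120}
    return demo.get((corpus, phrase, country), 0)

def two_step_search(pattern, corpus, countries):
    """Enumerate phrases with the annotated corpus, then tally each
    exact phrase per country in the larger, unannotated corpus."""
    counts = {c: 0 for c in countries}
    for phrase in phrases_from_annotated_corpus(pattern):
        for c in countries:
            counts[c] += exact_phrase_count(corpus, phrase, c)
    return counts

print(two_step_search("[ *.pn_obj !querer.* *.v_inf ]",
                      "CREA", ["Spain", "Mexico"]))
```

The division of labour mirrors the students' workflow: the annotated corpus supplies the candidate strings, and the larger corpora supply the dialectal and register breakdowns.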
Student use of large, annotated corpora to analyze syntactic variation 267
3.4 Organizing the data
After running the queries on the different corpora, the next step was to organize the data so that it would confirm or deny Butt and Benjamin's claims about syntactic variation in Spanish. At the beginning, this was rather difficult for some students. They might examine two competing syntactic constructions in different geographical dialects of Spanish – for example [clitic + main verb + infinitive] ("Type A") vs. [main verb + infinitive + clitic] ("Type B"). In their attempt to show whether Type A or Type B was more common in different dialects, they might discover that CREA or Google had many more examples of Type A from Spain than from Mexico or Argentina. Inexperienced students might interpret this to mean that Type A was more common in Spain than in Mexico or Argentina, without realizing that there are more examples from Spain simply because the textual corpus from Spain is so much larger than that of other countries. Of course, the issue is not the relative frequency of Type A in Spain vs. the relative frequency of Type A in Mexico or Argentina, but rather the relative frequency of Type A vs. Type B in each of these three countries. Once students learned to correctly use relative frequencies to compare geographical dialects, different registers, or different historical periods, they were on much firmer footing as regards making valid judgments about the data.
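The normalization issue is easy to state in code. With the invented counts below (not real corpus data), Spain has by far the most raw Type A hits, yet Mexico shows the higher relative frequency of Type A: exactly the distinction the inexperienced students missed.

```python
# Hypothetical hit counts for the two competing constructions.
# Corpus sizes differ by country, so compare Type A against Type B
# *within* each country rather than Type A counts *across* countries.
raw = {
    "Spain":     {"A": 900, "B": 300},
    "Mexico":    {"A": 200, "B": 50},
    "Argentina": {"A": 80,  "B": 40},
}

for country, hits in raw.items():
    share_a = hits["A"] / (hits["A"] + hits["B"])
    print(f"{country}: Type A = {share_a:.0%} of A+B tokens")
```

Here Spain's 900 raw hits yield a 75% share, while Mexico's 200 hits yield 80%: the smaller corpus actually favours Type A more strongly.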
3.5 Drawing conclusions regarding variation
Once the data had been organized correctly, students were responsible for summarizing the most important findings from the data, and for suggesting whether the data confirmed or denied the original hypothesis regarding variation with the particular syntactic construction. During the first two or three weeks, students found it very difficult to clearly and concisely summarize the findings, and instead preferred to hope that quantity equaled quality. Therefore, starting in the third week I limited them to only four short sentences to explain the major findings from the data. In addition, they were asked to include a two- or three-sentence conclusion, which showed whether the four points just mentioned confirmed or denied the hypothesis from Butt and Benjamin regarding syntactic variation. My sense was that this "bottom-line assessment" was very useful in helping them to organize their data collection and written summary.
3.6 "Explaining" the syntactic variation
If students were able to accomplish all of the preceding tasks, they were judged to have been successful in carrying out the research. Starting in about the fifth week, however, they were presented with an additional goal, which was to suggest possible motivations for the syntactic variation – whether it was geographical, register-based, or diachronic in nature. The second textbook that was required for the course – in addition to the reference grammar – was Spanish-English Contrasts (Whitley 2002), which is an overview of recent research on a wide range of syntactic constructions in Spanish. To the extent possible, the students were asked to use any of the more theory-based explanations in Whitley for the particular construction in question, and see whether this might be useful in helping to "explain" variation. For example, with the clitic placement construction, they might realize that placement is a function of the semantics of the main verb, with semantically light verbs allowing clitic climbing more often (cf. Davies 1995, 1998). In some cases students were able to identify possible causal factors, but in other cases it was simply sufficient to point out the actual variation and leave it at that. Even in these cases, however, the data that they presented was often more complete than that found in previous studies by much more accomplished researchers, simply because of the size and power of the corpora that were available to the students in the course.
4. Conclusions
As was explained in the introduction, there has been recent interest in the way in which students can be "trained" as corpus linguists to extract data from the foreign language. Many of these studies have focused on intermediate level students looking for "correct rules" for simple constructions in relatively small corpora. In the course that we have described, however, the graduate-level students focused on variation from the norm with rather complex constructions in hundreds of millions of words of data. In spite of the differences between this course and those described in previous studies, the hope is that our experience might provide insight into how students can use corpora to perform advanced research on the foreign language. As we have seen, if there is proper guidance and feedback, even students who are relatively inexperienced in syntactic research can be trained to use the
corpora, formulate research questions and search strategies, organize the data, confirm or deny previous claims about language variation, and perhaps even begin to find motivations for this variation. In accomplishing these goals, these students have been successful in moving beyond the simplistic, prescriptivist rules found in many textbooks, and have begun to use corpora to acquire a much more accurate view of the syntactic complexity of the foreign language.
References
Bernardini, S. 2000. "Systematising serendipity: Proposals for large-corpora concordancing with language learners". In Burnard and McEnery, 225–234.
Bernardini, S. 2002. "Exploring new directions for discovery learning". In Kettemann and Marko, 165–182.
Burnard, L. and McEnery, T. (eds). 2000. Rethinking Language Pedagogy from a Corpus Perspective [Łódź Studies in Language, Vol. 2]. Frankfurt am Main: Peter Lang.
Butt, J. and Benjamin, C. 2000. A New Reference Grammar of Modern Spanish. 3rd edition. Chicago: McGraw-Hill.
Davies, M. 1995. "Analyzing syntactic variation with computer-based corpora: The case of modern Spanish clitic climbing". Hispania 78: 370–380.
Davies, M. 1998. "The evolution of Spanish clitic climbing: A corpus-based approach". Studia Neophilologica 69: 251–263.
Davies, M. 2000. "Using multi-million word corpora of historical and dialectal Spanish texts to teach advanced courses in Spanish linguistics". In Burnard and McEnery, 173–186.
Kennedy, C. and Miceli, T. 2001. "An evaluation of intermediate students' approaches to corpus investigation". Language Learning and Technology 5: 77–90.
Kennedy, C. and Miceli, T. 2002. "The CWIC project: Developing and using a corpus for intermediate Italian students". In Kettemann and Marko, 183–192.
Kettemann, B. and Marko, G. (eds). 2002. Teaching and Learning by Doing Corpus Analysis (Proceedings of the Fourth International Conference on Teaching and Language Corpora, Graz, 19–24 July 2000). Amsterdam: Rodopi.
Kirk, J. 2002. "Teaching critical skills in corpus linguistics using the BNC". In Kettemann and Marko, 155–164.
Kübler, N. and Foucou, P.-Y. 2002. "Linguistic concerns in teaching with language corpora: Learner corpora". In Kettemann and Marko, 193–203.
Osborne, J. 2000. "What can students learn from a corpus? Building bridges between data and explanation". In Burnard and McEnery, 165–172.
Whitley, M. S. 2002. Spanish-English Contrasts. Washington, D.C.: Georgetown University Press.
Facilitating the compilation and dissemination of ad-hoc web corpora 271
A future for TaLC?
272 William H. Fletcher
Facilitating the compilation and dissemination of ad-hoc web corpora William H. Fletcher
United States Naval Academy,1 USA
Since the World Wide Web gained prominence in the mid–1990s it has tantalized language investigators and instructors as a virtually unlimited source of machine-readable texts for compiling corpora and developing teaching materials. The broad range of languages and content domains found online also offers translators enormous promise both for translation-by-example and as a comprehensive supplement to published reference works. This paper surveys the impediments which still prevent the Web from realizing its full potential as a linguistic resource and discusses tools to overcome the remaining hurdles. Identifying online documents which are both relevant and reliable presents a major challenge. As a partial solution the author’s Web concordancer KWiCFinder automates the process of seeking and retrieving webpages. Enhancements which permit more focused queries than existing search engines and provide search results in an interactive exploratory environment are described in detail. Despite the efficiency of automated downloading and excerpting, selecting Web documents still entails significant time and effort. To multiply the benefits of a search, an online forum for sharing annotated search reports and linguistically interesting texts with other users is outlined. Furthermore, the orientation of commercial search engines toward the general public makes them less beneficial for linguistic research. The author sketches plans for a specialized Search Engine for Applied Linguists and a selective Web Corpus Archive which build on his experience with KWiCFinder. He compares his available and proposed solutions to existing resources, and surveys ways to exploit them in language teaching. Together these proposed services will enable language learners and professionals to tap into the Web effectively and efficiently for instruction, research and translation.
1. Aperitivo
Aston (2002) compares learner-compiled corpora to professionally produced corpora through a memorable analogy to fruit salad. While home-made fruit salad (and corpora) can entail various benefits he enumerates, the off-the-shelf variety offers reliability and convenience, supplemented in its corpus analogue by documentation and specialized software. He proposes that learners can follow a compromise "pick'n'mix" strategy, compiling their own customized subcorpora from professionally selected materials. By now this alimentary analogy (but by no means the strategy) must have passed its "best-by" date, yet I cannot resist adapting it to the World Wide Web. Food-borne analogies seem very appropriate for a conference in Bertinoro, the historic town of culinary and oenological hospitality, so I begin and end on this note. For years the Web has tantalized language professionals, offering a boundless pool of texts whose fruitful exploitation has remained out of reach. It is like an old-fashioned American community pot-luck supper, to which each family brings a dish to share with the other guests. As a child at such events I would taste many dishes in search of the most flavourful; usually I wasted my appetite sampling mediocre fare. Similarly I have spent countless hours online seeking and sifting through webpages, too often squandering my time, then giving up, sated yet unsatisfied. Frustration with finding useful content in the World Wide Haystack inspired me to design and implement the Web concordancing tools and strategies described here, which enable users to compile ad-hoc corpora from webpages.2 Reflection on essential needs unmet by this model has led me to chart the course for future development to make sharing of Web corpora easier and more rewarding, and to outline an infrastructure for a search engine tailored to the needs of language professionals and learners.
My conviction is simple: if online linguistic research can be made effective and efficient, linguists and learners will not have to take pot-luck with what they find on the Web by chance.
2. Web as corpus?!
A haphazard accumulation of machine-readable texts, the World Wide Web is unparalleled for quantity, diversity and topicality. This ever-expanding body of documents now encompasses at least 10 billion (10⁹) webpages publicly available via links, with several times that number in the "hidden" Web accessible only through database queries or passwords. Once overwhelmingly Anglophone, the Web now encompasses languages used by a majority of the world's population. Currently native English speakers account for only 35% of Web users, and their relative prominence is dwindling as the Web expands into more non-western language areas.3 Online content covers virtually every knowledge domain of interest to language professionals or learners, and incorporates contemporary issues and emerging usage rare in customary sources. With all the Web offers, why have all but a handful of corpus linguists and language professionals failed to exploit this vast potential source for corpora?4 Surely the effort required to locate relevant, reliable documents outweighs all other explanations for this neglect. The quantity of information online greatly surpasses its overall quality. Unpolished ephemera abound alongside rare treasures, and online documents generally seem to consist more of accumulations of fragments, stock phrases and bulleted lists than of original extended text. Among the longer coherent texts, specialized genres such as commercial, journalistic, administrative and academic documents predominate. Assessing the "authoritativeness" of a webpage – the accuracy of its content and representativeness of its linguistic form – demands time and expertise. Despite these challenges, there are compelling reasons to supplement existing corpora with online materials. A static corpus represents a snapshot of issues and language usage known when it was compiled.
The great expense of setting up a large corpus precludes frequent supplementation or replacement, and contemporary content can grow stale quickly. In contrast, new documents appear on the Web daily, so up-to-date content and usage tend to be well represented online. In addition, even a very large corpus might include few examples of infrequent expressions or constructions that can be found in abundance on the Web. Moreover, certain content domains or text genres may be underrepresented in an existing corpus or even missing entirely. With the Web as a source one usually can locate documents from which to compile an ad-hoc corpus to meet the specific needs of groups of investigators, translators or learners. Finally, while existing corpora may entail significant fees and
require specialized hardware and software to consult, Web access is generally inexpensive, and desktop computers to perform the necessary processing are now within the reach of students as well as researchers.
3. Locating forms and content online
3.1 Established techniques
Marcia Bates' "information search tactics" can be adapted to categorize typical approaches to finding useful material online (Fletcher 2001b). Hunting, or searching directly for specific forms or content online, appears to be the most widely-used and productive strategy. For specialized content, grazing, i.e., focusing on predetermined reliable websites, has also proved an effective strategy for corpus construction.5 In contrast to these goal-oriented tactics, browsing, the archetypal Web activity, relies on serendipity for the user to discover relevant material. What follows shows how all three strategies can be improved to make the Web a more accessible corpus for language research and learning. "Hunting" via Web searches is the most effective means of locating online content. Unfortunately this strategy depends on commercial search engines and thus is limited by their quirks and weaknesses. A dozen main search engines aspire to "crawl" and map the entire Web, yet none indexes more than roughly a fifth of the publicly-accessible webpages. Thousands of specialized search engines focus on narrower linguistic, geographic or knowledge domains. The search process is familiar to all Web users: first one formulates a query to find webpages with specific words or phrases and submits it to a search engine. Some search engines support "smart features" for a few major languages, for example to search automatically for synonyms or alternate word forms ("stemming"). Meta-search engines query several search engines simultaneously, then "collapse" the results into a single list of unique links. In all cases, however, the user must still retrieve and evaluate the documents individually for relevance and reliability. Beyond the tedium of winnowing the wheat from the chaff, this search-and-select strategy has several flaws, starting with the port-of-entry to the Web.
Search engines are not research libraries but commercial enterprises targeted at the needs of the general public. The availability and implementation of their
services change constantly: features are added or dropped to mimic or outdo the competition; acquisitions and mergers threaten their independence; financial uncertainties and legal battles challenge their very survival. The search sites' quest for revenue can diminish the objectivity of their search results, and various "page ranking" algorithms may lead to results that are not representative of the Web as a whole.6 Most frustrating is the minimal support for the requirements of serious researchers: current trends lead away from sites like AltaVista supporting sophisticated complex queries (which few users employ) to ones like Google offering only simple search criteria. In short, the search engines' services are useful to investigators by coincidence, not design, and researchers are tolerated on mainstream search sites only as long as their use does not affect site performance adversely.
3.2 KWiCFinder Web concordancer
To overcome some limitations of general-purpose search engines and to automate aspects of the process of searching and selecting I have developed the search agent KWiCFinder, short for Key Word in Context Finder. This free research tool7 helps users create a well-formed query and submits it to the AltaVista search engine. It then retrieves and produces a KWiC concordance of 5–15 online documents per minute without further attention from the user; dead links and documents whose content no longer matches the query are excluded from this search report. Here I discuss how it enhances the search process for language analysis as background to the proposals advanced in the solutions sections below; for greater detail see the website referenced in the previous note and Fletcher 2001b.
3.2.1 Searching with KWiCFinder
To streamline the document selection process, KWiCFinder features more narrowly focused search criteria than commercial search sites. For example, AltaVista supports the wildcard *, which stands for any sequence of zero to five characters. KWiCFinder adds the wildcards ? and %, which represent "exactly one" and "zero or one" characters respectively. In an AltaVista query, lower-case and "plain" characters match their upper-case and accented counterparts, so that e.g., a in a query would match any of aáâäàãæåAÁÂÄÀÃÆÅ. KWiCFinder introduces the "sic" option, which forces an exact match of lower-case and "plain" characters. For example, choosing "sic" distinguishes
the past tense of the German passive auxiliary wurde from the subjunctive auxiliary würde, and both are kept separate from the noun Würde "dignity". Similarly, KWiCFinder supports the operators BEFORE and AFTER in addition to AltaVista's NEAR to relate multiple search terms, and permits the user to specify how many words may separate them. These enhancements do come at a price: KWiCFinder must submit a standard query to AltaVista and retrieve all matching documents, then filter out webpages not meeting the narrower search criteria. In extreme cases, dozens of webpages must be downloaded and analyzed to find one that matches the searcher's query exactly. Obviously the most efficient searches forgo wildcards by specifying and matching variant forms exactly. Especially in morphologically rich languages, entering all possible variants into a query can be most tedious. KWiCFinder introduces three types of "tamecards," a shorthand notation for such variants. A simple tamecard pattern is entered between [ ], with variants separated by commas: work[,s,ed,ing] is expanded to work OR works OR worked OR working, but it does not match worker, workers, as would wildcard work*. Indexed tamecards appear between { }; each variant is combined with the corresponding variant in other indexed tamecards in the same query field. For instance, {me,te,se} lav{o,as,a} expands only to the Spanish reflexive forms me lavo, te lavas, se lava, but not to non-reflexive te lavo or ungrammatical *se lavo. KWiCFinder's "implicit tamecards" with hyphen or apostrophe match forms both with and without the punctuation mark and/or space: on-line matches the common variants on line, online, on-line.
This is particularly useful for English, with its great variation in writing compounds with and without spaces and hyphens, and for German, where the new spelling puts asunder forms that formerly were joined: kennen-lernen matches both traditional kennenlernen and reformed kennen lernen, which coexist in current practice and are reunited in a KWiCFinder search. As a final implicit set of tamecards KWiCFinder recognizes the equivalence of some language-specific orthographic variants, such as German ß and ss, ä ö ü and ae oe ue.
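Tamecard expansion is straightforward to model. The sketch below reimplements the two explicit tamecard types in Python as a functional approximation for illustration; it is not KWiCFinder's own code.

```python
import re

def expand_simple(pattern):
    """Expand a simple tamecard: 'work[,s,ed,ing]' becomes
    ['work', 'works', 'worked', 'working']."""
    m = re.fullmatch(r"(\w*)\[([^\]]*)\](\w*)", pattern)
    stem, variants, tail = m.group(1), m.group(2).split(","), m.group(3)
    return [stem + v + tail for v in variants]

def expand_indexed(pattern):
    """Expand indexed tamecards: '{me,te,se} lav{o,as,a}' pairs the
    i-th variant of each braced group, yielding only the reflexive
    combinations ['me lavo', 'te lavas', 'se lava']."""
    groups = [g.split(",") for g in re.findall(r"\{([^}]*)\}", pattern)]
    template = re.sub(r"\{[^}]*\}", "{}", pattern)
    return [template.format(*combo) for combo in zip(*groups)]

print(expand_simple("work[,s,ed,ing]"))
print(expand_indexed("{me,te,se} lav{o,as,a}"))
```

The index-wise pairing in `expand_indexed` is what rules out non-reflexive *te lavo* and ungrammatical *se lavo*: variants are matched by position, never crossed.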
3.2.2 Exploring form and content with KWiCFinder
KWiCFinder complements AltaVista by focusing searches to increase the relevance of webpages matched. The typical "search and select" strategy requires one to query a search engine, then retrieve and evaluate webpages one by one. KWiCFinder accelerates this operation by fetching and excerpting matching documents for the user. Even with a KWiC concordance of webpages, however, the language samples still must be considered individually and selected for usefulness. KWiCFinder's browser-based interactive search reports allow one to evaluate large numbers of documents efficiently. The data are encoded in XML format, so results from a single search can be transformed into various "views" or formats for display in a Web browser, from "classic concordance" – one line per citation, centred on the key word or phrase – to table or paragraph layout with key words highlighted. Navigation buttons facilitate jumping from one example to the next. In effect, KWiCFinder search reports constitute mini ad-hoc corpora which can include substantial context for further linguistic investigation. Users can add comments to relevant citations and documents, call up original or locally saved copies of webpages for further scrutiny, and select individual citations for retention or elimination from the search report. Browser-based JavaScript tools are integrated into the search report to support exhaustive exploration and simple statistical analysis of the co-text. User-enhanced search reports can be saved as stand-alone HTML pages for sharing with students or colleagues, who in turn can annotate, supplement, save and share them. By merging concordanced content and investigative software into a single HTML document that runs in a browser, KWiCFinder interactive search reports remain accessible to users of varying degrees of sophistication and achieve a significant degree of platform independence.8
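The "classic concordance" view (one line per citation, centred on the key word) can be approximated in a few lines. The function below is an illustrative sketch, not KWiCFinder's XML-based report generator.

```python
def kwic(text, keyword, width=30):
    """Return classic one-line-per-hit KWiC citations: the keyword
    bracketed between `width` characters of left and right co-text."""
    lines, lower = [], text.lower()
    start = 0
    while (i := lower.find(keyword.lower(), start)) != -1:
        left = text[max(0, i - width):i].rjust(width)
        right = text[i + len(keyword):i + len(keyword) + width].ljust(width)
        lines.append(f"{left} [{text[i:i + len(keyword)]}] {right}")
        start = i + len(keyword)
    return lines

sample = ("The corpus shows that the corpus linguist "
          "consults the corpus daily.")
for line in kwic(sample, "corpus", width=20):
    print(line)
```

Padding both context windows to a fixed width is what makes the key words align vertically, the property that lets a reader scan hundreds of citations quickly.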
4. Language-oriented Web search: challenges and solutions
4.1.1 Challenge I: time and effort
Each generation of computers has made us users more impatient: we have grown accustomed to accessing information instantly, and a delay of seconds can seem interminable. Tools such as KWiCFinder can download and excerpt several pages a minute, where the exact value of "several" depends on connection speed, document size and processing capability. Frequently I investigate a linguistic question or look for appropriate readings for my students by searching for and processing 100 or more webpages in 10–15 minutes. For example, to compile a sample corpus of Web documents, I downloaded 11,201 webpages in an afternoon, while I was teaching, through unattended simultaneous searches. Typically I run such searches while doing something else and
peruse the results when convenient. Unfortunately this strategy is inadequate for someone like a translator with an immediate information need, and it can be costly for a user who pays for time online by the minute.9 Downloading and excerpting webpages can be accelerated. In an ongoing study based on my sample Web corpus I have evaluated various "noise-reduction" techniques to improve the usefulness of documents fetched from the Web (Fletcher 2002). Document size is the simplest and most powerful predictor of usability: webpages of 3–150 KB tend to yield more connected text, while smaller or larger files have a higher proportion of non-textual overhead or noise, as well as a higher ratio of HTML file size to text size. Since document size can be determined before a file is fetched, one could restrict downloads to the most productive size range and achieve tremendous bandwidth savings. While this and other techniques will realize further efficiencies in search agents, even an automated search and concordancing tool like KWiCFinder remains too slow to be practical for some purposes. Furthermore, formulating a targeted query and evaluating online documents and citations for reliability, representativeness and relevance to a specific content domain, pedagogical concern or linguistic issue can require a significant investment of time and effort. If a search addresses a question of broader interest, the resulting search report and analysis should be shared with others. While one can easily save such reports as HTML files for informal dissemination, there is no mechanism for "weblishing" them or informing interested colleagues about them. Moreover, the relevant, reliable webpages selected by a searcher are likely to lead to productive further exploration and analysis of related issues and to contain valuable links to additional resources, yet as things now stand they will be found in future searches only by coincidence.
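The size heuristic can be applied before committing any bandwidth, since HTTP servers usually report Content-Length in response to a HEAD request. The sketch below is a hypothetical helper, not part of KWiCFinder; the 3–150 KB thresholds come from the study cited above.

```python
from urllib.request import Request, urlopen

MIN_BYTES, MAX_BYTES = 3 * 1024, 150 * 1024  # the 3-150 KB heuristic

def in_productive_range(size):
    """Apply the size heuristic; unknown sizes pass by default."""
    return size is None or MIN_BYTES <= size <= MAX_BYTES

def worth_fetching(url, timeout=10):
    """HEAD-request a URL and test its Content-Length against the
    heuristic before downloading the full document."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            size = resp.headers.get("Content-Length")
    except OSError:
        return False
    return in_productive_range(int(size) if size is not None else None)
```

A search agent would call `worth_fetching` on each candidate link, skipping tiny navigation stubs and oversized files without ever transferring their bodies.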
How can the value added by the (re)searcher be recovered?
4.1.2 Solution I: Web Corpus Archive (WCA)
To help searchers with an immediate information need and as a forum for sharing search results I intend to establish an online archive of Web documents which collects, disseminates and builds on users' searches. KWiCFinder will add the capability for qualified users to upload search reports with broader appeal to this Web Corpus Archive (WCA). In brief comments, users will describe the issues addressed, classify the webpages by content domain, and summarize the insights gained by analyzing the documents. Such informal weblications will enable language professionals and learners worldwide to
profit from an investigator's efforts. This model extends Tim Johns' concept of "kibbitzers", ad-hoc queries from the British National Corpus designed to clarify some fine point of word usage or grammar, complemented by analysis and discussion of the evidence which he saves and posts online (Johns 2001). Whenever a user uploads a search report to benefit the user community, the WCA server will download the source documents from the Web and archive them, preserving the original content from "link rot" and enabling others to verify and reanalyze the original data. Since much of a webpage's message is conveyed by elements other than raw text – images, layout, colour, sounds, interactivity – these elements should be preserved as well. Links from these pages to related content will be explored to extend the scope of content archived. Since this growing online body of webpages selected for reliability and classified by content domain will reside on a single server, it can provide fast, sophisticated searches within the WCA, yielding browser-based interactive search reports similar to those produced by KWiCFinder. Fruitless searches will be submitted to other search engines to locate additional webpages for inclusion in the Web Corpus Archive. Data on actual user searches with KWiCFinder and on my "Phrases in English" site (Fletcher 2004) would also expand the archive. Available topic recognition and text summarization software could be harnessed to classify and evaluate these automatically retrieved documents. Clearly obtaining permission from all webpage creators to incorporate their material into an archive is unfeasible, which raises the question whether this repository would infringe on copyright. Including entire webpages without permission in a corpus distributed on CD-ROM would obviously be illegal – and unethical to boot.
But providing a KWiC concordance via the Web of excerpts from webpages cached in their entirety on a "corpus server" clearly falls well within currently accepted practice. While not a legal expert, I do note that for years search engines like Google and AltaVista have included brief KWiC excerpts from documents in their search reports with impunity. Indeed, both Google and Internet Archive (a.k.a. the Wayback Machine, http://web.archive.org) serve up entire webpages and even images from their cache on demand. Both these sites' policy statements suggest an implied consent from webpage owners to cache and pass on content if the site has no standard Web exclusion protocol "robots.txt" file prohibiting this practice and if the document lacks a meta-tag specifying limitations on caching. They assert this right in daily practice and defend it when necessary in court. Internet Archive's FAQ
282 William H. Fletcher
explicitly claims that its archive does not violate copyright law, and in accordance with the Digital Millennium Copyright Act it provides a mechanism for copyright holders to request removal of their material from the site as well.10 Besides these familiar sites rooted in the information industry, libraries and institutes in various countries are establishing national archives of online documents to preserve them for future generations. The co-founder of one such repository (who understandably prefers anonymity) has confided in me that his group will proceed despite the unclear legality of their endeavour. Eventually legislation or litigation will clarify the status of Web archives, a recurring topic on the Internet Archive’s [archivists-talk] mailing list.11 Optimistically I assume that a Web-accessible corpus for research and education derived from online documents retrieved by a search agent in ad-hoc searches will fall within legal boundaries. Meanwhile, I intend to assert and help establish our profession’s rights while scrupulously respecting any restrictions a webpage author communicates via industry-standard conventions.12
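These industry-standard conventions can be honoured programmatically. The following minimal sketch (in Python, like the TextSTAT package surveyed below; the function names are my own, not part of any existing archiver) shows how a WCA-style server might check a site’s robots.txt and a page’s robots meta-tag before caching a copy:

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

def allowed_by_robots(robots_txt: str, url: str, agent: str = "WCA-bot") -> bool:
    """Check the site-wide robots.txt exclusion rules before archiving a page."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

class ArchiveMetaParser(HTMLParser):
    """Detect a page-level <meta name="robots" content="... noarchive ..."> tag."""
    def __init__(self):
        super().__init__()
        self.noarchive = False
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if (d.get("name") or "").lower() == "robots" and \
               "noarchive" in (d.get("content") or "").lower():
                self.noarchive = True

def may_archive(robots_txt: str, url: str, html: str) -> bool:
    """Cache a page only if neither exclusion mechanism forbids it."""
    if not allowed_by_robots(robots_txt, url):
        return False
    parser = ArchiveMetaParser()
    parser.feed(html)
    return not parser.noarchive
```

Under these rules a page is archived only when neither the site-wide exclusion file nor a page-level noarchive directive prohibits it – exactly the implied-consent test Google and Internet Archive apply.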
4.2.1 Challenge II: Commercial search engines
Two concerns prompt me to propose a more ambitious project as well. Firstly, the limitations imposed on queries by the most popular search engines for practical reasons reduce their usefulness for serious linguistic research. Secondly, the demands of survival in a competitive market compromise the viability and continuity of the most valuable search engines.13

4.2.2 Solution II: Search Engine for Applied Linguists (SEAL)
The observations in 4.2.1 point toward one conclusion: if language professionals want a search site that satisfies their needs for years to come, they will have to create and maintain it themselves. With this conviction I now outline a realistic path to this goal of a Search Engine for Applied Linguists (SEAL). While on sabbatical during the academic year 2004–05 I intend to start on this project and hope to report significant progress toward this goal at TaLC 2006. An ideal Web search site for language learners and scholars would have to support the major written languages and character sets, and allow expansion to any additional language. The search engine would provide sophisticated querying capabilities to ensure highly relevant results, not only matching characters, but also parts of speech and even syntactic structures. Such a site would permit searches on any meaningful combination of wildcards and regular expressions,
Facilitating the compilation and dissemination of ad-hoc web corpora 283
which would be optimized for the character set of the target language.14 It also would furnish built-in language-specific “tamecards” to match morphological and orthographic variants. SEAL should not report merely how many webpages in the corpus contain a given form, but also calculate its total frequency and dispersion as well. While mainstream search engines match at the word level, ignoring the clues to linguistic and document structure contained in punctuation and HTML layout tags, our ideal site would also take such information into account. Above all, a search site for language professionals would stress quality and relevance of search results over quantity. Real-world search sites are resource-hungry monsters. At the input end of the process, “robot” programs “crawl” or “spider” the Web, downloading webpages and adding their content to the search database. Links extracted from these documents point the way to other pages, which are spidered in turn. A “full” Web crawl involves transferring and storing many terabytes (1 terabyte ≈ 10¹² characters) of data. When the webpage database is completed, indexed and optimized, the search site calls on it to attend to many thousands of user queries simultaneously, with a tremendous flow of data in the other direction. To perform their magic, major search sites boast batteries of thousands of computers, gigabytes of bandwidth, and terabytes of storage. How can we academics hope to match their capabilities? Collectively we too have thousands of computers and gigabytes of bandwidth untapped when our learning laboratories and libraries are closed. Why not employ them to crawl and index the Web for a language-oriented search engine? 
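The unit of work behind such a volunteer crawl is simple: each idle machine fetches a page, extracts its links, and reports the new ones back for further spidering. A minimal sketch in Python (names hypothetical; a real crawler would add politeness delays, robots.txt checks and error handling):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags, as a crawler worker would."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

def crawl_step(url, html, seen):
    """One unit of distributed work: parse a fetched page, report unseen links."""
    parser = LinkExtractor(url)
    parser.feed(html)
    new_links = [u for u in parser.links if u not in seen]
    seen.update(new_links)
    return new_links
```

Everything beyond this loop – assigning URLs to idle workers and merging their results into the index – is the coordinating server’s job.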
A central server would coordinate the tasks and accumulate the results of these armies of distributed “crawlers.” The inspiration for this distributed approach comes from a project which processes signals from outer space with a screensaver running on volunteers’ desktops around the world; whenever one of the computers is idle, the program fetches chunks of data and starts crunching numbers. Researching the concept online, I discovered both a blueprint for a search engine with distributed robots spidering the Web (Melnik et al. 2001) and a Master’s thesis on Herodotus, a peer-to-peer distributed Web archiving system (Burkard 2002).15 Clearly we need not reinvent the wheel to implement SEAL, only adapt freely available open-source software to the specific requirements of our discipline.16 Once the basic search engine framework has been implemented and tested, the model could be extended to a further degree of “distributedness.” Separate servers hosted by different universities could each concentrate on a specific
language or region, or else mirror content for local users to avoid overtaxing a single server. Local linguists would provide the language-specific expertise to create tamecards for morphological and orthographic variants, optimize regular expressions for the character set, and implement part-of-speech and syntactic tagging. Due to the relatively low volume of traffic, such sites could support sophisticated processing-intensive searches which are impractical on general-purpose search engines.17 The specialized nature and audience of a linguistic search engine cum archive would limit its exposure to litigation as long as the exact legal status of such services remains unclear. Indeed, since the goal of SEAL is to build a useful representative searchable sample of online documents, not to cover the Web comprehensively, some restrictions on content would be quite tolerable.
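As a concrete illustration of such language-specific expertise, a German “tamecard” for orthographic variants might simply expand each special character into a regular-expression alternation. This is my own minimal sketch, not code from KWiCFinder or SEAL:

```python
import re

# Map each German special character to an alternation with its ASCII substitute.
VARIANTS = {"ä": "(?:ä|ae)", "ö": "(?:ö|oe)", "ü": "(?:ü|ue)", "ß": "(?:ß|ss)"}

def tamecard(word: str) -> "re.Pattern":
    """Compile a pattern matching a German word and its ASCII-substituted spellings."""
    pattern = "".join(VARIANTS.get(ch, re.escape(ch)) for ch in word.lower())
    return re.compile(pattern, re.IGNORECASE)
```

A query for Fußball would then also retrieve pages spelled Fussball, and schön would match schoen – exactly the inconsistencies in webpage authors’ use of diacritics noted earlier.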
5. Alternative solutions
This section surveys existing resources comparable to those outlined above. The intention is to be descriptive, not judgemental: while a software application’s usefulness for a specific purpose should be gauged by its suitability for one’s goals, its success must be assessed only by how well it meets its own design objectives. The list of applications derives from variants on the question “How is your software x different from y?” Since the Web Corpus Archive and Search Engine for Applied Linguists are vapourware which may never achieve all that I intend to, I acknowledge that I am comparing an ideal concept to implemented facts. Before a detailed discussion of the alternatives it is only fair to reveal my background, biases and intentions. Before programming a precursor of KWiCFinder in 1996, I spent 10 years designing, implementing and evaluating video-based multimedia courseware for foreign language instruction.18 The development cycle entailed extensive direct observation of users as well as analysis of their errors and their evaluations of the courseware. My criteria for a good user interface were heavily influenced by Alan Cooper, who preaches that software should make it impossible for users to make errors: errors are a failure of the programmer, not the user (1995, 423–40). Usability is a primary concern in all my software development projects. For instance, studies of online search behaviour such as Körber (2000), Jansen et al. (2000) and Silverstein et al. (1999), summarized in detail in Fletcher (2001a, b), reveal that most
users avoid complex queries (i.e., ones with multiple search terms joined by Boolean operators like AND, OR and NEAR), and those who do attempt them make errors up to 25% of the time, resulting in failed queries. Many features of KWiCFinder and subsequent applications address specific observed difficulties of students and other casual searchers in order to help them produce appropriate, well-formed queries. As a teacher of Spanish and German, I sought a tool for my students and myself that could handle languages with richer morphology and greater freedom in word order than English. For example, while a typical English verb has only 4–5 variants, Spanish verbs have ten times that number of distinct forms. English sentences tend to be linear, but in German, syntactic and phraseological units are often interrupted by other constituents. In both languages webpage authors use diacritics inconsistently – Spanish-language pages may neglect acute accents, German pages may substitute ae for ä etc. and ss for ß (standard usage in Switzerland). Complex queries allowing matches with the Boolean operators NEAR / BEFORE / AFTER as well as NOT, AND and OR, tamecards for generating variant forms, and flexible character matching strategies are essential to studying these languages efficiently and effectively. None of the alternatives surveyed below offers the full range of Boolean operators and complex queries supported by KWiCFinder. KWiCFinder was designed as a multipurpose application, to examine not just a short span of text for lexical or grammatical features, but also to assess document content and style when desired. As Stubbs (forthcoming) points out, the classical concordance line may provide too little context to infer the meaning and connotations of a word reliably. In a telling example he shows that the immediate context of many occurrences in the BNC of the phrase horde of appears to suggest neutral or even positive associations. 
The consistently negative connotations become obvious only after one examines a much larger amount of co-text. KWiCFinder’s options to specify any length of text to excerpt and to redisplay concordances in various layouts (paragraph and table as well as concordance line) allow the flexibility to examine either the immediate or the larger context.
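The core of any such concordancer is small. A minimal Python sketch of a KWiC routine with an adjustable context window (illustrative only; KWiCFinder’s actual excerpting is far more elaborate):

```python
import re

def kwic(text: str, node: str, width: int = 30):
    """Return (left, node, right) concordance tuples with `width` characters of co-text."""
    lines = []
    for m in re.finditer(re.escape(node), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad the co-text so the node forms an aligned column when printed.
        lines.append((left.rjust(width), m.group(), right.ljust(width)))
    return lines
```

Widening `width` from 30 characters to several hundred is precisely the move Stubbs’ horde of example calls for: it is the larger co-text, not the classical concordance line, that reveals the negative prosody.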
5.1 Web concordancer alternatives to KWiCFinder
Here “Web concordancer” is not to be understood as a Web interface to a fixed corpus like Mark Davies’ (see Davies, this volume) “Corpus del español”
(http://corpusdelespanol.org), the Virtual Language Centre’s “Web Concordancer” (http://www.edict.com.hk/concordance/), or my “Phrases in English” site (http://pie.usna.edu), none of which features language from the Web; I designate the latter “online concordancers”. Rather, the former are Web agents which query search engines and produce KWiC concordances of webpages matching one’s search terms. The first two applications considered are commercial products, but the others were developed by and for linguists. Typically the software is installed on the user’s computer (KWiCFinder, Copernic, Subject Search Spider, TextSTAT, WebKWiC), but WebCorp and WebCONC run on a Web server and are accessed via a webpage, which makes them less daunting to casual users and avoids platform compatibility issues. For concordancing these applications follow one of three general strategies: client-side, server-side and search-engine-based processing. Client-side concordancers like KWiCFinder, Subject Search Spider and TextSTAT download webpages to the user’s computer for concordancing. With a slow or expensive connection this can be a significant disadvantage, but once downloaded the texts can be saved for subsequent examination and (re)analysis off line.19 The server-side approach shifts the burden of fetching and concordancing webpages to the WebCorp or WebCONC server. This requires far less data transfer to the user’s computer, but webpages of further interest must be fetched and saved individually by the user via the browser. Depending on the number of concurrent searches, these services can be slow or even unavailable. WebCorp does offer the option to send search results by e-mail, which prevents browser timeout and saves money for those with metered Internet access. One potential limitation of server-based processing is the unclear legality of a service which modifies webpages by excerpting them; client-side processing avoids any such risk. 
Search-engine-based concordancing is the fastest approach as it relies on the search engine’s existing document indices; for details of the implementations, see the descriptions of Copernic and WebKWiC below. Copernic (http://www.copernic.com) is a commercial meta-search agent which queries multiple search engines concurrently for a single word or phrase and produces a list of matching pages sorted by “relevance”. While very fast, its concordances are too short and inconsistent to be useful for linguistic research; they appear to derive from the excerpts shown in search engine results. Copernic includes excellent support for a wide range of languages. The free basic version of the software evaluated constantly reminded me of the many additional features available by upgrading to one of several pay-in-advance variants. These more sophisticated products may offer the flexibility to do serious KWiC concordancing of online texts, and the high-end version (not evaluated) produces text summaries which could be useful for efficient preview and categorization of online content. Another commercial search product, Subject Search Spider (http://www.kryltech.com), produces KWiC concordances of the search terms in a paragraph layout. All features are available in the 30-day free trial download, including full control over the number of concordances per document and the amount of context to show. SSSpider supports 34 languages, virtually all those of Europe, in addition to Arabic, Chinese, Hebrew, Japanese and Korean, and can search usenet (newsgroups) as well as the Web.20 As with Copernic there are companion text summarization and document management suites available. One free product, SSServer, is deployed on a Web server, where it could easily be customized into an online concordancer for any of the languages supported. WebCorp (http://www.webcorp.org.uk; Morley, Renouf and Kehoe 2003) from the University of Liverpool’s Research and Development Unit for English Studies has regularly added new features since its launch in 2000. While it offers but a single field for inputting search terms, its support for wildcards and “patterns” (similar to KWiCFinder’s tamecards) gives it flexibility in matching variant forms, and queries can be submitted to half a dozen different search engines to improve their yield. Up to 50 words of preceding and following context are shown, and options allow displaying any number of concordances per document (up to 200 webpages maximum are analyzed). WebCorp’s concordances give access to additional data analysis (e.g., type / token count, lists of word forms), and other tools are available on the site. Online newspapers can be searched by domain (e.g., UK broadsheet, UK tabloid, US), and searches can be limited to a specific Open Directory content domain. 
With the numerous choices WebCorp offers, its failure to provide a document language option seems inexplicable, since almost every search engine supports it. The user interface would benefit from client-side checking for meaningful, well-formed queries before submission to WebCorp; mistakes in a query can lead to long waits with no results and no explanation. Zoni (2003) describes WebCorp in greater detail and compares it with KWiCFinder. Matthias Hüning’s WebCONC (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi), another server-based Web concordancer, performs searches on Google and generates KWiC concordances of the search phrase
in paragraph layout. One can also copy and paste text for concordancing onto the search page. Options are minimal: target language, amount of context (maximum of 50 characters before / after the node!), and number of webpages to process (50 maximum, in practice fewer if some pages in the search results are inaccessible or do not match exactly). There is no provision for wildcards (not supported by Google) or pattern matching. Matches are literal, and all occurrences of a search string are highlighted in the results, even as a substring of a longer word. A punctuation mark after a word form is matched too, which can be useful, for example to find clause-final verb forms in German. The server can be slow and may even time out without producing any concordances. WebCONC could be far more useful if it offered more options for search and output format. Of greater potential interest is the author’s TextSTAT package (http://www.niederlandistik.fu-berlin.de/textstat/software-en.html), which can download and concordance both webpages and usenet postings. Programmed in Python, it runs on any standard platform (Windows, Macintosh, Unix / Linux). Hüning’s user license permits modification and redistribution of the software code, making TextSTAT an instructive example and valuable point of departure for a customized Web concordancer. WebKWiC (http://kwicfinder.com/WebKWiC/; Fletcher 2001 a, b) is a browser-based application that exploits a feature of Google’s search results: click on the “cache” link to see a version of a webpage from Google’s archives with the search terms highlighted. WebKWiC queries Google, parses the search results, fetches a page from Google’s cache, encodes the highlighted search terms to permit navigation from one instance of the search terms to the next, and displays the page in a new browser window. 
This “parasitic” approach with JavaScript and DHTML builds on core functionalities of Internet Explorer, works on multiple platforms, and supports any language known to Google. It could be extended to produce and display KWiC excerpts from webpages, or to download and save them in HTML, text or concordance format. A small set of webpages and scripts (70KB installed), WebKWiC takes full advantage of all Google’s search options.
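The whole-word and punctuation behaviour described above for WebCONC is easy to pin down with regular expressions. A small Python sketch (my own illustration) contrasts literal substring matching with word-boundary matching, and shows how deliberately matching a following punctuation mark can locate clause-final verb forms in German:

```python
import re

def word_matches(text, form):
    """Match `form` only as a whole word, unlike literal substring highlighting."""
    return re.findall(r"\b" + re.escape(form) + r"\b", text)

def clause_final(text, form):
    """Find occurrences of `form` immediately followed by clause punctuation."""
    return re.findall(r"\b" + re.escape(form) + r"(?=[.,;:!?])", text)
```

With input like "Obwohl er heute kommt, bleibt sie. Er kommt morgen an.", only the first kommt is clause-final, and a search for komm correctly matches nothing as a whole word.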
5.2 Alternative to the Web Corpus Archive
The Internet Archive (http://web.archive.org) “Wayback Machine” preserves many (but by no means all) webpages back to 1996. Archived sites are represented by a selection of their pages and graphics in snapshots taken every
few months. For example, a visit to the Archive reminded me that KWiCFinder was not publicly downloadable until November 1999, and it helped me reconstruct the introduction and evolution of WebCorp. The archive is not searchable by text, only by URL. The ability to step back in time, for example, to retrieve a webpage cited in this paper which has since disappeared from the Web, is complemented by comparison of various versions of the same webpage, with the differences highlighted. In contrast to the Internet Archive, the WCA proposed here will not aim to preserve the state of the entire Web, only to ensure immediate text-searchable access to pages which support either a user-uploaded “kibbitzoid” search analysis or documents indexed in its Search Engine for Applied Linguists.
5.3 Alternatives to the Search Engine for Applied Linguists
GlossaNet (http://glossa.fltr.ucl.ac.be) analyzes text from over 100 newspapers in eleven languages, providing both more and less than a linguistic search engine as I conceive it. Originally a monitoring tool to track emerging lexical developments (Fairon and Courtois 2000), GlossaNet now offers both “instant search” of the current day’s newspapers with results in a webpage and “subscription search” (after free registration), which re-queries each daily crop of newspapers and e-mails the results at regular intervals. Concordance lines display 40 characters to the left and right of the node. Clicking on the node displays the original newspaper article with the search terms highlighted, but this feature may be unavailable: an error message warns that most articles are accessible only on the day of publication. Queries can be formulated as any combination of word form, lemma, “regular expression” (less than the name suggests), or word class and morphology, or else as a Unitex “finite state graph” (not documented on the site; manual in French and Portuguese at http://www-igm.univ-mlv.fr/~unitex/). GlossaNet has its limitations: it is restricted to a single genre, newspaper texts, and to the rather small pool (in comparison to the Web) of one day’s newspapers; searches cannot be replicated on another day, and results may not be verifiable in the context of the original article; syntactic analysis and lemmatization can be faulty; search results do not show the grammatical annotation, so the users cannot learn to tailor their queries to the idiosyncrasies of the analysis engine; documentation is minimal. Clearly it has strengths as well compared to KWiCFinder or WebCorp: the ability to search by syntactic or morphological category can eliminate large
numbers of irrelevant hits; “instant search” delivers results almost immediately; “subscription search” permits monitoring of linguistic developments in manageable increments; newspaper texts are generally reliable, authoritative linguistic sources. The Linguist’s Search Engine (LSE, http://lse.umiacs.umd.edu:8080) arrived on the scene in January 2004 as a tool for theoretical linguists to test their intuitions by “treating the Web as a searchable linguistically annotated corpus” (Resnik and Elkiss 2004). At its launch LSE had a collection of about 3 million English sentences, a number bound to increase rapidly. The source of these Web documents is the Internet Archive, which ensures their continued availability. New users will likely start with the powerful “Query by Example” feature: enter a sentence or fragment to match, then click “Parse” to generate both a tree and a bracketed representation of the example sentence. LSE uses a non-controversial Penn Treebank-style syntactic constituency annotation readily accessible to most linguists. Queries can be refined in either the graphical tree or the bracketed text representation. For example, I entered “He is not to be trusted”, which yielded this parse in bracketed notation: (S1 (S (NP (PRP He)) (VP (AUX is) (S (RB not) (VP (TO to) (VP (AUX be) (VP (VBN trusted)))))))). After being made more general in the tree editor, the bracketed query (S1 (S NP (VP (AUX be) (S (RB not) (VP (TO to) (VP (AUX be) (VP VBN)))))))
matched 76 sentences with comparable constructions such as “Any statements made concerning the utility of the Program are not to be construed as express or implied warranties.” and “In clearness it is not to be compared to it.” LSE’s concordances can be displayed or downloaded in CSV format for importation into a database or spreadsheet, and the original webpages can be retrieved from the Internet Archive for examination. While such linguistic search of a precompiled Web corpus via an intuitive user interface is impressive, the LSE really advances Web searching by exploiting this functionality to locate examples matching lexical and syntactic criteria with the AltaVista search engine. The user submits a query to AltaVista and LSE fetches the corresponding webpages, parses them, and filters out the ones that fail to meet the user’s structural criteria. Retrieval and analysis are surprisingly rapid. Queries, their outputs, and the original webpages can be saved in personal collections for later analysis and refinement. The tools can also analyze corpora uploaded from the user’s computer.
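The bracketed notation above is straightforward to process mechanically, which is what makes such structural queries feasible at all. The following short Python sketch (my own, not LSE’s implementation) parses Penn Treebank-style brackets into nested lists and recovers the terminal words:

```python
import re

def parse_brackets(s):
    """Parse Penn Treebank-style bracketed notation into nested [label, children...] lists."""
    tokens = re.findall(r"\(|\)|[^()\s]+", s)
    def build(i):
        # tokens[i] is "(", tokens[i + 1] is the constituent label.
        node = [tokens[i + 1]]
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = build(i)
                node.append(child)
            else:
                node.append(tokens[i])
                i += 1
        return node, i + 1
    tree, _ = build(0)
    return tree

def leaves(t):
    """Collect the terminal words of a parsed tree, left to right."""
    if isinstance(t, str):
        return [t]
    out = []
    for child in t[1:]:
        out.extend(leaves(child))
    return out
```

Running parse_brackets over the parse of “He is not to be trusted” yields a nested structure whose leaves are the sentence’s words; structural matching then reduces to comparing such trees against a (partially underspecified) query tree.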
Despite the LSE’s impressive power and usability, it does not fulfil all the needs SEAL intends to address. Above all it supports only English, and there are no plans to add other languages except possibly in parallel corpora searchable via the annotation of the corresponding English passages (Resnik, personal communication), while SEAL will start with the major European languages and establish a transferable model for branching out into other language families. LSE is aimed at theoretical linguists seeking to test syntactic hypotheses who are sufficiently motivated to master a powerful but complex system. In contrast, SEAL’s target audience is more practically oriented, including language professionals such as instructors, investigators and developers of teaching materials, translators, lexicographers, literary scholars, and advanced foreign language learners as well as linguists. Many in these groups could be overwhelmed by a resource that requires too much linguistic or technical sophistication at the outset. SEAL will offer tools to leverage users’ familiarity with popular search engines and nurture them along the path from word and phrase search to queries that match specific content domains, phrase structures and sentence patterns as well. As an incrementally implemented companion to the Web Corpus Archive, it will benefit both from analysis of search behaviors and use patterns and from direct user feedback. After comparing future plans, Resnik and I have determined that LSE and SEAL will complement rather than compete with each other.
6. Web search resources in language teaching and learning
Suggestions for language teachers and learners to use these tools are surveyed here. Specific examples of instructor-developed learning activities focussing on the levels of word, phrase and grammar are based on my experience teaching beginning and intermediate German and Spanish. Open-ended learner-directed techniques to develop critical searching skills and to encourage writing by example are also described. While some of these tasks could be performed without the specialized software described here, these tools make the process more effective and familiarize the students with valuable research tools and techniques applicable to other disciplines as well. Since 1996 the Grammar Safari site (http://www.iei.uiuc.edu/web.pages/grammarsafari.html) has been a popular resource on the Web, linked to and expanded on by over 2000 other sites. It offers tutorials and a set of assignments for learners of English to hunt for and analyze grammatical and rhetorical structures in online documents. The technique entails querying a search engine, retrieving webpages individually, and finding the desired forms on the page. One of the Web concordancers surveyed above could easily automate the mechanics of such activities, leaving more time for analysis and discussion of the examples. Familiarizing learners with an efficient approach to a beneficial but tedious task will encourage them to apply it even when not directed to do so. Grammatical and lexical exploration can also be based on instructor-prepared mini-corpora. KWiCFinder allows search results to be saved as webpages with self-contained interactive concordance tools which can be used profitably with students. For example, to contrast the German passive auxiliary wurde with the subjunctive auxiliary würde, I assign small groups of students (2–3 per computer) to explore, then describe the grammatical context (e.g., they co-occur with past participle and infinitive respectively) to the class. As instructor I clarify the meaning and use of the structures by translating representative examples. These few minutes spent on “grammar discovery” prepare the students to understand and retain the textbook explanation better. Recently an in-class KWiCFinder search demonstrated to my students how actual usage can differ from textbook prescription. In a geographical survey of the German-speaking countries I explained that the usual adjective for “Swiss” in attributive position is the indeclinable Schweizer; a student pointed out that our textbook listed only schweizerisch. A pair of KWiCFinder searches rapidly clarified the situation: while forms of the latter typically modified the names of organizations and government institutions, the former was obviously both far more frequent and more general in use. 
Students can be assigned similar ad-hoc discovery activities in response to recurrent errors or to supplement the textbook. For example, it is instructive for a learner studying French prepositions to discover that merci à / pour parallel English “thanks to / for”, while merci de + infinitive corresponds to English “thanks for” + -ing. A search of the BNC illustrates the advantage of a bottomless corpus like the Web: this English construction occurs only 53 times in this huge corpus, and could well be lacking entirely in a smaller one. Ideally, after assigned tasks such as this, learners will develop the habit of formulating and verifying usage by example rather than resorting to Babelfish or another online translation engine. Frand (2000) summarizes what he calls the “mindset” of Information-Age students. Their behaviour with an unfamiliar website or software package typically exhibits more action than reflection; learning by trial and error replaces systematic preparation and exploration (“Nintendo over logic”). To encourage development of “premeditated” searching habits, I assign students a written pre-search exercise before they undertake open-ended Web-based research for a report or essay. They jot down variants of key words and phrases likely to occur on webpages in contexts of interest for their topic as well as additional terms that can help restrict search results to relevant webpages.21 This written exercise forces thought to precede action and allows the group to brainstorm about additional possible search terms and variants. Then they search for and evaluate a number of webpages in writing with a checklist based on Barton 2004. Finally, they re-search the sites deemed most useful in order to find additional appropriate content. Without these paper-and-pencil exercises, students tend to choose from the first few hits for whatever search term occurs to them. A concordancing search agent greatly accelerates evaluating webpages for content, reliability, and linguistic level. One venerable stylistic technique I attempt to pass on to my students is imitatio (not plagiarism!), the study and emulation of exemplary (or at least native speaker) texts in creative work. In major languages the Web is a generous source of texts on almost any topic. After locating appropriate webpages, advanced learners can immerse themselves in the style and language of the content domain they are dealing with before preparing compositions or presentations. This concept parallels translation techniques outlined by Zanettin (2001) based on ad-hoc corpora from the Web. It is a powerful life-long foreign-language communication strategy which builds knowledge as well as linguistic skills. When the WCA and SEAL and comparable resources become a reality, they will further accelerate the tasks surveyed here. 
Response time for queries against a single archive will be far faster than fetching and excerpting documents from around the Web. Searching a large body of selected documents by content domain and / or grammatical structure will yield a higher percentage of useful hits than the current query-by-word-form approach. User-submitted kibbitzers will supply ready illustration and explanation for linguistic questions and problems (e.g., the wurde / würde and Schweizer / schweizerisch examples above).22 Finally, the linguistic annotation provided by SEAL will help motivated students gain greater insight into grammar. Admittedly, most of the techniques discussed here are feasible with static corpora as well. By the same token, most applications of corpus techniques to
language learning (surveyed in Lamy and Mortensen 2000) could be adapted to Web concordancing instead. The size and comprehensive coverage of the Web are powerful arguments for this approach, as is the availability of free tools with a consistent, adaptable user interface for exploring everything from linguistic form to document content. If we can acquaint our students with responsible online research techniques and instil in them a healthy dose of scepticism toward their preferred information source, we will have accomplished far more than teaching them a language.23
7. Caffè e grappa oppure limoncello

In this paper we have considered a wide range of challenges and solutions to exploiting the Web as a (source of) linguistic corpus. Such dense, heavy fare leaves us much to digest. Let’s linger over caffè and grappa or limoncello to discuss these ideas – after all, this is not just a declaration of intent, but an invitation to a dialogue. These proposals outline an incremental approach to implementing the solutions which will yield useful results at every milestone along the way – searchers with an immediate information need should not have to delay gratification as a programmer must. The Web Corpus Archive proposed here will give direct search results, if not the first time, then at least when a query is submitted on subsequent occasions. Posted KWiCFinder search report kibbitzers can exemplify techniques for finding the forms or information one requires, much as successful recipes from a pot-luck supper continue to enrich the table of those who adopt them. Building on the infrastructure of this archive, the Search Engine for Applied Linguists sketched here will afford rapid targeted access to an ever-expanding subset of the Web. In the process, all three information-gathering strategies will be served: hunters will profit from a precision search tool, grazers will be able to locate rich pastures of related documents, and browsers will enjoy an increased likelihood of serendipitous finds. As other linguists join in the proposed cooperative effort, the search engine’s scope can be extended well beyond European languages. Initially, outside funding may be required to establish the infrastructure, but ultimately this plan will be sustainable with resources from the participating institutions. With time, the incomparable freshness, abundant variety and comprehensive coverage added by this Web corpus-cum-search engine will make
Facilitating the compilation and dissemination of ad-hoc web corpora 295
it an indispensable complement to the more reliable canned corpora for a “pick’n’mix” approach. Linguists and language learners alike will benefit from examples which clarify grammatical, lexical or cultural points. Foreign language instructors and translators will find a concentrated store of useful texts for instructional materials and translation by example. New software tools will integrate the Web and the desktop into a powerful exploratory environment. The steps outlined here will lead toward fulfilling the Web’s promise as a linguistic and cultural resource.
Notes

1. Research for this paper was supported in part by the Naval Academy Research Council.

2. Ad-hoc corpora – also designated as “disposable” or “do-it-yourself” corpora – are compiled to meet a specific information need and typically abandoned once that need has been met (see e.g., Varantola 2003 and Zanettin 2001).

3. Figures from September 2003 (http://www.global-reach.biz/globstats/, visited 26 February 2004), which estimates the online population of native speakers of English and of other European languages at 35% each, while speakers of other languages total about 30%. These numbers contrast sharply with the late 1990s, when English speakers comprised over three-quarters of the world’s online population.

4. The number of linguists exploiting the Web as a linguistic corpus (beyond the casual “let’s see how many hits I can find for this on Google”) is growing. Kilgarriff and Grefenstette (2003) survey numerous papers and projects in this field. Other representative examples of applying Web data to specific linguistic problems include Banko and Brill (2001), Grefenstette (1999), and Volk (2002). Brekke (2002) and Fletcher (2001a, b) discuss the pitfalls and limitations of the Web as a corpus. Finally, researchers like De Schryver (2002), Ghani et al. (2001) and Scannell (2004) demonstrate the importance of the Web for compiling corpora of minority languages for which other electronic and even print sources are severely limited.

5. Knut Hofland’s Norwegian newspaper corpus (Hofland 2002) follows a “grazing” strategy to “harvest” articles daily from several newspapers. Using material from a limited number of sites offers several advantages: permission and cooperation for use of texts can be secured; recurring page layouts help distinguish novel content from “boilerplate” materials automatically; the texts’ genre and content domain are predictable, and their authorship, representativeness and reliability can be established.
Similarly, GlossaNet (http://glossa.fltr.ucl.ac.be), described in greater detail below, monitors and analyzes text from over 100 newspapers in nine languages, but does not archive them for public access.

6. “Paid positioning” and other “revenue-stream enhancers” may put advertisers’ webpages at the top of the search results. The link popularity ranking strategy exemplified by Google
– webpages to which more other sites link are ranked before relatively unknown pages – can mask much of the Web’s diversity by favouring well-known sites.

7. KWiCFinder is available free online from http://KWiCFinder.com. First demonstrated at CALICO 1999 and available online since later that year, it is described in far greater detail in Fletcher 2001b.

8. For a discussion of features of the interactive search reports, refer to http://kwicfinder.com/KWiCFinderKWiCFeatures.html and http://kwicfinder.com/KWiCFinderReportFormats.html.

9. I am indebted to Michael Friedbichler of the University of Innsbruck for this observation and for fruitful discussions of various issues from the user’s perspective.

10. In an interview (Koman 2002) Internet Archive founder Brewster Kahle brushes aside a question about copyright, insists that it is legal and implies that the Internet Archive had never had problems with any copyright holder (subsequent lawsuits nullify that implied claim). The Archive’s terms of use and copyright policy also assert the legality of archiving online materials without prior permission (http://archive.org/about/terms.php [visited 8 October 2004]). Apparently such assertions are based on Title 17 Chapter 5 Section 512 of the US Digital Millennium Copyright Act (DMCA, http://www4.law.cornell.edu/uscode/17/512.html [visited 28 February 2004]), which authorizes providers of online services to cache and retransmit online content without permission from the copyright owner under specific conditions, which include publishing “takedown” procedures for removing content when notified by the owner and leaving the original content unmodified. (Extensive discussion and documentation of these and related issues are found on the websites Chilling Effects http://www.chillingeffects.org/dmca512/ and the Electronic Frontier Foundation http://www.eff.org.)
Excerpting KWiC concordances from a webpage clearly constitutes modification, as does highlighting of search terms in a cached version, both services provided by Google and other search engines. Two legal experts I have consulted who requested anonymity find no authorization in US copyright law for these accepted practices, but case law seems to have established and reinforced their legitimacy. Obviously the legal status of these practices under US law has little bearing on the situation in other countries, whose statutes and interpretation may be more or less restrictive.

11. In the United States, a KWiC concordance of webpages appears to fall under the fair-use provisions of copyright law as well. Crews (2000) and Hilton (2001) both argue for more liberal interpretations of this law than that found in the typical academic institution’s copyright policy. I am seeking an official ruling from my institution’s legal staff before establishing the WCA on servers at USNA. If our lawyers do not authorize exposing the Academy to possible risk in this gray area, I can implement the WCA on my KWiCFinder.com website. As a “company” KWiCFinder has neither income nor assets, making it an unlikely target for litigation.

12. An approach proposed by Kilgarriff (2001), the Distributed Data Collection Initiative, would create a virtual online corpus: a classified collection of links to relevant webpages would compile subcorpora from webpages retrieved from their home sites on demand and
serve them to users; as pages disappear they would be replaced by others with comparable content. This alternative avoids liability for caching implicitly copyrighted documents, but it does not provide an instantly searchable online corpus, nor does it guarantee availability of the original data for verification and further analysis.

13. When this was written, AltaVista was the only large-scale international search engine that supported wildcards and the complex queries necessary for efficient searching. Originally a technology showcase for Digital Equipment Corporation, it passed from one corporation to another over the years. In March 2004, the latest owner Yahoo dropped support for wildcards on the AltaVista site and apparently ceased maintaining a separate database for AltaVista. These developments reinforce my point that linguists must establish their own search engine to ensure that their needs will be met.

14. “Regular expressions” are powerful cousins of wildcards which allow precise matching of complex patterns of characters. Unfortunately most implementations are Anglo-centric and thus ignore the fact that characters with diacritics can occur within word boundaries. Regular expression pattern-matching engines could be optimized for specific languages by matching only those characters expected to occur in a given language.

15. Links to these and other resources related to the concepts proposed here are on http://kwicfinder.com/RelatedLinks.html. In early 2003, LookSmart, a large commercial search engine provider, acquired Grub, a distributed search-engine crawling system. Now almost 21,000 volunteers have a Grub client screensaver which retrieves and analyzes webpages, thus helping LookSmart to increase the coverage and maintain the freshness of its databases. (http://looksmart.com [visited 19 June 2004]; http://grub.org [visited 19 June 2004]; http://wisenut.com [visited 19 June 2004]).

16.
Open-source software is developed cooperatively and distributed both freely and free of charge. Specific open-source technologies proposed here are the “LAMP platform”: Linux operating system, Apache web server, MySQL database, PHP and / or Perl scripting, all of which cost nothing and run competently on standard desktop PCs costing at most a few hundred dollars. Storage costs have dropped well below 50 cents a gigabyte and are set to plummet as new terabyte technologies are introduced in a few years. The expertise required to develop and maintain a search site encompasses Web protocols, database programming, and server- and client-side scripting, all skills typically available at universities.

17. For example, due to the high processing requirements, Google – currently the most popular search engine in the world by far – does not support any wildcards, and even AltaVista restricts them severely.

18. My experience negotiating rights to incorporate authentic video into multimedia courseware explains my hypersensitivity to copyright issues.

19. KWiCFinder provides the option to save the Web document files automatically in original HTML and / or text format for later analysis by a full-featured concordancer like WordSmith or MonoConc.

20. SSSpider’s heuristics for determining the language of the source text are not entirely reliable: a search for pages in Afrikaans returned many Dutch pages; after switching to
a search term that does not exist in Dutch, I got pages in French and Romanian as well as Afrikaans.

21. KWiCFinder’s inclusion and exclusion criteria are terms which help narrow the search but are not concordanced in the search results. For example, in a search for TaLC, words like “corpus”, “corpora”, “language” and “linguistics” are good discriminators of relevant texts, while “powder” and “talcum” are likely to appear on irrelevant webpages.

22. Perhaps Philip King’s term “kibbitzoids”, premiered at TaLC 5 in Bertinoro (2002), is more appropriate, as these are not strictly speaking what Tim Johns means by kibbitzers.

23. As Frand (2000:16) puts it, “Unfortunately, many of our students do believe that everything they need to know is on the Web and that it’s all free.”
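The pitfall described in note 14 is easy to demonstrate: an ASCII-only character class breaks words at every diacritic, while a class tuned to a given language keeps them whole. A sketch in Python; the sample sentence is illustrative.

```python
import re

text = "Die Universität prüft die schweizerische Öffentlichkeit."

# Anglo-centric pattern: a letter with a diacritic ends the "word".
ascii_words = re.findall(r"[A-Za-z]+", text)
print(ascii_words)   # "Universität" is split into "Universit" and "t"

# Character class tuned to German, as note 14 suggests:
# match exactly the letters expected in the language.
german_words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
print(german_words)  # "Universität", "prüft", "Öffentlichkeit" stay whole
```

Per-language classes like this are one way a pattern-matching engine could be "optimized for specific languages" in the sense of note 14.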
References

Aston, G. 2002. “The learner as corpus designer”. In Teaching and Learning by Doing Corpus Analysis: Proceedings of the Fourth International Conference on Teaching and Language Corpora, Graz 19–24 July, 2000 [Series Language and Computers, Vol. 42], B. Kettemann and G. Marko (eds), 9–25. Amsterdam: Rodopi.
Banko, M. and Brill, E. 2001. “Scaling to very very large corpora for natural language disambiguation”. ACL–01. Online: http://research.microsoft.com/~brill/Pubs/ACL2001.pdf [visited 2.3.2004].
Barton, J. 2004. “Evaluating web pages: Techniques to apply and questions to ask”. Online: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html [visited 1.3.2004]. Evaluation form online: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/EvalForm.pdf [visited 1.3.2004].
Brekke, M. 2000. “From the BNC toward the cybercorpus: A quantum leap into chaos?” In Corpora Galore: Analyses and Techniques in Describing English: Papers from the Nineteenth International Conference on English Language Research on Computerised Corpora (ICAME 1998), Kirk, J. M. (ed.), 227–247. Amsterdam and Atlanta: Rodopi.
Burkard, T. 2002. “Herodotus: A peer-to-peer web archival system”. Cambridge, MA, Massachusetts Institute of Technology Master’s Thesis. Online: http://www.pdos.lcs.mit.edu/papers/chord:tburkard-meng.pdf [visited 8.10.2004].
Burnard, L. and McEnery, T. (eds). 2000. Rethinking Language Pedagogy from a Corpus Perspective: Papers from the Third International Conference on Teaching and Language Corpora. Frankfurt am Main: Peter Lang.
Cooper, A. 1995. About Face: The Essentials of User Interface Design. Foster City, CA: IDG Books.
Crews, K. D. 2000. “Fair use: Overview and meaning for higher education”. Online: http://www.iupui.edu/~copyinfo/highered2000.html [visited 8.10.2002].
De Schryver, G. M. 2002. “Web for / as corpus: A perspective for the African languages”. Nordic Journal of African Studies 11(2):266–282.
Online: http://tshwanedje.com/publications/webtocorpus.pdf [visited 26.2.2004].
Fairon, C. and Courtois, B. 2000. “Les corpus dynamiques et GlossaNet: Extension de la
couverture lexicale des dictionnaires électroniques anglais”. JADT 2000: 5es Journées Internationales d’Analyse Statistique des Données Textuelles. Online: http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2000/pdf/52/52.pdf [visited 18.2.2003].
Fletcher, W. H. 2001a. “Re-searching the web for language professionals”. CALICO, University of Central Florida, Orlando, FL, 15–17 March 2001. PowerPoint online: http://www.kwicfinder.com/Calico2001.pps [visited 2.3.2004].
Fletcher, W. H. 2001b. “Concordancing the web with KWiCFinder”. American Association for Applied Corpus Linguistics, Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23–25 March 2001. Online: http://kwicfinder.com/FletcherCLLT2001.pdf [visited 8.10.2004].
Fletcher, W. H. 2002. “Making the web more useful as a source for linguistic corpora”. American Association for Applied Corpus Linguistics Symposium, Indianapolis, IN, 1–3 November 2002. Online: http://kwicfinder.com/FletcherAAACL2002.pdf [visited 25.8.2003].
Fletcher, W. H. 2004. “Phrases in English”. Online database for the study of English words and phrases at http://pie.usna.edu [visited 26.2.2004].
Frand, J. 2000. “The Information-Age mindset: Changes in students and implications for higher education”. EDUCAUSE Review 35(5):14–24. Online: http://www.educause.edu/pub/er/erm00/articles005/erm0051.pdf [visited 29.2.2004].
Ghani, R., Jones, R. and Mladenic, D. 2001. “Using the web to create minority language corpora”. 10th International Conference on Information and Knowledge Management (CIKM–2001). Online: http://www.cs.cmu.edu/~TextLearning/corpusbuilder/papers/cikm2001.pdf [visited 7.7.2004].
Grefenstette, G. 1999. “The World Wide Web as a resource for example-based machine translation tasks”. Online: http://www.xrce.xerox.com/research/mltt/publications/Documents/P49030/content/gg_aslib.pdf [visited 12.10.2001].
Hilton, J. 2001. “Copyright assumptions and challenges”. EDUCAUSE Review 36(6):48–55.
Online: http://www.educause.edu/ir/library/pdf/erm0163.pdf [visited 8.10.2002].
Hofland, K. 2002. “Et Web-basert aviskorpus”. Online: http://www.hit.uib.no/aviskorpus/ [visited 8.10.2004].
Jansen, B. J., Spink, A. and Saracevic, T. 2000. “Real life, real users, and real needs: A study and analysis of user queries on the web”. Information Processing and Management 36(2):207–227.
Johns, T. F. 2001. “Modifying the paradigm”. Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23–25 March 2001.
Kilgarriff, A. 2001. “Web as corpus”. In Proceedings of the Corpus Linguistics 2001 conference, UCREL Technical Papers: 13, P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds), 342–344. Lancaster: Lancaster University. Online: http://www.itri.bton.ac.uk/~Adam.Kilgarriff/PAPERS/corpling.txt [visited 8.10.2004].
Kilgarriff, A. and Grefenstette, G. 2003. “Introduction to the special issue on the web as corpus”. Computational Linguistics 29(3):333–347. Online: http://www-mitpress.mit.edu/journals/pdf/coli_29_3_333_0.pdf [visited 7.1.2004].
Körber, S. 2000. “Suchmuster erfahrener und unerfahrener Suchmaschinennutzer im
deutschsprachigen World Wide Web: ein Experiment”. Unpublished master’s thesis, Westfälische Wilhelms-Universität Münster, Germany. Online: http://kommunix.uni-muenster.de/IfK/examen/koerber/suchmuster.pdf [visited 9.1.2004].
Koman, R. 2002. “How the Wayback Machine works”. Online: http://www.xml.com/pub/a/ws/2002/01/18/brewster.html [visited 8.10.2004].
Lamy, M. N. and Mortensen, H. J. K. 2000. “ICT4LT Module 2.4. Using concordance programs in the modern foreign languages classroom”. Information and Communications Technology for Language Teachers. Online: http://www.ict4lt.org/en/en_mod2-4.htm [visited 1.3.2004].
Melnik, S., Raghavan, S., Yang, B. and García-Molina, H. 2001. “Building a distributed full-text index for the web”. WWW10, 2–5 May 2001, Hong Kong. Online: http://www10.org/cdrom/papers/275/index.html [visited 8.10.2004].
Morley, B., Renouf, A. and Kehoe, A. 2003. “Linguistic research with XML/RDF-aware WebCorp tool”. Online: http://www2003.org/cdrom/papers/poster/p005/p5-morley.html [visited 19.2.2004].
Pearson, J. 2000. “Surfing the Internet: Teaching students to choose their texts wisely”. In Burnard and McEnery, 235–239.
Resnik, P. and Elkiss, A. 2004. “The Linguist’s Search Engine: Getting started guide”. Online: http://lse.umiacs.umd.edu:8080/lse_guide.html [visited 23.1.2004].
Scannell, K. P. 2004. “Corpus building for minority languages”. Online: http://borel.slu.edu/crubadan/ [visited 19.3.2004].
Silverstein, C., Henzinger, M., Marais, H. and Moricz, M. 1999. “Analysis of a very large web search engine query log”. SIGIR Forum 33(3). Online: http://www.acm.org/sigir/forum/F99/Silverstein.pdf [visited 26.2.2004].
Smarr, J. 2002. “GoogleLing: The web as a linguistic corpus”. Online: http://www.stanford.edu/class/cs276a/projects/reports/jsmarr-grow.pdf [visited 7.7.2004].
Stubbs, M. forthcoming. “Inferring meaning: Text, technology and questions of induction”. In Aspects of Automatic Text Analysis, R. Köhler and A. Mehler (eds).
Heidelberg: Physica-Verlag.
Varantola, K. 2003. “Translators and disposable corpora”. In Corpora in Translator Education, F. Zanettin, S. Bernardini and D. Stewart (eds), 55–70. Manchester: St Jerome.
Volk, M. 2002. “Using the web as corpus for linguistic research”. In Tähendusepüüdja: Catcher of the Meaning. A Festschrift for Professor Haldur Õim [Publications of the Department of General Linguistics 3], Pajusalu, R. and Hennoste, T. (eds). Tartu: University of Tartu. Online: http://www.ifi.unizh.ch/cl/volk/papers/Oim_Festschrift_2002.pdf [visited 7.7.2004].
Zanettin, F. 2001. “DIY corpora: The WWW and the translator”. In Training the Language Services Provider for the New Millennium, B. Maia, H. Haller and M. Ulrych (eds), 239–248. Porto: Facultade de Letras, Universidade do Porto. Online: http://www.federicozanettin.net/DIYcorpora.htm [visited 16.10.2004].
Zoni, E. 2003. “e-Mining – Software per concordanze online”. Online: http://applicata.clifo.unibo.it/risorse_online/e-mining/e-Mining_concordanze_online.htm [visited 15.1.2004].
Index

A Aarts 67, 78, 80, 82 ad-hoc corpus see corpus AltaVista 277, 278, 281, 297 Altenberg 116, 208 Apple Pie Parser (APP) see NLP tools Appraisal 5, 125, 126, 128, 130, 132 authenticity 11, 14, 152, 200 autonomy learner 247-249, 251 awareness 171, 248 cultural/social 172 discourse 170, 172, 182 language 170, 172, 174, 176, 252 literary 172 methodological/metatheoretical 170, 172, 185 B Bank of English 1, 33 Benjamin 259, 261, 267 Bernardini 234 Biber 36 Bilogarithmic Type-Token Ratio (BTTR) see statistical measures Boolean operators 285 Borin 3 Brill tagger see NLP tools British National Corpus (BNC) 1, 4, 7, 12, 13, 33, 52, 53, 56, 113, 127, 153, 155, 158-162, 281, 285, 292 British National Corpus (BNC) Sampler 3, 4, 67, 72, 81, 89, 92, 93 Brown 1 Burston 234 Butt 259, 261, 267 C Cantos-Gómez 9 characterization 175 ChaSen see NLP tools child language 138, 140
Chinese/English 127 Chipere 5 chi-square see statistical tests Chomsky 22, 39, 47 classroom concordancing see concordancing CLAWS tagger see NLP tools cluster analysis see statistical tests Cobuild 6 cohesion 21, 34-37, 252 colligation 27, 28, 30, 31, 37 collocation 23, 28, 31, 37, 161 Comlex Lexicon 55-58, 60, 61 Compara 8, 213, 223, 225, 228 comparable corpus see corpus Computer-Assisted Language Learning (CALL) 71 concordancing 235, 244, 286-289, 292, 293 classroom use 216, 233, 234 parallel 213, 214, 217, 226, 227 self-access 216 strategies 239, 243 tool 280, 285 connotation 129, 130 constituency 252 construction grammar 22 Contrastive Analysis (CA) 70, 216 Contrastive Interlanguage Analysis (CIA) 69, 217 Cook 17, 154 Copernic see web concordancer corpus ad-hoc 274, 279, 295 apprentice writing 125-127, 130-133, 137, 143 children’s writing 137, 143 ELT 18, 57 learner 69, 78, 109, 119 interlanguage 46 L1 reference 46 target 46
literary 173, 184, 191 parallel 183, 218, 223, 227 professional writing 125-127, 130-133 reference 1, 14 representativeness 81, 82 spoken 196, 205, 208 web archive (WCA) 273, 280, 281, 284, 288, 289, 291, 293, 294, 296 comparable 83 Corpus de Referencia del Español Actual (CREA) 259, 261-264, 266, 267 Corpus del Español 9, 259, 262-264, 266, 285 Corpus Diacrónico del Español (CORDE) 261-264, 266, 267 Corrected Type-Token Ratio (CTTR) see statistical measures Critical Discourse Analysis (CDA) 174 cultural/social awareness see awareness D D measure see statistical measures Dagneaux 68 data-driven learning (DDL) 9, 16, 242, 248 Davies 9 demonstratives 89-98, 103-106 discourse awareness see awareness E Ellis 68 ELT corpus see corpus ELT materials 3, 4, 6, 7, 18, 51, 52, 89, 91, 106, 112, 113, 151, 153, 156 Emmott 35 English as Lingua Franca (ELF) 6, 14, 106-208 English as Lingua Franca in Academic Settings (ELFA) Corpus 207 English Norwegian Parallel Corpus (ENPC) 223 error analysis 68 evaluation 125, 129, 133 Evoking lexis 128-133 F feedback 247, 252
Finnish/English 7 Firth 26, 27 Fletcher 11 form focus 247, 252 formulaic sequences 205 Frankenberg-Garcia 8 Freiburg London Oslo Bergen (FLOB) corpus 80 Frown Corpus 80 G Gellerstam 78 German 278, 285, 288, 291, 292 German Corpus of Learner English (GeCLE) 114 German English as a Foreign Language Textbook Corpus (GEFL TC) 155-162 German/English 4, 110, 112, 152, 153, 155, 158 Gledhill 33, 34 Gleitman 47 GlossaNet 289, 295 Goldberg 47 Google 259, 261-264, 266, 267, 277, 281, 287, 289, 295 Grammar Safari 291 grammar-translation approach 215 Granger 2, 67, 69, 78, 80, 82, 116, 154, 208 Grimshaw 47 Grosz 35 Guardian, The 29, 33, 36 H Halliday 27 Hawthorne, Nathaniel 183, 184 Hoey 10-12 Hofland 217 HTML 279, 280, 283, 288 Hunston 7 I ICAME 1 ideational function 129 idiom principle 205 if-clauses 7, 158, 162 Inagaki 49, 50 Inscribed lexis 128-133
interference 68, 71, 104, 217, 218 interlanguage (IL) 68 International Corpus of English (ICE) 1 International Corpus of Learner English (ICLE) 1, 4, 52, 53, 69, 78, 84, 109, 112 International Sample of English Contrastive Texts (INTERSECT) Corpus 223 Internet Archive 281, 282, 288-290, 296 interpersonal function 125, 126, 178 IPAL Electronic Dictionary Project 57 J Japanese EFL Learner Corpus (JEFFL) 3, 46, 50, 51 Japanese/English 3, 45, 48, 52, 55, 57, 60 Johansson 217 Johns 16, 242, 281 K Kettemann 7 keyword analysis 127, 128, 130-133 kibbitzers 281 KwiCFinder see web concordancer L language awareness see awareness language production 219, 220 language reception 219, 221, 227 learner autonomy see autonomy learner corpus see corpus learner, analytical 205 learner, holistic 205 Leńko-Szymańska 4 Levin 57 lexical density 252 lexical item 15, 160 Linguist’s Search Engine (LSE) 290, 291 literary awareness see awareness literary corpus see corpus London, Jack 183, 184, 195 London Oslo Bergen (LOB) corpus 1 Louvain Corpus of Native English Essays (LOCNESS) 53, 78, 82, 84 Louw 25
M Malvern 5, 139, 141, 142 Mann-Whitney (or U) test see statistical tests Marco 7 Martin 128, 129 Mauranen 14 McEnery 69 Mean Segmental Type-Token Ratio (MSTTR) see statistical measures methodological/metatheoretical awareness see awareness Michigan Corpus of Academic Spoken English (MICASE) 6, 7, 197, 198, 208 Monoconc 197 Montrul 49 Morley 35 N Nesselhauf 4, 14 newspaper text 29, 33, 36, 289, 295 NLP tools Apple Pie Parser (APP) 56 Brill tagger 72 ChaSen 56 CLAWS tagger 93, 98 NP movement 49 O open-choice principle 205 open-source software 297 oral production see production Oxford English Dictionary 22 P parallel concordancing see concordancing parallel corpus see corpus Partington 35, 129, 130 Part-of-Speech (POS) n-grams 67, 71-73, 79 pattern grammar 22, 28 Pearson’s correlation see statistical tests PELCRA Corpus 4, 50, 89, 92 Pérez-Paredes 9 performative verbs 178 Pinker 46-48 Polish/English 4, 89, 90, 92, 103-106
Portuguese/English 8, 220-226 priming 10, 23-25, 28 problem-solution pattern 125-127, 131, 141 production oral 247, 249, 256 prosody 252 Prütz 3 R Real Academia Española 259, 261 reference corpus see corpus Renouf 110 Richards 5, 139, 141, 142 Roget’s Thesaurus 22 Römer 6, 7, 9, 12 Root Type-Token Ratio (RTTR) see statistical measures Roussel 217 S sample size 137, 139, 140, 145 schema theory 174 Second Language Acquisition (SLA) 48, 55, 63, 68 Seidlhofer 14 self-access concordancing see concordancing semantic association 25, 26, 28, 31, 37 semantic prosody 129 Shakespeare, William 179 Sidner 35 Sinclair 12, 13, 15, 16, 25, 27, 35, 110, 154 Spanish 9, 259-264, 266, 267, 285, 291 Spanish/English 247 speech acts 178, 180, 181 spoken corpus see corpus Sripicharn 8, 23 statistical measures Bilogarithmic Type-Token Ratio (BTTR) 141 Corrected Type-Token Ratio (CTTR) 141 D measure 137, 139, 142, 143, 145 Mean Segmental Type-Token Ratio (MSTTR) 140, 141 Root Type-Token Ratio (RTTR) 141 Type-Token Ratio (TTR) 5, 137-143, 145,
146 statistical tests chi-square 57, 61, 93-96, 98 cluster analysis 249, 250 Mann-Whitney (or U) 72 Pearson’s correlation 144 stemming 276 Stockholm Umeå Corpus (SUC) 3, 67, 71, 72, 82 Stubbs 25 stylistics 170, 175 Subcategorisation Frame (SF) patterns 3, 45, 48, 61 support verb constructions 109-118, 120, 122 Swedish/English 3, 4, 67, 71, 72, 80, 82, 84 syntactic variation 259, 261, 268 Systemic Functional Linguistics (SFL) 22, 125, 126, 128, 133, 174 T tagset English 87 Swedish 87 TOSCA-ICE 78 TextSTAT see web concordancer Thai/English 9, 235, 236, 242 theme/rheme 27, 125 Thomas, Dylan 38 TOEFL 2000 Spoken and Written Academic Language (T2K- SWAL) Corpus 206 Tono 3 TOSCA-ICE tagset see tagset transfer see interference translation 223 translationese 78, 81 TTR see statistical measures Turnbull 234 Twain, Mark 183, 184, 195 Type-Token Ratio (TTR) see statistical measures U Universal Grammar (UG) 47, 49 Uppsala Student English (USE) Corpus 3, 67, 71-73, 82
V vocd 142, 144 W web browsing 276 web concordancer Copernic 286 KWiCFinder 11, 273, 278-281, 284, 285, 289, 292, 294, 296, 297 TextSTAT 288 Web Concordancer 286 WebCONC 286-288 WebCorp 286, 287, 289 WebKWiC 288 Subject Search Spider 287, 297 Web Corpus Archive (WCA) see corpus web crawling 283 web search engine 273, 274, 276, 277, 282-284, 286, 287, 289, 291-294, 296, 298 web spidering 283 WebCONC see web concordancer
WebCorp see web concordancer WebKWiC see web concordancer Whitley 268 wh-questions 13 Widdowson 2, 12, 15, 16, 155 wildcard 277, 278, 282, 283, 287, 288, 297 Wilkins, Mary Freeman 175 Wilson 69 Winter 36 Wordsmith Tools 93, 127, 141, 187 World Wide Web 11, 259, 261, 273-276, 279-282, 284, 289, 294-296 Wray 205, 206 X XML 279 Y yes/no questions 153 Yule 35
Bionotes

Guy Aston is professor of English linguistics and Dean of the School for Interpreters and Translators of the University of Bologna at Forlì, Italy. His main research interests are contrastive pragmatics, conversational analysis, corpus linguistics and autonomous language learning.

Silvia Bernardini is a research fellow at the School for Interpreters and Translators of the University of Bologna at Forlì, Italy, where she currently teaches translation from English into Italian. Her main research interests are corpora as aids in language and translation teaching and the study of translationese through parallel, comparable and learner corpora.

Lars Borin is Professor of Natural Language Processing in the Department of Swedish, Göteborg University. His main research interests straddle the boundary of computational and general linguistics; in particular he has published on contrastive corpus linguistics, on the use of language technology in linguistic research and in the teaching of languages and linguistics, and on language technology in the service of language diversity.

Pascual Cantos is a Senior Lecturer in the Department of English Language and Literature, University of Murcia, Spain, where he lectures in English Grammar and Corpus Linguistics. His main research interests are in Corpus Linguistics, Quantitative Linguistics, Computational Lexicography and Computer Assisted Language Learning. He has published extensively in the fields of CALL and Corpus Linguistics and is also co-author of various CALL applications: CUMBRE Curso Multimedia para la Enseñanza del Español, 450 Ejercicios Gramaticales and Practica tu Vocabulario, published by SGEL.

Ngoni Chipere is Lecturer in Language Arts at the University of the West Indies. He completed his doctoral thesis in experimental psycholinguistics at the University of Cambridge in 2000. His post-doctoral studies at the University of Cambridge Local Examinations Syndicate and the University of Reading
were concerned with quantitative analysis of developmental trends in a corpus of children’s writing. His research interests straddle theoretical and applied concerns in linguistics and the psychology of language. His publications include: Understanding Complex Sentences: Native Speaker Variation in Syntactic Competence, published by Palgrave in 2003 and – with David Malvern, Brian Richards and Pilar Durán – Lexical Diversity and Language Development: Quantification and Assessment, soon to be published by Palgrave.

Mark Davies is an Associate Professor of Corpus and Computational Linguistics at Brigham Young University in Provo, Utah, USA. He has developed large corpora of historical and modern Spanish and Portuguese, which have been used (by him and by many students) to investigate several aspects of syntactic change and current syntactic variation in these two languages.

William H. Fletcher is Associate Professor of German and Spanish at the United States Naval Academy. His current research focusses on exploiting the Web as a source of linguistic data. He also has authored numerous papers on the role of multimedia in language learning and on the linguistic description of modern Dutch.

Lynne Flowerdew coordinates technical communication skills courses at the Hong Kong University of Science and Technology. Her research interests include corpus-based approaches to academic and professional communication, textlinguistics, ESP and syllabus design.

Ana Frankenberg-Garcia holds a PhD in Applied Linguistics from Edinburgh University and is an auxiliary professor at ISLA, in Lisbon, where she teaches English language and translation. She is joint project leader of the COMPARA parallel corpus of English and Portuguese, a public, online resource funded by the Portuguese Foundation for Science and Technology. Her current research interests focus on the use of corpora for language learning and translation studies.
Bernhard Kettemann is professor of English linguistics at Karl-Franzens-University Graz and currently head of the Department of English Studies. His main research interests are corpus linguistics, (media) stylistics, and the teaching and learning of EFL. Recent publications include Teaching and Learning by
Doing Corpus Analysis (co-edited with Georg Marko, 2002, Rodopi).

Agnieszka Leńko-Szymańska is a graduate of the University of Łódź, where she is Adjunct Professor and Head of the Teaching English as a Foreign Language (TEFL) Unit. Her research interests are primarily in psycholinguistics, second language acquisition and corpus linguistics, especially in lexical issues in those fields. She has published a number of papers on the acquisition of second language vocabulary. She teaches applied linguistics, foreign language teaching methodology and topics in psycholinguistics and SLA.

David Malvern is Professor of Education and Head of the Institute of Education at the University of Reading. A mathematical scientist by training, he read physics at Oxford and has been a Research Officer at the Royal Society, Visiting Professor in the Department of Educational Psychology, McGill University, Montreal, and a European Union and British Council consultant. He has been collaborating with Brian Richards on various aspects of language research since 1988.

Georg Marko teaches English linguistics at the Department of English Studies at Karl-Franzens-University Graz and Professional English at the University for Applied Sciences FH-Joanneum Graz. He is interested in the application of corpus linguistics to Critical Discourse Analysis. He is currently finishing his PhD dissertation on the discourse of pornography.

Anna Mauranen is professor of English at Tampere University. She has published widely in corpus studies, translation studies and contrastive linguistics. Her current research focuses on speech corpora and English as a lingua franca. She is running a research project on lingua franca English, and compiling a corpus of English spoken as a lingua franca in academic settings (the ELFA corpus).

Nadja Nesselhauf is an Assistant ("wissenschaftliche Assistentin") at the English Department of the University of Heidelberg, Germany.
She holds a PhD in English Linguistics from the University of Basel, Switzerland, where she taught various courses in Linguistics from 1999 to 2003. Her main research interests are linguistics and language teaching, phraseology, second language acquisition, and corpus linguistics.
Pascual Pérez-Paredes works at the English Department of the University of Murcia, Spain. He completed his doctorate in English Philology in 1999, and currently teaches English Language and Translation. His main academic interests are the compilation and use of language corpora, the implementation of Information and Communication Technologies in Foreign Language Teaching/Learning, and the role of affective variables in Foreign Language Learning. He is responsible for the compilation of the Spanish component of the Louvain International Database of Spoken English Interlanguage (LINDSEI) corpus. Recent publications include articles in Extending the Scope of Corpus-based Research: New Applications, New Challenges, edited by S. Granger and S. Petch-Tyson, and How to Use Corpora in Language Teaching, edited by J. Sinclair.

Having studied Egyptology, Linguistics and Computational Linguistics at Uppsala University, Klas Prütz currently works as a Corpus Research Assistant at the Centre for Language and Communication Research, Cardiff University. His research focuses on the development and evaluation of methodologies for large-scale corpus investigations. He is working on a PhD thesis concerning multivariate analysis of part-of-speech-determining contexts for word forms in Swedish texts.

Brian Richards is Professor of Education at the University of Reading and Head of the Section for Language and Literacy. A former teacher of German and English as a Foreign Language, his research interests have extended to early language development and language assessment, as well as foreign and second language teaching and learning. He obtained a doctorate on auxiliary verb acquisition from the University of Bristol in 1987 before moving to Reading to train teachers of French and German.
He is the author of Language development and individual differences (1990) and editor of Input and interaction in language acquisition (1994) (with Clare Gallaway) and Japanese children abroad (1998) (with Asako Yamada-Yamamoto). He has also published numerous articles and book chapters on language and language education, and has been a member of the editorial team of the Journal of Child Language since 1992.

Ute Römer studied English linguistics and literature, Chemistry and Education at Cologne University and now works as a researcher and lecturer in English
linguistics at the University of Hanover. She is currently finalising her PhD thesis, entitled Progressives, Patterns, Pedagogy: A corpus-driven approach to English progressive forms, their functions, contexts, and didactics. The study is based on more than 10,000 progressives in context and aims to demonstrate how corpus work can contribute to an improvement of ELT. Her main research and teaching interests include corpus linguistics, linguistics and language teaching, and language and gender. Her most recent research project centres on a monitor corpus of linguistic book reviews and its possible use in corpus-driven sociolinguistics. She has recently co-edited Language: Context and Cognition. Papers in Honour of Wolf-Dietrich Bald's 60th Birthday (2002) and published articles on corpus linguistics and language teaching.

Passapong Sripicharn is currently a lecturer in the English Department, Faculty of Liberal Arts, Thammasat University, Thailand. He received his Ph.D. in Applied Linguistics from the University of Birmingham, UK. His research interests include corpus linguistics, second language writing, and ESP/EAP.

Dominic Stewart is a research fellow at the School for Interpreters and Translators of the University of Bologna at Forlì, Italy, where he currently teaches English language and linguistics. His research interests include issues relating to the validity of corpus data within both the language and the translation classroom, and the use of reference corpora for translation into the foreign language.

Yukio Tono is Associate Professor of Applied Linguistics at Meikai University, Japan. He holds a Ph.D. in corpus linguistics from Lancaster University. His research interests include second language vocabulary acquisition, corpus-based second language acquisition, learner corpora, applications of corpora in language learning/teaching, dictionary use, and corpus lexicography. He serves on the editorial board of the International Journal of Lexicography.
His recent work includes Research on Dictionary Use in the Context of Foreign Language Learning (Max Niemeyer Verlag, 2001). He has also led two major learner corpus-building projects, JEFLL and SST, in Japan.
In the series Studies in Corpus Linguistics (SCL) the following titles have been published thus far or are scheduled for publication:

1. PEARSON, Jennifer: Terms in Context. 1998. xii, 246 pp.
2. PARTINGTON, Alan: Patterns and Meanings. Using corpora for English language research and teaching. 1998. x, 158 pp.
3. BOTLEY, Simon and Tony McENERY (eds.): Corpus-based and Computational Approaches to Discourse Anaphora. 2000. vi, 258 pp.
4. HUNSTON, Susan and Gill FRANCIS: Pattern Grammar. A corpus-driven approach to the lexical grammar of English. 2000. xiv, 288 pp.
5. GHADESSY, Mohsen, Alex HENRY and Robert L. ROSEBERRY (eds.): Small Corpus Studies and ELT. Theory and practice. 2001. xxiv, 420 pp.
6. TOGNINI-BONELLI, Elena: Corpus Linguistics at Work. 2001. xii, 224 pp.
7. ALTENBERG, Bengt and Sylviane GRANGER (eds.): Lexis in Contrast. Corpus-based approaches. 2002. x, 339 pp.
8. STENSTRÖM, Anna-Brita, Gisle ANDERSEN and Ingrid Kristine HASUND: Trends in Teenage Talk. Corpus compilation, analysis and findings. 2002. xii, 229 pp.
9. REPPEN, Randi, Susan M. FITZMAURICE and Douglas BIBER (eds.): Using Corpora to Explore Linguistic Variation. 2002. xii, 275 pp.
10. AIJMER, Karin: English Discourse Particles. Evidence from a corpus. 2002. xvi, 299 pp.
11. BARNBROOK, Geoff: Defining Language. A local grammar of definition sentences. 2002. xvi, 281 pp.
12. SINCLAIR, John McH. (ed.): How to Use Corpora in Language Teaching. 2004. viii, 308 pp.
13. LINDQUIST, Hans and Christian MAIR (eds.): Corpus Approaches to Grammaticalization in English. 2004. xiv, 265 pp.
14. NESSELHAUF, Nadja: Collocations in a Learner Corpus. xii, 326 pp. + index. Expected Winter 04-05.
15. CRESTI, Emanuela and Massimo MONEGLIA (eds.): C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages. ca. 300 pp. (incl. DVD). Expected Winter 04-05.
16. CONNOR, Ulla and Thomas A. UPTON (eds.): Discourse in the Professions. Perspectives from corpus linguistics. 2004. vi, 334 pp.
17. ASTON, Guy, Silvia BERNARDINI and Dominic STEWART (eds.): Corpora and Language Learners. 2004. vi, 311 pp.