Corpus Linguistics: Refinements and Reassessments
LANGUAGE AND COMPUTERS: STUDIES IN PRACTICAL LINGUISTICS No 69
edited by Christian Mair Charles F. Meyer Nelleke Oostdijk
Corpus Linguistics: Refinements and Reassessments
Edited by
Antoinette Renouf and Andrew Kehoe
Amsterdam - New York, NY 2009
Cover image: Collocational “heat map” for the word credit (detail); from the paper “Weaving web data into a diachronic corpus patchwork”, by Andrew Kehoe & Matt Gee.
Cover design: Pier Post
The paper on which this book is printed meets the requirements of "ISO 9706:1994, Information and documentation - Paper for documents - Requirements for permanence".
ISBN: 978-90-420-2597-4
E-Book ISBN: 978-90-420-2598-1
©Editions Rodopi B.V., Amsterdam - New York, NY 2009
Printed in The Netherlands
Contents

Introduction
Antoinette Renouf and Andrew Kehoe

1. Looking more closely at existing boundaries of the discipline

Corpus linguistics meets sociolinguistics: the role of corpus evidence in the study of sociolinguistic variation and change
Christian Mair

Creating corpora from spoken legacy materials: variation and change meet corpus linguistics
Joan C. Beal

Discourse linguistics meets corpus linguistics: theoretical and methodological issues in the troubled relationship
Tuija Virtanen

'Tis well known to barbers and laundresses: Overt references to knowledge in English medical writing from the Middle Ages to the Present Day
Turo Hiltunen and Jukka Tyrkkö

Comparing type counts: The case of women, men and -ity in early English letters
Tanja Säily and Jukka Suomela

2. Examination of a known language feature from a new point of view

Does English have modal particles
Karin Aijmer

A reassessment of the syntactic classification of pragmatic expressions: the positions of you know and I think with special attention to you know as a marker of metalinguistic awareness
Julie Van Bogaert

The functions of expletive interjections in spoken English
Magnus Ljung

3. Examination of the potential of a new corpus, tool, model or technique to extend linguistic knowledge

Change and constancy in linguistic change: How grammatical usage in written English evolved in the period 1931-1991
Geoffrey Leech and Nicholas Smith

Joseph Wright’s ‘English Dialect Dictionary’ in electronic form: A critical discussion of selected lexicographic parameters and query options
Alexander Onysko, Manfred Markus and Reinhard Heuberger

How representative are the ‘Philosophical Transactions of the Royal Society’ of 17th-century scientific writing?
Lilo Moessner

A multi-dimensional analysis of a learner corpus
Bertus van Rooy and Lize Terblanche

Weaving web data into a diachronic corpus patchwork
Andrew Kehoe and Matt Gee

4. Re-examination of known linguistic phenomenon in light of further/new data

“To each reader his, their or her pronoun”. Prescribed, proscribed and disregarded uses of generic pronouns in English
Elisabetta Adami

The interpersonal function of going to in written American English
Anna Belladelli

Re-analysing the semi-modal ought to: an investigation of its use in the LOB, FLOB, Brown and Frown corpora
Marta Degani

On the use of split infinitives in English
Javier Calle-Martín and Antonio Miranda-García

Exploring change in the system of English predicate complementation, with evidence from corpora of recent English
Juhani Rudanko

Encoding of goal-directed motion vs resultative aspect in the COME + infinitive construction
Sara Gesuato

A corpus-based analysis of invariant tags in five varieties of English
Georgie Columbus

Discourse presentation in EFL textbooks: a BNC-based study
Christoph Rühlemann

Awful adjectives: a type of semantic change in present-day corpora
Göran Kjellmer

5. Discussion Panel

Global English – Global Corpora: Report on a panel discussion at the 28th ICAME conference
Marianne Hundt
Introduction
Corpus Linguistics: Refinements and Reassessments
Antoinette Renouf and Andrew Kehoe
Research & Development Unit for English Studies, Birmingham City University
Stratford-upon-Avon as conference venue for the 28th International ICAME Conference provided an ideal setting for a field which sits on a methodological continuum of word-based English textual enquiry stretching from the index verborum, primarily biblical, of the years of early printing1, to today’s technologically full-blown corpus-based studies, by way of the miscellany of ‘partial’ and ‘complete’ concordances of Shakespeare2, produced with lesser or greater degrees of computational assistance, over the past 200 years. That continuum inevitably encompasses an evolution in the definitions and assumptions underlying notions such as ‘index’ and ‘concordance’ which are central to the study of English corpora. Throughout history, linguists and literary scholars have been impelled by their curiosity about a particular linguistic or literary phenomenon to seek to observe it in source texts by means of the prevailing technological tools. The fruits of each earlier enquiry in turn nourish the desire to acquire further knowledge, through more detailed or extensive observation, of other or newer linguistic facts becoming available at the frontiers of newer technology. As time goes by, the corpus linguist operates increasingly from a position of awareness of the known linguistic facts, the standard methodologies, the existing corpora and the available tools and text-processing technology. Corpus Linguistics, thirty years on, is less characterisable as an innocent sortie into corpus territory on the basis of a hunch, and increasingly as an informed, critical reassessment and/or extension of existing analytical orthodoxy and descriptions, in the light of the potential offered by new data and tools coming on stream.
The role of ICAME conference host afforded us the opportunity to foreground this aspect of corpus linguistics, and accordingly, the theme of our conference was ‘Corpus Linguistics Reassessed’. The response to this invitation was rich and, though diverse, showed that critical and informed reappraisal of the available facts, data, methods and tools was indeed a central preoccupation of the corpus linguistic research community. The title of this volume is thus ‘Corpus Linguistics: Refinements and Reassessments’. The selected papers, whilst categorisable across all these aspects, are grouped under the following headings:
1. Looking more closely at existing boundaries of the discipline
2. Examining a known language feature from a new point of view
3. Examining the potential of a new corpus, tool, model or technique to extend linguistic knowledge
4. Examining a known linguistic phenomenon in the light of further/new data
5. Discussion Panel
1. Looking more closely at existing boundaries of the discipline
Christian Mair opens section one on cross-boundary studies with a paper which looks beyond corpus linguistics, to the issues arising at its intersection with sociolinguistics. He shows how corpus data can provide new insights into sociolinguistic variation and change, specifically into patterns of variation not noticed or accurately described in previous sociolinguistic research, with reference to new data: the Jamaican component of the International Corpus of English. Joan Beal examines the intersection between traditional corpus linguistics and variationist studies, the latter traditionally focussing on spoken language and collecting private data sets. Professor Beal discusses 1960s Tyneside speech, the challenges and solutions involved in converting data on audio tape into a conventional corpus (NECTE), and plans for developing further corpora and common standards. Tuija Virtanen explores the ‘troubled relationship’ between corpus linguistics and discourse linguistics. She considers the theoretical and methodological issues involved in the application of corpus linguistic techniques to discourse analysis. She acknowledges that the two fields are difficult to interweave, but sets out the primary areas of commonality, discussing the potential benefits to practitioners in both fields of combining forces. Turo Hiltunen and Jukka Tyrkkö explore the intersection between traditional corpus linguistics and one aspect of discourse linguistics, namely discourse analysis. They examine the benefits of using corpus-linguistic techniques and tools to search for key lexis in the diachronic study of certain discourse features from Late Middle English onwards. This sheds light on unexplored discourse features and suggests interesting new hypotheses. 
Tanja Säily and Jukka Suomela venture beyond the standard repertory of corpus linguistic methods of quantification, and draw on the field of lexical statistics for more advanced measures, namely non-parametric statistics, in order to study morphological productivity and gender issues in a corpus of early English letters.
2. Examining a known language feature from a new point of view
Karin Aijmer opens this section with a re-analysis of English modality in the light of translation correspondences across parallel corpora. Professor Aijmer builds on her argument for the existence of a ‘modal particle’ in English, this time with reference to the discourse marker of course, which example she uses to demonstrate that ‘discourse marker’ and ‘modal particle’ are not just alternative labels for the same concept, but denote a functional split. Aijmer is one of the inspirations for Julie Van Bogaert’s study of the pragmatic expressions you know and I think, which she points out have been referred to as both ‘modal particle’ and ‘discourse particle’ by Aijmer (1997,
2002). Van Bogaert reassesses the syntactic classification of these pragmatic expressions in the literature, and overcomes limitations found there with a new classificatory system based on ‘scope’. Magnus Ljung makes novel use of an existing linguistic model of spoken interaction (Stenström, 1994) to conduct a pragmatic reassessment of expletive interjections. He acknowledges that the notion of interjections being pragmatic markers is controversial, but references Aijmer (2002) as supporting his position.
3. Examining the potential of a new corpus, tool, model or technique to extend linguistic knowledge
Geoffrey Leech and Nicholas Smith contribute the first paper to this section, reporting on their exploitation of the important new Lancs-31 corpus, the final part of the trio of corpora of text covering the period 1931 to 1991, to reassess how far trends of change already observed in the comparison of LOB (1961) and FLOB (1991) have themselves been undergoing change over the period in question, and to suggest motivations for aspects of ‘grammaticalization, colloquialization, Americanization and densification’. The next two papers each discuss the benefits and challenges of transforming an existing electronic textual data resource into a corpus. Alexander Onysko, Manfred Markus and Reinhard Heuberger discuss critically issues of digitisation, dialectology, lexicography and computational linguistics in the processing of Joseph Wright’s English Dialect Dictionary. Lilo Moessner examines critically the Philosophical Transactions of the Royal Society of 17th-century scientific writing and the degree to which they can be deemed to be representative. She achieves this by submitting them to Biber’s multidimensional analysis. Bertus van Rooy & Lize Terblanche also take Biber’s model but they adapt it to separate ‘style dimensions from grammar and information presentation dimensions in a way that the original model did not allow’. Their new multidimensional model allows the authors to move on from isolated linguistic features and examine their combined functional effects. The Tswana Learner English corpus is compared to the Louvain LOCNESS corpus. Andrew Kehoe and Matt Gee round off section three with an account of the special role of the WebCorp Linguist’s Search Engine in supplementing the picture of language provided by existing corpora, not simply by supplying the latest coinages from the web but by filling in vital information gaps about lexical change across time in British and American English from a ‘patchwork’ of corpora. 
They distinguish this approach from that of Leech and Smith, who take the ‘thirty-year interval’ approach to the study of grammatical change.
4. Examining a known linguistic phenomenon in the light of further or newer data
Section four gathers together a set of nine papers in which authors assess the literature on a known feature of language, and then seek to extend the established description in one direction or another with reference to further, often newer, data. Most take as their object of study an aspect of grammar, though where they steer their fresh investigation varies, facilitated by the nature of the new data which is consulted. Elisabetta Adami reassesses the uses of generic pronouns, contrasting established descriptions with her new findings in recent British and American corpus data, namely the academic written sections of BNC, ANC, the Brown family, and several ICE components. Several writers bring a newly diachronic perspective to existing studies. Anna Belladelli takes a diachronic look at the causes of a spread in the use of going to, going beyond the ‘colloquialisation’ explanation offered by others, including Leech and Smith (this volume), with reference to the Brown and Frown corpora of American English. Marta Degani fills a gap in the description of modals by analysing the use of the hitherto poorly investigated semi-modal ought to more fully, from a similarly short-term diachronic perspective, in the Brown corpus family of British and American English. She finds that her data confirm the general pattern of decrease in the frequency of modal verbs from the period found by Leech (2003, 2004, 2006 and this volume), and ‘sustain Leech’s observation that this decline has been more drastic in the case of infrequent modals such as shall, ought to and need (Leech 2003: 228-9)’. Javier Calle-Martín and Antonio Miranda-García seek to account for a longer-term diachronic change, reporting on their survey into existing work on the use and acceptability of split infinitives from the 17th Century to the present day. They are able to improve on this through the evidence provided by the Lampeter Corpus of Early Modern English Tracts, CLMET, CEN and the BNC. 
Two writers extend existing descriptions by taking a semantic perspective: Juhani Rudanko reassesses English predicate complementation in this light, using CLMET (3rd part) and the ‘UK Books’ subcorpus of the Collins-Cobuild Demonstration Corpus; while Sara Gesuato supplements existing descriptions of complex predicates with new findings in the Collins-Cobuild Bank of English online about the semantic preferences, as well as the frequency and syntactic environments, of resultative come constructions. Adding a variationist component to existing descriptions of the sociolinguistic features and functions of single invariant tags, Georgie Columbus moves beyond individual language varieties to devise a full corpus linguistic description of the class conducted across five ICE corpus varieties of English (British, Indian, New Zealand, Singapore and Hong Kong). Meanwhile, Chris Rühlemann takes a new, discourse-oriented perspective on the class of reporting verb BE + like. Building on previous studies, Rühlemann
examines this structure in relation to its presentation in the BNC and in EFL textbooks. Göran Kjellmer rounds off section four by shifting the focus from grammar to lexis, and in particular to lexical semantics, and studies the change undergone in the CobuildDirect corpus by adjectives conventionally expressing the sense of ‘awfulness’.
5. Discussion Panel
Section five reports on an ICAME panel discussion, entitled ‘Global English – Global Corpora’. A panel consisting of Anna Mauranen, Joybrato Mukherjee and Pam Peters, with Marianne Hundt as Chair, take the timely opportunity to air their views on what are widely-used varieties of ‘International English’, touching on a number of issues ranging from ‘ownership’ to whether adequate descriptions are available or even possible from the language learning point of view. Peters assesses the adequacy of language corpora to support such ambitions, deciding that there is a need for improvement, not just in corpus content but in range; a concluding note to the panel report also criticises a current lack of corpus compilation documentation which could ensure caution in interpretation. Mauranen and Mukherjee usefully set up an opposition on the status of these language variants. Mukherjee sees ELF (English as a lingua franca) not as ‘a well-defined variety of English’ but as ‘an umbrella term for a multitude of variants’, a ‘makeshift code’ without a locality; while Mauranen asserts that ‘many communities of practice have adopted ELF as their de facto language, and… the ensuing norms of use are regulated by the participants…ELF is also the language of wide and diffuse networks of uses and users’. Questions from the floor are summarised, together with discussion on such issues as accommodation, nativeness, norms and ‘common core’ English. The assembled gathering concludes that the international core of English cannot yet be described; that ‘ownership’ is still a controversial question; and that what Mair & Mollin (2007) call ‘standard ideology’ is an issue affecting the status of ELF and norms for teaching.
Notes
1 The first concordance to the New Testament in English was published in London ca. 1535 by Thomas Gybson; the first English concordance to the whole Bible was that of John Marbeck (London, 1550); Alexander Cruden's concordance to the whole English Bible was completed in 1737 (London, 1738).
2 Shakespeare concordances were first created manually, as in Bartlett (1889) or Steveson (1953); and later on electronically derived, as in Spevack (1968-80).
References
Bartlett, J. (1960 [1889]), A Complete Concordance or Verbal Index to Words, Phrases and Passages in the Dramatic Works of Shakespeare with a Supplementary Concordance to the Poems, London: Macmillan.
Mair, C. & S. Mollin (2007), “Getting at the standards behind the standard ideology: what corpora can tell us about linguistic norms”, in: S. Volk-Birke and J. Lippert (eds.) Anglistentag 2006 Halle: Proceedings, Trier: WVT, 341-353.
Spevack, M. (1968-1980), A Complete and Systematic Concordance to the Works of Shakespeare, 9 vols., Hildesheim: Georg Olms.
Steveson, B. (1953), The Folger Book of Shakespeare Quotations, New Jersey: Folger.
Corpus linguistics meets sociolinguistics: the role of corpus evidence in the study of sociolinguistic variation and change
Christian Mair
University of Freiburg
Abstract
The contribution opens with a general discussion of the relationship between sociolinguistics and corpus-linguistics. The point is made that while the concerns of these two traditions in the study of linguistic variability and variation were rather different at the outset they have meanwhile developed in such a way as to make co-operation fruitful and, indeed, necessary. This point is illustrated from the author’s own work on the recently completed Jamaican component of the International Corpus of English. The variables analysed are the use of person(s) as a synonym for people, the presence or absence of subject-verb inversion in questions, the modals of obligation and necessity, negative and auxiliary contraction and, finally, the use of the “new” quotative be like.1
1. Introduction
By a chronological accident both computer-aided corpus linguistics and variationist sociolinguistics emerged as new subfields of linguistic research at about the same time – in the early 1960s. Both, as we know, have gone on to expand and prosper. However, in the early days there was little to suggest that important contact zones might develop in which the two fields would cross-fertilise in unforeseen ways. In early corpus linguistics an understandable bias developed towards the study of the written standard (that is precisely the variety which remained outside the scope of classical sociolinguistics) and towards the study of lexico-grammar (whereas the investigation of phonetic variation dominated in early sociolinguistics). With few commendable exceptions, such as, for example, the London-Lund Corpus, which contained extensive prosodic mark-up, corpora of spoken English reduced the complexity of live speech to orthographic transcription, thus rendering the material unsuitable for the study of pronunciation. This bias towards written and standard English in corpus linguistics is now gradually being redressed. Owing to the immense amount of work necessary in the compilation, there is still a dearth of spoken-language corpora which allow access to pronunciation and prosody, but, unlike the earliest corpora of spoken English, more recently compiled resources such as the British National Corpus (spoken-demographic component) or the Longman Corpus of Spoken American English (http://www.pearsonlongman.com/dictionaries/pdfs/Spoken-American.pdf) make available the speech of a broad social range of informants. Corpora devoted
to the New Englishes and emerging standards inevitably contain instances of nonstandard usage, and a small number of corpus projects – such as the Freiburg Corpus of English Dialects (FRED) or the Lancaster Corpus of Written British Creole – are explicitly devoted to the documentation of non-standard varieties. In sociolinguistics there has been a similar broadening of the database. Whereas in the early days the focus was almost exclusively on the spontaneous language use of precisely defined “local” speech communities, recent work has placed emphasis also on communities of practice, larger, more unstable and more difficult-to-define networks of communication, frequently characterised by elements of stylized and conscious language use.2 One result of this trend has been that public speech, language use in the media and even written language are no longer beyond the pale in sociolinguistics. Consider, for example, an important recent (2003) special issue of the Journal of Sociolinguistics on “Sociolinguistics and globalisation,” which will be referred to again in Section 7 below and which, alongside more mainstream sociolinguistic fare, devotes three articles to subjects such as “Global schemas and local discourses in Cosmopolitan” (Machin & Leeuwen 2003), language use in Japanese rap music (Pennycook 2003) or inflight magazines (Thurlow & Jaworski 2003). The technicalities of corpus compilation and use of corpora came to the fore as one of the central concerns at a recent major sociolinguistics conference (cf. Beal et al., eds. 2007). The successive widening of the database both in corpuslinguistics and in sociolinguistics has led to a blurring of formerly fixed boundaries and the emergence of a contact zone between the two subfields. 
A corpus linguist working on the spoken-demographic portions of the BNC requires profound knowledge of the urban dialectology of contemporary Britain; conversely, the rapidly growing number of publicly available corpora of English contains an increasing amount of material which sociolinguists would disregard at their peril. We have thus arrived at a situation in which the question providing the title for Meyer (2004) – “Can you really study language variation in linguistic corpora?” – tends to convey not so much genuine scepticism as a note of irony and mock-disbelief. One controversial point between sociolinguists and corpuslinguists will probably remain the definition of what constitutes proper fieldwork. To the purist, true fieldwork requires that the researcher has full control over every aspect of data collection, annotation and processing. However, less risky and less laborious strategies – such as researchers inviting international student informants into their offices to elicit data on non-standard usage – have been known to be honoured by the encomium “field work”. On such a more generous definition, a corpus linguist looking for instances of the affirmative aye in the speech of middle-class and working-class males in the spoken-demographic portions of the BNC could well claim that he or she was engaged in sociolinguistic fieldwork of sorts. To sound the programmatic claim that corpuslinguistics and sociolinguistics have now developed to a stage where they simply must pool their resources for mutual benefit is one thing. To look for existing successful
corpuslinguistic contributions to variation studies which might impress sociolinguists sufficiently to consider closer cooperation is, of course, another. Corpus studies can boast a proud record in one area of variation which is somewhat marginal to sociolinguistics, i.e. the study of variability within the standard conditioned by style, register, medium (speech/writing) or text type (see Johansson (forthcoming) for a convenient summary). The work of Douglas Biber and his associates may be singled out here, both for its quality and originality and for its comprehensiveness, in that it places equal emphasis on synchronic regional and stylistic variability (Biber 1988, Biber, ed. 1994, Biber et al. 1999) and diachronic change (Biber & Finegan 1989, Biber 2003). Another area of success comes a little closer to the core concerns of sociolinguistics: empirical documentation of regional variation in standard Englishes around the world. The study of contrasts between British and American English, first on the basis of the Brown and LOB corpora and subsequently including many further resources, has been one mainstay of corpuslinguistic research since its inception. Prominent among current projects devoted to this problem is, of course, the International Corpus of English (ICE – see Greenbaum, ed. 1996). Interestingly enough, however, the most substantial dialogue between corpuslinguistics and sociolinguistics so far has developed not around the study of present-day English but of variability and change in older stages of the language. Corpus-based historical sociolinguistics has already come to the fore as a mature area of research (Nevalainen & Raumolin-Brunberg 2003, Nevalainen, ed. 2006) – probably because in this area the data is scant and no battles of faith can arise about the proper methods of fieldwork. 
After this general introductory survey, I will focus on the discussion of specific empirical and theoretical issues which are bound to arise when corpuslinguistics meets sociolinguistics. I will do so mainly on the basis of my own experience working on the recently completed Jamaican component of the International Corpus of English (ICE).
2. ICE Jamaica: potential and limitations of corpus-based sociolinguistics
Linguistic research on the language situation in the Anglophone Caribbean has traditionally focused on the English-lexifier creole languages of the region (or the basi- and mesolectal parts of the creole-English continuum), neglecting the emerging local variety of standard English. To redress this imbalance, the English Department of the University of Freiburg and the Department of Language, Linguistics and Philosophy at the University of the West Indies, Mona, Jamaica, have cooperated to produce the Jamaican component of the International Corpus of English (ICE). In line with ICE guidelines,3 the corpus comprises about one million words, sampled over a broad range of written and spoken textual genres but generally produced by educated speakers (and not a demographically representative cross-section of the population as a whole).
With text-collection, transcription and mark-up approaching completion, project-related research is currently moving from the pilot stage into the main phase. The project aims at contributing toward a linguistic geography of English in the Caribbean by providing detailed phonetic and lexico-grammatical descriptions of Jamaican English, as well as by examining important pragmatic and sociolinguistic aspects of the use of this variety by educated Jamaicans, including its use in code-switching with Creole/Creolised English. Furthermore, it is hoped that our results will help to shed further light on questions of standardisation in the context of English as a world language, by comparing the language situation of former colonies with English as a native language (e.g. New Zealand) or a second or official language (e.g. India) to that of the Caribbean, which is of particular interest in this respect due to the existence of its creole substrate. Such “cross-variety,” comparative research is much needed in studies on World Englishes and was one of the foremost research goals envisaged by the founders of the ICE project. Important among the “beyond the corpus” questions are attitudes towards this emerging standard held by speakers and writers and its position with regard to Jamaican Creole, the local mass vernacular. The emerging Jamaican standard is being shaped by three major forces:
(i) the persistent but probably declining influence of a traditional colonial British norm;
(ii) growing influence from the US;
(iii) growing direct and indirect influence of the Jamaican Creole substrate.
In addition to these, some independent innovation of the type to be expected in any living language is likely to be encountered, as well. Clearly, none of the available ICE corpora was originally designed for sociolinguistic research. The focus was on regional variability in standard English, on the documentation of the New Englishes, including the second-language varieties that have arisen in the wake of decolonisation in the second half of the 20th century, and on stylistic variation within any one of these standards. High hopes were pinned on the opportunity to compare features across varieties in currently ten, and ultimately sixteen, parallel corpora.4 Indeed, this comparative perspective figures prominently in current research undertaken on the basis of ICE Jamaica. Thus, Andrea Sand (2004, and forthcoming) has used ICE Jamaica in conjunction with several other ICE corpora in order to identify the pre-determined breaking points in English grammar or, in other words, those intransparent or otherwise fragile areas of the linguistic system which will give rise to variability whenever the language is transported into new regions, adopted by new groups of speakers as a second or first language or even learned by foreigners. The focus in this type of corpus-based variation studies is on grammatical theory and typology as much as on the narrowly sociolinguistic issues of community-internal social variation and the assignation of prestige and stigma to variant forms of a given variable.
Being a sample of the local acrolect or emerging standard, ICE Jamaica is obviously unsuitable as a stand-alone resource for a sociolinguistic investigation of the use of English in Jamaica. Any analysis based on it would have to be complemented by studies of language use in the mesolectal range (such as were carried out – using a Labovian approach – by Patrick (1999)). As I intend to demonstrate in the following five case studies, though, ICE Jamaica does have considerable sociolinguistic potential once ways are found to identify that portion of corpus-internal variability which is sociolinguistically relevant. In other words, the question is how to use the corpus in order to access and reconstruct a sociolinguistic space beyond the corpus. The first of the variables to be investigated is a lexical one – choice between neutral people and formal persons to refer to a plurality of human beings. The second and third – subject-operator inversion in main-clause wh-questions and modal expressions of obligation and necessity – are grammatical. The fourth is morphological in terms of form, but pragmatic-stylistic in terms of textual function: choice between full and cliticised or contracted forms of certain auxiliaries and the negator not. The fifth and final phenomenon to be looked at will be instances of the “new” quotative be like in Jamaican English. At first sight, this seems to be a straightforward case of lexical innovation under American influence, but on closer inspection it turns out to involve complicated discursive processes of the “globalisation of vernacular features.”5
3. Too much person? “Person/people” as a sociolinguistic marker in Jamaican English
Before becoming tangled in the complexities of the Creole-English continuum which informs the actual use of English in Jamaica, it is useful to establish its two extreme ends with regard to the variable studied here. In traditional Creole the noun/pronoun smadi (from English somebody) is the most general reference to an individual human being. It functions as an indefinite pronoun but, depending on the context, could also be considered one translational equivalent of English person.6 The plural of smadi is piipl (obviously derived from the English people). In all varieties of English the noun person can, of course, be pluralized but persons is rarely used outside formal or technical contexts; the usual way of referring to a plurality of human beings is people. In Jamaican English, however, the word person (in singular and plural) is firmly established in mesolectal and acrolectal usage and even displays a number of interesting grammatico-semantic properties which have no immediate equivalent in other varieties of English (as we shall see below). On the basis of the then available written data from ICE-Jamaica, Mair (2002: 48) noted that the plural form persons was far more frequent in Jamaican texts than in texts from corresponding ICE material from Britain, New Zealand and East Africa. With ICE-Jamaica now completed, it is of course tempting to
investigate whether this peculiarity is confined to written usage only or also evident in the spoken domain. As the following examples show, the first lesson to be learnt was that it was not feasible to restrict attention to the plural persons in the spoken data from ICE Jamaica:

(1) No no but they’re not around but what you find is that the persons who are teaching JAMALs [Jamaican Movement for the Advancement of Literacy teaching modules] are person like me who no know nutten but are scared of word …
(2) Worst if you value the person friendship and you think the person is somebody you want to keep in touch with there’s no way you’re going to I mean let that candle [go] out – you’re going to always try to keep the candle burning …
(3) And who was you, uhm, who were the person you [word] Who was the person that you went with?
Example (1) exhibits a clear code-switch into (fairly basilectal) Jamaican Creole, and the second mention of person is, hence, not marked for plural. In (2), the genitive is not marked, which, like the absence of inflectional plural marking, is an occasional option in (upper mesolectal) informal Jamaican English. Example (3) similarly shows the two conflicting or complementary linguistic systems interacting in the online production of speech, this time involving subject-verb agreement and inflectional plural marking. Table 1 below summarises the findings from the now available face-to-face conversations in ICE-Jamaica (=texts S1A-1 through 90, c. 180,000 words), in comparison to the corresponding British, New Zealand and Irish material from ICE:

Table 1: Frequency of people vs. person(s) in the direct conversations of ICE-GB, ICE-NZ, ICE-IE and ICE-JA

          ICE-GB  ICE-NZ  ICE-IE  ICE-JA
people       411     449     275     663
person        76      66      48     157
persons       2*                     113
[*of which one read aloud]
significances (χ²): people:person – p < 0.01; people:person+persons – p = 0

Note the virtual absence of the plural persons from contemporary spoken British, Irish and New Zealand English, whereas it remains a viable synonym for people in spoken Jamaican English. A first explanation for this state of affairs might be that we are dealing with archaic usage. Some support for this view is provided by data from the OED quotation base which are summarised in Figure 1 below:
[Chart: proportion of people vs. persons in the OED quotation base (0–100%), by half-century from 1351–1400 to 1951–2000]
Figure 1: People vs. persons in the OED quotation base

The relative frequencies of people vs. persons were calculated for the second half of every century since the 14th and, as can be seen, the frequency of persons has diminished from a high of c. 40 per cent in the latter half of the 17th century to below 10 per cent. What we find in written (Mair 2002) and spoken Jamaican usage today is roughly comparable to the British English of the 18th and 19th centuries (as it is documented in the very heterogeneous written quotations from the OED). As Jamaican English is certainly not the only ex-colonial variety which has on occasion been considered to tend towards archaic or old-fashioned usage, it is instructive to compare the findings from ICE-Jamaica to second-language varieties from India, Singapore, Hong Kong and the Philippines:7

Table 2: Frequency of people vs. person(s) in the direct conversations of ICE-JA, ICE-India, ICE-Singapore, ICE Hong Kong and ICE Philippines

          ICE-JA  ICE-India  ICE-Sin  ICE-HK  ICE-PH
people       663        556      345    1302     330
person       157        103      109     155     143
persons      113         35        3       6       4
significances (χ²): people:person – p = 0; people:person+persons – p = 0

As Table 2 shows, parallels are restricted to the singular. As for the plural persons, Indian English displays some weak similarity with Jamaican English, whereas Singaporean, Hong Kong and Philippines English pattern like the two natively spoken varieties (GB and NZ).8 Once again, the “colonial lag” has not provided an over-arching explanatory framework for developments in World
Englishes but has been exposed as the myth it probably is (cf. Görlach 1987, Hundt forthcoming). The appropriate strategy of investigation thus is to treat each variety in its own right and draw up synchronic formality profiles which ideally would be based on a large number of lexical and morphosyntactic formality markers – for example pairs of near synonyms of etymologically Germanic and Romance origin such as fight-combat, help-assist(ance), spending-expenditure or surviving archaisms such as upon for on. Unfortunately, though, given the size of the ICE corpora, search results for most purely lexical variables are bound to remain tentative. For example, it is interesting to note that the direct conversations from ICE-GB contain not a single relevant instance of either the verb assist or the noun assistance, two formal synonyms of help. (The one instance of assistant found occurred in the collocation assistant manager, in which it is not interchangeable with helper). By contrast, fifteen instances were found in the corresponding portions of ICE Jamaica.9 The results for the on-upon variable are inconclusive in the specific instance of Jamaican English because in this variety upon is not necessarily archaic but could be motivated by Jamaican Creole pan “on”. One relevant morphosyntactic formality indicator, namely auxiliary and negative contractions, will, of course, be treated in depth in Section 6 below. Seen in conjunction with evidence from other formality markers, it is plausible to assume that the noticeable frequency of the word person(s) is at least partly due to the fact that in the Jamaican sociolinguistic situation English is per se a formal choice, particularly in the spoken domain. Additionally, there may well be a tendency towards hyper-correction, i.e. to avoid lexical material such as piipl which is also present in Jamaican Creole. 
Note, however, that the corpus contains many examples, including those listed in (1) to (3) above, which are far from formal, as is shown, for example by the fact that the noun person occurs in passages otherwise displaying Jamaican Creole features and itself occasionally lacks standard English inflectional endings. Therefore, we should consider a third factor: incipient grammaticalisation, with person developing features of an indefinite pronoun. A “general process whereby generic nouns give rise to pronominal categories” is richly attested in the languages of the world, and person is indeed one of several starting points for this pathway of grammaticalisation (Heine/ Kuteva 2002: 232-233). It is familiar from English-based pidgins and creoles, particularly in West Africa.10 In Caribbean English-lexifier creoles, person is not the typical exponent of the category “indefinite pronoun”, but cases of incipient grammaticalisation are documented. Thus, Allsopp (1996: 437, s.v. person) draws attention to a number of common uses in which person is a translational equivalent of various English indefinite and interrogative pronouns, giving the following citations from Barbadian usage: No person is there, at the door, Which person goin[g] pay all dat money?, and Who the person is? Of these constructions at least the last two seem to be of wider currency in Caribbean Englishes. Thus, which person gwine pay all dat money? and who de person? are acceptable in informal Jamaican English.11
Allsopp notes a similar tendency for the word people to be used “as a casual indef[inite] pron[oun], in contexts signalling contempt” (1996: 436, s.v. people). One of his illustrative examples is Those are the underlying evils of Trinidad society. Each man thinks he is people. Is time to stop all that, which shows people being used as equivalent of somebody [important]. It is tempting to assume that the occasional vernacular use of person in pronominal function is a direct boost on the frequency of the word in acrolectal regional Englishes, and that a similar usage involving plural people is an added indirect motivation to use persons – on the assumption that a spontaneous impulse to use people is checked in formal English through the tendency towards hypercorrection noted above, which is expected to encourage a realisation as persons instead. Regardless of how we account for the diachronic origin of the phenomenon, however, one thing is clear. Synchronically, the use of person(s) for people is attested in Jamaican English to an extent which goes beyond any other available ICE corpus, be it native- or second-language, and thus presents a clear case of a statistical regionalism.
4. Main-clause order in wh-questions
Along with the use of me instead of I in co-ordinate noun phrase subjects (me and my Dad went fishing), the use of never as an invariable past-tense negator (I never met him last night) and the use of the base form of adjectives in adverbial function (some people work good under pressure), the lack of subject-operator inversion (or do-support) in questions is one of the four non-standard morphosyntactic features which Kortmann and Szmrecsanyi (2004: 1193) have shown to have the widest distribution in non-standard varieties of English around the world in their discussion of “vernacular universals” or “Angloversals.” The direct conversations of ICE-Jamaica contain more than enough material to investigate the spread of this phenomenon in the emerging local standard. A search for all “wh”-interrogative pronouns (including, of course, how) was undertaken. It showed that while “correct” question grammar remains the statistical norm in the data, questions without inversion are common and thus belong among the non-standard syntactic variants which apparently have very little stigma attached to them, comparable, for example, to the stopping of the voiced dental fricative ([ð] → [d]) on the phonetic plane (on which see Irvine 2004). Note that the absence of inversion in main-clause questions seems to be exceedingly rare in ICE-GB. A spot-check of the 77 relevant questions in texts S1A 1 to S1A 10 did not yield a single clear example.12
Table 3: Subject-verb inversion in main-clause wh-questions in ICE Jamaica (direct conversations)13

               extrapolated   extrapolated   extrapolated   per cent
               frequency      frequency      frequency/
               wh*            how            all
inversion          1259            261           1520          77.6
no inversion        378             60            438          22.4
total              1637            321           1958         100.0
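The combined column and the percentages in Table 3 follow directly from the wh* and how columns. The following is a minimal sketch of that arithmetic; the counts are copied from the table, while the names `table3` and `combined_shares` are illustrative assumptions and not part of the original study.

```python
# Counts copied from Table 3 (extrapolated frequencies, ICE Jamaica direct conversations).
# The variable and function names are hypothetical, chosen for this illustration only.
table3 = {
    "wh*": {"inversion": 1259, "no inversion": 378},
    "how": {"inversion": 261, "no inversion": 60},
}


def combined_shares(table):
    """Sum each outcome over all question words; return (totals, percentage shares)."""
    totals = {}
    for counts in table.values():
        for outcome, n in counts.items():
            totals[outcome] = totals.get(outcome, 0) + n
    grand_total = sum(totals.values())
    shares = {outcome: round(100 * n / grand_total, 1) for outcome, n in totals.items()}
    return totals, shares


totals, shares = combined_shares(table3)
print(totals)  # combined counts per outcome
print(shares)  # percentage of all wh-questions per outcome
```

On these figures, roughly one main-clause wh-question in five lacks inversion.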
Apart from the syntactically motivated absence of inversion, there is phonetically driven ellipsis of do/did or the auxiliary are through assimilation in rapid speech which is found in many kinds of informal English (what did she say → what she [] say; what are you doing → what you [] doing). A possible instance from ICE-GB could be the following, for which we could assume the pronunciation [], without an overtly realised operator do:

(4) Oh what d’you mean by programming in Pascal (S1A 8)
However, the original sound recording made available with the second release of ICE-GB has [ ] in this case and thus supports the transcription. By contrast, ICE Jamaica contains several examples which could be regarded as phonetically conditioned deletion of do or are.

(5) What you think about that?
(6) How we going to do it?
However, in view of the far greater number of instances which are unambiguously syntactic in nature it is questionable whether there is even a need to invoke such phonetic factors. Consider the following typical instances:

(7) And where you went to high school?
(8) Why you choose to do psychology?
(9) What exactly they do up here honestly?
(10) So why it not happening at that school?
(11) What that has to do with it though?
In none of these examples could phonetic assimilation lead to the deletion of the operator (where did you go …, why did you choose …, what exactly did they do …, so why’s it not, what does it have14). In many others, the operator is retained, but stays in place after the subject:

(12) Why you don’t like to stay home with your mother?
(13) So how long you’ve been working here?
(14) When you’re going?
Note that all examples so far have been taken from passages of text which are located very much at the (standard) English end of the Creole-English continuum, as with the exception of the lacking subject-operator inversion they display no direct influences of the Creole substrate. This means that this construction does not have much stigma associated with it, and that we should not assume code-switching into Creole when we find it occurring on its own. Such code-switches, however, do occur when absence of inversion combines with clearer (and more stigmatised) Creole features such as lack of inflection for the 3rd person singular or absence of the copula be, as it does in a small number of cases:

(15) How much it cost?
(16) So I went to him afterwards and I said uhm what wrong?
The material additionally contains a number of self-corrections by speakers which open up interesting discourse-analytical and processing perspectives. There are cases in which speakers move from an inverted question to an uninverted one, presumably in an attempt to create a more relaxed conversational atmosphere (17 and 18 below), and there are instances of the reverse, speakers correcting a spontaneously produced non-standard form to a standard one (19):

(17) A: So how do you think that impact because they see it as a drug
     B: Impact on what?
     A: On the children and on society on y you know because they associate Rasta with uhm weed and you do smoke so how you think that impact on on on your relationship
(18) So what do you suggest What do you suggest What you suggest that we do to to uhm to rectify that situation
(19) What was primary school what uhm primary school prep school primary school you go to did you go to
Given that the absence of subject-operator inversion (or do-support) in questions has been identified as one of the most widespread grammatical features of the New Englishes and non-standard varieties in general, its presence in Jamaican English is not a surprise in itself. A comparative look at the spread of the phenomenon across several ICE corpora (which because of the extremely high frequency of questions remains beyond the scope of the present paper) would be very useful, however, in order to find out whether we are dealing with an “Angloversal,” an unmarked choice in the New Englishes which tends to arise irrespective of the particular local linguistic ecology, or with a contact phenomenon, because – after all – uninverted questions are normal in Jamaican Creole. Assessing the relative impact of universal and language-specific factors in variety formation is an important task in contact linguistics. With regard to a more specific sociolinguistic research agenda, the role of the variable in managing conversational atmosphere and accommodation among participants, which has become obvious from the illustrations in examples (17) to (19), is of great interest in a qualitative interaction-based sociolinguistic approach.

5. The modals of obligation and necessity
The modals of obligation and necessity represent one of the well-documented areas of grammatical contrast between British and American English, the globally dominant reference standards. Moreover, this fragment of the grammar has been subject to fairly rapid diachronic change over the past three centuries, with relevant phenomena including the spread of have got to (on the back of earlier have to – see Krug 2000), the decreasing frequency of must and the rapid spread of need to (Mair 2006: 103-108, Mair/ Leech 2006: 326-329). These modals are thus an almost perfect diagnostic to assess the synchronic regional orientation of a New English with regard to British or American norms and also its degree of linguistic conservatism. Table 4 below presents the findings from the Santa Barbara Corpus of Spoken American English (in the absence of an ICE-USA) and five ICE corpora.

Table 4: Obligation and necessity in the Santa Barbara Corpus of Spoken American English and the conversation components of five ICE corpora (S1A 1–100)

Form                 Santa Barbara  ICE-GB  ICE-NZ  ICE-IE  ICE-India  ICE-JA
must                            59      97     136     118        206     124
must not/ mustn’t                0       6       3       3          1       3
need not/ needn’t                0       1       0       3         11       0
NEED* to                       111      51      57      50         18     156
NOT* need to                     7       8      15       6          1       4
HAVE* to                       448     269     364     430        585     627
NOT* have to                    51      27      29      22         16      14
HAVE* got to                    12     118     114      11          4       2
HAVE* gotta                     18       0       0       0          0       1
got to                           4       9      42       0          4       6
gotta                           96       0       0       1          0       6
*CAPITALISED forms stand for all morphological variants, in this case need, needs, needed, needing; NOT stands for do not, does not, did not, don’t, doesn’t, didn’t, shouldn’t, etc.

Owing to the different sizes of the corpora, the findings from the Santa Barbara corpus and the ICE-GB conversations are not straightforward to compare, but one thing which they do show is the expected contrast in the frequency of have got to
– high in British English and very low in American English. Note also, on an issue which is not directly related to the concerns of the present paper, that while HAVE got to is attested at a rate comparable to British English in New Zealand, it is rare in Irish English. As regards the findings from the five ICE corpora themselves, there is no easy explanation for the fact that have to, the most common form in all corpora, should be so much more frequent outside Britain.15 Other than that, we note an almost uncanny similarity of preferences between British English and New Zealand English. Indian English stands out through its markedly conservative profile, reflected in high frequencies for must and low frequencies for the innovative forms need to and have got to. Jamaican speakers do not align with British norms in the same way that New Zealanders seem to be doing. Note that while they even lead in the use of the innovative need to, on the whole they avoid the British have got to. The resulting profile thus resembles an American English one. For the time being, we must leave open the question of whether this similarity has come about gradually and independently or whether it reflects recent exposure to and re-orientation towards a US English norm on the part of a growing number of Jamaicans. The most intriguing explanandum in Table 4 is the frequency of need to in Jamaican English. As this form is spreading rapidly in British and American English at the moment (Mair/ Leech 2006: 326-329), the conservative explanation would be to point out that the spoken texts of ICE Jamaica were recorded in the early 2000s, that is at least ten years later than those of most other ICE corpora (except ICE Ireland). However, whether this factor is enough to account for the entire disparity must remain open. The most robust result of Table 4, on the other hand, is the solidly non-British or even North American profile of variation in the use of modals which emerges from the ICE-JA data.
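Because the corpora differ in size, the raw counts in Table 4 are easier to compare once each form is expressed as a share of the forms under comparison within its own corpus. The sketch below does this for four affirmative forms only; the restriction to these four rows, and the names `CORPORA`, `table4` and `profile`, are assumptions made for illustration and do not reproduce a calculation reported in the chapter.

```python
# Selected rows copied from Table 4; column order follows the table.
# Restricting the comparison to these four forms is an illustrative assumption.
CORPORA = ["Santa Barbara", "ICE-GB", "ICE-NZ", "ICE-IE", "ICE-India", "ICE-JA"]
table4 = {
    "must": [59, 97, 136, 118, 206, 124],
    "NEED to": [111, 51, 57, 50, 18, 156],
    "HAVE to": [448, 269, 364, 430, 585, 627],
    "HAVE got to": [12, 118, 114, 11, 4, 2],
}


def profile(table, corpora):
    """Return, per corpus, each form's percentage share of the selected forms."""
    result = {}
    for i, corpus in enumerate(corpora):
        column_total = sum(counts[i] for counts in table.values())
        result[corpus] = {
            form: round(100 * counts[i] / column_total, 1)
            for form, counts in table.items()
        }
    return result


profiles = profile(table4, CORPORA)
for corpus in CORPORA:
    print(corpus, profiles[corpus])
```

Read this way, have got to accounts for about a fifth of the selected forms in ICE-GB but for well under one per cent in ICE-JA, which is the contrast the discussion above relies on.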
This profile is only partly corroborated by searches for several other demarcators of British and American usage. British and British-influenced Englishes, for example, are known to be characterised by a preference for towards over toward, whereas the reverse is true for American English and varieties related to it. Table 5 lists some pertinent figures from a number of ICE corpora and an American reference database, namely the Corpus of Spoken Professional American English (CSPAE):

Table 5: Towards vs. toward in selected corpora

          ICE-GB  ICE-NZ  ICE-IE  ICE-JA*  ICE-India  ICE-Philippines  CSPAE
towards      311     342     253      204        273              126    124
toward         9      25       5       50          7               61    264
* Figures are based on the 470 out of 500 texts available at the time of this writing.
significances (χ²): p = 0
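The significance note under Table 5 refers to a chi-square test. As a hedged illustration of what such a test involves, the sketch below computes the Pearson chi-square statistic for the 2 × 7 table by hand. It is not necessarily the procedure used for the published figure, and the function name `chi_square` is a hypothetical choice. The critical value 16.812 is the standard tabulated value for 6 degrees of freedom at the 0.01 level.

```python
# Counts copied from Table 5; column order: ICE-GB, ICE-NZ, ICE-IE, ICE-JA,
# ICE-India, ICE-Philippines, CSPAE.
towards = [311, 342, 253, 204, 273, 126, 124]
toward = [9, 25, 5, 50, 7, 61, 264]


def chi_square(rows):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count under independence of variant and corpus.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


statistic, dof = chi_square([towards, toward])
CRITICAL_0_01_DF6 = 16.812  # tabulated critical value, 6 degrees of freedom, alpha = 0.01
print(round(statistic, 1), dof, statistic > CRITICAL_0_01_DF6)
```

The statistic exceeds the critical value by a wide margin, largely because of the CSPAE column, which is consistent with the very small p-value reported under the table.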
All historically British-influenced varieties, and even Philippine English, share the British preference for towards over toward, though the “American” variant has slightly higher frequencies in the Jamaican and Philippine corpora than in the others. Similar observations can be made for the use of gotten as a variant of the past participle of the verb get. While the form is marginal in ICE-GB, ICE-India, ICE-NZ and ICE-Ireland, at frequencies of 2, 2, 6 and 8 respectively, ICE-JA has 34 instances.

6. Contractions
The contraction of certain auxiliary verbs (e.g. he’s for he is) and of the negation particle not (e.g. isn’t for is not) are variables which are extraordinarily well suited to an approach combining corpus linguistics and sociolinguistics. As precisely definable search strings, such forms are easily retrievable from digitised text, and at the same time contractions of this type are one of the most reliable indicators of stylistic (in)formality (cf., e.g., Diller 1999, Peters 2001, Yaeger-Dor, Hall-Lew & Deckert 2002). Formality levels in the conversational texts of ICE Jamaica provide crucial evidence when it comes to determining the status of standard English in Jamaica. If the level of formality were high and if the range of observed stylistic variability were narrow,16 this would mean that the role of acrolectal English is marginal in spoken usage and that, unlike writing, where it clearly dominates, it is an extraneous or “adoptive” (Shields-Brodber 1989, 1997) standard in oral communication. The great advantage of the corpus-linguistic working environment provided by ICE is that the frequency of contractions in spontaneous speech can be compared across varieties. Thereby, contraction frequencies in ICE Great Britain, ICE Ireland and ICE New Zealand can be taken to represent the norm for uncontroversial instances of contemporary native-speaker usage in largely monolingual contexts. By contrast, ICE India illustrates the situation in a typical multilingual environment in which English serves as a prestigious and formal second language. The working hypothesis is that contraction rates will be uniformly high in native-speaker usage, because here it is English which is the default choice for the informal baseline style of face-to-face talk. Whether it is possible to have a conversation in English and remain informal is an open question in the Indian sociolinguistic context, and – probably to a lesser extent – also in the Jamaican one.
For the following experiment, all combinations of a pronominal subject and a form of the verb be in the present tense were investigated in the spontaneous-dialogue sections of ICE Great Britain, ICE New Zealand, ICE India and ICE Jamaica. The findings are thus based on text samples S1A-1 to S1A-100, that is a total of c. 200,000 words of transcribed dialogue per corpus.17 Table 6 lists the search strings in question:
Table 6: Be-contractions searched in five ICE corpora

not contracted/     not contracted/      subject-verb          negative
not negated         negated              contraction           contraction
I am                I am not             I’m (not)             I amn’t
you are             you are not          you’re (not)          you aren’t
he/ she/ it is      he/ she/ it is not   he/ she/ it’s (not)   he/ she/ it isn’t
we are              we are not           we’re (not)           we aren’t
they are            they are not         they’re (not)         they aren’t
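The four columns of Table 6 can be read as four search patterns. The sketch below expresses them as regular expressions over plain text; it is an illustration under stated assumptions, not the search procedure actually used on the ICE corpora. The names `PATTERNS` and `count_be_forms` are hypothetical, the input is assumed to be plain orthographic transcription without ICE markup, and forms such as he’s are counted even where they stand for he has, which a real study would have to separate by hand.

```python
import re

# Full (uncontracted) subject + be combinations from the first column of Table 6.
FULL = r"(?:I am|you are|(?:he|she|it) is|we are|they are)"

# One pattern per column of Table 6. Names and patterns are illustrative assumptions.
PATTERNS = {
    "not contracted, not negated": re.compile(r"\b" + FULL + r"\b(?! not\b)", re.I),
    "not contracted, negated": re.compile(r"\b" + FULL + r" not\b", re.I),
    "subject-verb contraction": re.compile(
        r"\b(?:I'm|you're|(?:he|she|it)'s|we're|they're)(?!\w)", re.I
    ),
    "negative contraction": re.compile(
        r"\b(?:I amn't|you aren't|(?:he|she|it) isn't|we aren't|they aren't)(?!\w)", re.I
    ),
}


def count_be_forms(text):
    """Count occurrences of each Table 6 category in a plain-text transcript."""
    # Normalise typographic apostrophes so that one pattern set covers both spellings.
    text = text.replace("\u2019", "'")
    return {label: len(pattern.findall(text)) for label, pattern in PATTERNS.items()}


sample = (
    "I'm not sure they are coming. He isn't here, but she is not worried "
    "and we're ready. You are late."
)
print(count_be_forms(sample))
```

As in Table 6, the subject-verb contraction category includes both negated (I’m not) and non-negated (we’re) tokens.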
Recall that our working assumption was that contraction frequencies would be uniformly high in spoken British, Irish and New Zealand English. Figures for Indian English were expected to be low. As is shown in Table 7 (see note 18), this expectation is substantially borne out. Interestingly enough, Jamaican English does not reach the very high contraction rates of the uncontroversially native-speaker corpora, but remains nevertheless much closer to them than to a clear second-language variety such as Indian English.

Table 7: Be-contractions in five ICE corpora – global frequencies19

            uncontracted   contracted   total   contraction rate in per cent
ICE-GB               232         4036    4258   94.8
ICE-NZ                90         3809    3899   97.7
ICE-IE               336         4092    4428   92.4
ICE-JA               582         3214    3796   84.7
ICE India           2297         1588    3885   40.9
significances (χ²): p = 0
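The contraction rate in Table 7 is the number of contracted forms divided by the total, expressed as a percentage. The following sketch recomputes the rates from the printed counts and also checks whether uncontracted plus contracted equals the printed total; the names `table7` and `contraction_rates` are assumptions for illustration. On the figures as printed, the ICE-GB row does not sum to its total (232 + 4036 = 4268), and the source offers no way of telling which figure is affected, so the sketch only flags the row and uses contracted/total throughout.

```python
# Counts copied from Table 7 as printed: (uncontracted, contracted, total).
table7 = {
    "ICE-GB": (232, 4036, 4258),
    "ICE-NZ": (90, 3809, 3899),
    "ICE-IE": (336, 4092, 4428),
    "ICE-JA": (582, 3214, 3796),
    "ICE-India": (2297, 1588, 3885),
}


def contraction_rates(table):
    """Return (rates in per cent, corpora whose printed counts do not sum to the total)."""
    rates = {}
    mismatches = []
    for corpus, (uncontracted, contracted, total) in table.items():
        rates[corpus] = round(100 * contracted / total, 1)
        if uncontracted + contracted != total:
            mismatches.append(corpus)
    return rates, mismatches


rates, mismatches = contraction_rates(table7)
print(rates)
print("rows whose counts do not sum to the printed total:", mismatches)
```

The recomputed rates match the printed percentages for all five corpora.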
It is, of course, possible to refine the analysis also by looking at the returns for individual pronouns and the corresponding forms of the verb be (see Appendix for figures). This more delicate analysis shows, for example, that contractions of is are significantly more common than contractions of are in Indian English, or that the form amn’t, a marginal presence in British English, is practically absent from all other varieties. In addition, the relatively low values for negator contractions (n’t) are, of course, due to the fact that the search was restricted to pronominal subjects. Such considerations notwithstanding, the general trend documented in Table 7 remains robust. In sum, the analysis shows that with regard to the variable at issue Jamaican English does not exhibit the formality profile of a typical second-language variety (Indian English), but tends towards the native ones without fully reaching their high contraction rates. Seen as a corpus, ICE-JA thus appears to present material which is very much like natively spoken English. However, this does not mean that English should be considered the native variety of each and every speaker recorded in the corpus. A promising direction for further
sociolinguistic analysis would thus be to determine the extent of inter-speaker and intra-speaker variability in the corpus material, as the somewhat “mixed” character of Jamaican English might result from the fact that the sample contains a number of speakers who have contraction rates comparable to those found in British or New Zealand English (i.e. native speakers of English who use the language across the entire formality range) and others whose profile matches that of second-language speakers (i.e. the speakers of “adoptive” English in the sense of Shields-Brodber 1989 whose natural mode of informal expression is a mesolectal variety of Jamaican Creole).

7. “New quotatives” in Jamaican English and the globalisation of vernacular features
The new quotatives go and be like – first identified as innovations in American English by Butters (1980, 1982)20 – are among the fastest-spreading grammatical constructions in English today. In particular, be like is not only spreading in the variety in which it originated, American English, but has been reported as an innovation in Australian English, Canadian English and Newfoundland English, British (=English) and Scottish English (see Barbieri 2005: 223 and Buchstaller 2006b: 363 for a review of pertinent research). Thus, its presence in ICE Jamaica does not come as a surprise.

(20) I don’t know what they were thinking some chicken stuff and fish and whatever it is with uhm what’s that dressing vegetable dressing on the chicken and Okay well who eat that I’m like hello we are black people from the Caribbean please no white people here You know No maybe white people would eat stuff like that
(21) You know she knows nothing about these people. Me fraid you know the man a call her she run gon go go take picture So I’m like where’s the picture we thought it was a instant thing. She’s like no him have it.
Note that while the direct conversations of ICE Jamaica contain c. 50 clear instances of quotative be like, quotative go seems to be absent from the data. There is, however, one informal quotation-introducing device which is in competition with be like, namely Jamaican Creole mi say, him say etc. As is not surprising in such a case of rapid change in progress, the use of be like is influenced by diverse extralinguistic and structural factors “such as age and sex of the speaker […] grammatical person of the subject, discourse function of the quotation and tense” (Barbieri 2005: 223) and – the point of Barbieri’s (2005) own paper – register. Summarising the results of previous research on the new quotatives, Buchstaller reports that “a number of studies have suggested that be like might eventually push out go” and that “U.S. respondents associate quotative be like […] with younger speakers and women. It also triggers a range
of associations with personality traits, many of which can be subsumed in the category ‘social attractiveness’, or solidarity traits” (2006b: 363). Buchstaller subsequently investigates the use of and attitudes towards the new quotatives in British English, focussing specifically on the question of whether the adoption of a new form also implies the adoption of the functional and attitudinal indexicality associated with it in the variety in which it originated. She concludes: […] that if be like has been imported from the U.S., speakers in the British Isles have not simply passively adopted the social attitudes attached to it. Rather, the adoption of global resources is a much more agentive process, whereby travelling features are actively re-evaluated and manipulated on the perceptual level. As linguistic resources are borrowed across the Atlantic, they may lose or gain associations during the process or, alternatively, already existing percepts may be re-analyzed and re-evaluated. Consequently, for speakers of the borrowing variety, new associations interact with possibly secondhand ones and aspects of existing meaning can become more or less salient during the process. (2006b: 375) There is reason to assume that similar processes of dissociation and “re-allocation of attitudes” (Meyerhoff & Niedzielski 2003) are at work in the spread of be like in Jamaican English. What is in line with many observations made on varieties of English spoken outside Jamaica is the concentration of be like among younger female speakers: of the ca. 50 instances collected, all but 5 examples are from speakers younger than 25 years, and only three are produced by males (by two different speakers, both in the 26-45 age bracket). However, what is sociolinguistically unique about the linguistic situation in Jamaica is that the strongest non-standard and informal competitor of quotative like is not go, but Jamaican Creole quotation-introducers such as mi say/ dem say/ him say. 
This means that there are two different ways of being informal, an international one imported recently from informal American English and a local one, from Jamaican Creole, with a long historical standing. Note finally, that the sheer frequency with which be like is attested in the Jamaican data is striking. Although normalised frequencies per million words are difficult to reconstruct from Buchstaller’s analysis,21 it is safe to say that quotative be like seems, somewhat surprisingly, to be as common in Jamaican English as in American English, the variety it originated in a mere four decades ago. Given its rapid recent spread in so many varieties of English, we would, of course, have to ask whether be like would not even be more frequent in more recent British and American material. As for the ICE working environment in general, the lesson taught by this exploratory look at quotatives in Jamaican English is that the various corpora are clearly not comparable to a sufficient degree in this case of rapid change in progress. When ICE-GB was sampled, quotative like had barely reached Britain and is therefore not attested. Quotative like is amply attested in ICE Ireland,
Christian Mair
whose spoken texts were recorded 10 to 15 years later, at roughly the same time as those of ICE Jamaica. Quotative like is by and large unattested in all those second-language ICE corpora (East Africa, India, Singapore, Philippines) which were collected in the time in between. But whether this is a sign that second-language varieties resist this particular innovation more than natively spoken ones is uncertain; it may well be that the spoken texts for these corpora were sampled too early.

8. Conclusion
The study of selected types of local or non-standard usage in ICE Jamaica shows very clearly what corpus linguistics and sociolinguistics have in common, namely an interest in linguistic variation. However, it also shows very clearly what still sets them apart, namely their different analytical perspectives. In Barbieri’s terms, corpus linguists start out from charting the “frequency patterns of use” observed in their data, whereas sociolinguists working in the variationist paradigm first define the variable and then aim to identify “the contribution of particular factors to the probability of the choice” between particular variants (2005: 224). The two perspectives are by no means incompatible, but the different emphases they engender for research practice need to be spelled out. First, while both corpus linguistics and sociolinguistics generally use quantification and statistics, their approaches differ. A typical corpus-linguistic frequency measure, for example, is absolute or normalised frequency (say, per million words). Sociolinguists, on the other hand, give (and tend to think in) group-specific realisation rates (e.g. per cent of realisation of a variable as variant X). In many sociolinguistic studies (including Buchstaller’s study of quotatives reported above), absolute corpus size is thus difficult to infer, which may make comparison to corpus-based studies rather difficult. Secondly, corpus data is usually in the public domain, which allows easy replication of studies and, ideally, cumulative progress as a research community builds up around a corpus and profits from and builds on one another’s work. The raw data of sociolinguistic studies, by contrast, is rarely made available to the general academic public. The starting point for most corpus analysis is concordancing.
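The difference between the two kinds of measure can be made concrete with a small calculation. What follows is a minimal sketch in Python and not part of the original study: the function names are hypothetical, and the inputs are the approximate figures quoted in this chapter (ca. 50 instances of be like in c. 200,000 words of ICE Jamaica conversation; Buchstaller’s 93 instances in roughly a million words of British speech and 121 in roughly 750,000 words of Switchboard; 42 non-inverted questions among the 191 independent questions in the inspected sample of wh- and how questions).

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Normalised frequency: hits per million running words."""
    return hits * 1_000_000 / corpus_words


def realisation_rate(variant_tokens: int, variable_tokens: int) -> float:
    """Per cent of all tokens of a variable realised as one variant."""
    return variant_tokens * 100 / variable_tokens


# Approximate counts and corpus sizes quoted in the chapter; treating them
# as exact is an assumption made for the sake of the illustration.
be_like = {
    "British speech (Buchstaller)": (93, 1_000_000),
    "Switchboard portion (Buchstaller)": (121, 750_000),
    "ICE Jamaica conversations": (50, 200_000),
}

for corpus, (hits, words) in be_like.items():
    print(f"{corpus}: {per_million(hits, words):.0f} per million words")

# A variationist-style measure needs the number of tokens of the variable,
# not the size of the corpus: 42 of 191 independent questions lacked inversion.
print(f"non-inverted questions: {realisation_rate(42, 191):.0f} per cent")
```

The first measure needs only a hit count and a word count, whereas the second presupposes that every token of the variable has been identified. This is why a study reporting only realisation rates cannot be converted into normalised frequencies without further information.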
Quantification chiefly focuses on establishing collocational patterns, the influence of structural context on the choice of variants, and on corpus-internal variability by register or genre. The chief aim of variationist sociolinguistics, on the other hand, remains finding out about “the correlation of dependent linguistic variables with independent social variables [which] has been at the heart of sociolinguistics since its inception more than three decades ago” (Chambers 1995: Preface). Of course, this does not mean that the linguistic context in which a variable occurs is irrelevant for sociolinguists. Any decent variationist study of word-final consonant-cluster deletion or some such classic variable would distinguish between utterance-final, pre-consonantal or pre-vocalic environments
Corpus linguistics meets sociolinguistics
at least. It merely means that such aspects will usually not remain the major preoccupation of a study. Similarly, a corpus linguist is free in principle to access any ICE corpus as a stratified sample of speech produced by older and younger speakers, male and female speakers, and so on. In practice, though, this approach is not supported by standard corpus-analytical software tools and may therefore tend to be avoided. And if one is willing to shoulder the necessary work, one may still be disappointed, as the sociolinguistic information in many a file-header may be very generic (“male, English”) or even missing for many a participant in a conversation. Among many hundreds of individual informants contributing to the spoken-demographic portions of the BNC there is “‘Rudy,’ 61, West Indian, warehouse manager, social class C1 (junior management, supervisory or clerical)”, who has contributed his c. 10,000 words to text KCP, but it is a long way to get to him. To turn from these general considerations to the specific sociolinguistic constellation investigated in the present paper: what does ICE-Jamaica tell us about the current state of development of the emerging standard of English usage in Jamaica? As was pointed out above, this emerging standard is developing in a pull among three competing orientations: British, American, and local (that is, in contact with Jamaican Creole). In addition, it is a legitimate question to ask whether Jamaican English shares features with other New Englishes with which it has not been in direct contact (in the spirit of the “Angloversals” debate reported on in Section 4). The following tabular survey shows how this pull plays itself out with regard to the five variables investigated here. They are displayed along the vertical axis of the Table.
The horizontal axis lists historical and current contact influences and orientations and, in the rightmost column, possible similarities to other New Englishes which are not motivated by direct contact. A “+” sign indicates similarity between Jamaican usage and the norm in question; a “-” stands for distance to it:

Table 8: Competing orientations in the Jamaican standard

Orientations (columns): GB; US; local/Jam. Creole; “Angloversals”
Variables (rows): people/persons; +/- inversion in main-clause questions; modals of obligation and necessity; contractions; quotative be like
Cell values in extraction order (row and column alignment not preserved): - ; + + ; - ; + ; - ; + ; - ; - ; - ; + ; n.a. - ; -
On the evidence of this partial survey (restricted as it is to five variables), there is little reason to continue including Jamaican English among British-influenced post-colonial standards such as Australian English or New Zealand English.
Jamaican Creole, mesolectal informal English and even American English seem to have become more important contact varieties today than the now remote former colonial British standard. In addition, there are limited parallels between Jamaican English and second-language standards such as Indian English, which show that English tends to be restricted to formal domains of use in spoken communication. While many speakers of educated Jamaican English continue to believe in the essentially “British” nature of their standard, hard evidence for such a view seems to be disappearing outside the relatively firmly regulated area of spelling. Such is the state of linguistic development 47 years after Jamaican independence in 1962.

Notes

1
This research is supported by external funding from the Deutsche Forschungsgemeinschaft (DFG MA 1652/4 “Educated Spoken English in Jamaica: Phonetische/ lexikogrammatische Normierung und soziolinguistischer Status”), which is gratefully acknowledged. In addition I would like to thank Dr. Dagmar Deuber, Freiburg, for her insightful comments on a previous version of this paper. Dr. Birgit Waibel and Luminiţa-Irinel Traşcă have helped with the corpus counts.
2
To describe the successive extensions of scope in sociolinguistics over the past four decades, Penelope Eckert has recently used the metaphor of three “waves.” The first wave is classic variationism as exemplified in Labov’s 1966 Social Stratification of English in New York City, exploring the “big picture” by establishing quantitative correlations between independent social variables and dependent linguistic variables. Like the first wave, the second wave of sociolinguistic studies is focussed on the use of a given variety by its community of speakers, but uses ethnographic methods to gain a deeper understanding of how variation operates in and for a community. The third and most recent wave goes beyond the study of variables in localised speech communities and studies variation “not as a reflection of social place, but as a resource for the construction of social meaning” (Eckert 2005: 1). This means that the focus of interest shifts from the linguistic variable, chosen frequently because of its intrinsic linguistic interest – for example as a presumed instance of change in progress –, to the study of communicative styles which are not necessarily localisable any longer.
3
For further details see, e.g., Greenbaum, ed. 1996 or the project’s homepage at http://www.ucl.ac.uk/english-usage/ice/.
4
The following ICE corpora are publicly available: Great Britain, New Zealand, East Africa, India, Hong Kong, Ireland, Singapore, Philippines. ICE Australia is completed and can be consulted on request through a server at Macquarie University. Work on ICE Jamaica is substantially
complete, and publication is imminent. Data collection is still in progress for ICE Canada, Fiji, Malaysia, South Africa, Sri Lanka, USA. Cf. http://www.ucl.ac.uk/english-usage/ice/index.htm. Further projects, such as, for example, a corpus documenting Maltese English, are in the planning stage.
5
Cf. Buchstaller 2006b: 362, who writes that in such “cases of borrowing, the stereotypes attached to linguistic items are not simply taken over along with the surface item. Rather, the adoption of global resources is a more agentive process, whereby attitudes are re-evaluated and re-created by speakers of the borrowing variety.”
6
See the entries for smadi, s’madi and somebody in Cassidy/ LePage 1980.
7
ICE-East Africa was not included in this comparison, because it contains an insufficient amount of spontaneous speech.
8
The exceptionally high figure for people in ICE Hong Kong is a matter which cannot be pursued here. It is partly due to an apparent preference in this variety for analytical expressions such as Hong Kong people (rather than Hong Kongers) or Chinese people (rather than the Chinese).
9
The total returns were 19, from which four irrelevant hits were discarded. For comparison, ICE India yielded 15 returns, from which 3 turned out to be genuine. The figures obtained for help* were 49, 107 and 89 in ICE-GB, ICE-JA and ICE-India respectively.
10
The process of grammaticalisation has been completed in Nigerian Pidgin, for example (Dagmar Deuber, personal communication).
11
Joseph Farquharson (personal communication) points out that for him as a native speaker there is an assumption that which person, unlike who, implies that there is a known group from which an individual is selected, while who makes no such assumption.
12
In fact, there was one clear instance of the opposite of what was looked for: inversion in an apparently dependent clause: “Well we’re heading to how d’you get into working with disabled people.” (S1A 4)
13
The following procedures were adopted. To identify the relevant questions from the corpus a search was undertaken for all instances of wh* and how in S1A 1 to S1A 90, which yielded 5246 returns for manual post-editing. The extrapolated frequencies and percentages in Table 3 are based on an inspection of 400 instances of wh* and 100 of how (i.e. a total of 500 cases). 191 (= 143 + 48) of the 500 concordance hits were identified as syntactically independent questions. Of these 42 (= 33 + 9) did not display inversion. From among the borderline cases, I excluded why (not) + inf.
questions, what if questions, echo-questions (e.g. you do what?), and verbless or incomplete wh-/how questions (i.e. Why?, How?, What else?, What about + NP?, etc.). Questions in passages of direct speech, on the other hand, were treated as syntactically independent and therefore included (e.g. Sometimes even if you just ring the phone one time and say hi how are you doing).
14
This analysis presupposes that the use of the operator do is normal with have to in questions and negations in Jamaican English, and that an older British variant – what has that to do with it, though? – is no longer relevant. If it was, the example would have to be re-classified with (12) to (14) below.
15
Standard significance tests are not available for this table, as there are too many cells with less than five members. Although hafi, “have to,” is common in Jamaican Creole, it is difficult to gauge the extent of “substrate” influence in Jamaican English here, as similarly high values can be observed in Indian English. The presence of hafi in Jamaican Creole, on the other hand, may work as an impediment to the spread of have got to/ gotta.
16
Shields-Brodber 1989 observes a tendency towards “monostylistic” usage among contemporary habitual users of English in Jamaica.
17
In addition to the 90 samples of direct conversations analysed in sections 3 and 4, the investigation thus includes the 10 samples of telephone conversations.
18
Table 7 gives global frequencies; for a detailed break-down of individual results from five corpora see Appendix.
19
Note that only those uncontracted forms were counted which could in theory have been contracted. Thus, I am would have been counted in I am here, but not in the short affirmative answer Yes, I am. For the sake of completeness it should be added that in addition to the forms listed in Table 2 these figures contain two instances of ain’t from ICE-NZ and one from ICE Jamaica.
20
For important follow-up studies on the phenomenon in American English see Blyth, Recktenwald & Wang 1990, Romaine & Lange 1991, or Barbieri 2005.
21
Buchstaller (2006a: 8-9) reports finding 93 instances in a corpus of British English spontaneous speech comprising roughly a million words and 121 in the portion of the American Switchboard Corpus which she used (which apparently is about a quarter of the total 3 million words). As the conversations from ICE Jamaica make up only c. 200,000 words, the
normalised frequency (per million words) for this variety would have to be estimated at about 250.

References

Allsopp, R. (1996), Dictionary of Caribbean English usage. Oxford: OUP.
Barbieri, F. (2005), ‘Quotative use in American English: a corpus-based cross-register comparison’, Journal of English Linguistics, 33: 222-256.
Beal, J., K.P. Corrigan and H. Moisl (eds.) (2007), Creating and digitizing language corpora. Vol 1: Synchronic databases. Basingstoke: Palgrave Macmillan.
Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (ed.) (1994), Sociolinguistic perspectives on register. New York: OUP.
Biber, D. (2003), ‘Compressed noun-phrase structures in newspaper discourse: the competing demands of popularization vs. economy’, in: J. Aitchison and D.M. Lewis (eds.) New media language. London: Routledge. 169-181.
Biber, D. and E. Finegan (1989), ‘Drift and evolution of English style: a history of three genres’, Language, 65: 487-517.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), The Longman grammar of spoken and written English. London: Longman.
Blyth, C., S. Recktenwald and J. Wang (1990), ‘I’m like, ‘Say what?!’: a new quotative in American oral narrative’, American Speech, 65: 215-227.
Buchstaller, I. (2006a), ‘Diagnostics of age-graded linguistic behaviour: the case of the quotative system’, Journal of Sociolinguistics, 10: 3-30.
Buchstaller, I. (2006b), ‘Social stereotypes, personality traits and regional perception displaced: attitudes towards the ‘new’ quotatives’, Journal of Sociolinguistics, 10: 362-381.
Butters, R. (1980), ‘Narrative Go ‘Say’’, American Speech, 55: 304-07.
Butters, R. (1982), ‘Editor’s note [on be like ‘think’]’, American Speech, 57: 149.
Cassidy, F.G. and R.B. LePage (1980), Dictionary of Jamaican English. Cambridge: CUP.
Diller, H.-J. (1999), ‘Some thoughts on the stylistic function of contractions in written texts’, in: U. Carls and P. Lucko (eds.) Form, function and variation in English. Frankfurt: Lang. 235-245.
Eckert, P. (2005), ‘Variation, convention, and social meaning’ [Presidential Address, 2005 LSA Meeting]. http://www.stanford.edu/~eckert/EckertLSA2005.pdf
Görlach, M. (1987), ‘Colonial lag? The alleged conservative character of American English and other ‘colonial’ varieties’, English World-Wide, 8: 41-60.
Greenbaum, S. (ed.) (1996), Comparing English worldwide: the International Corpus of English. Oxford: Clarendon Press.
Irvine, A. (2004), ‘A good command of the English language: phonological variation in the Jamaican acrolect’, Journal of Pidgin and Creole Studies, 19: 41-76.
Johansson, S. (forthcoming), ‘Interpreting textual distribution: social and situational factors’. Arbeiten aus Anglistik und Amerikanistik 34.
Heine, B. and T. Kuteva (2002), World lexicon of grammaticalization. Cambridge: CUP.
Hundt, M. (2009), ‘Colonial lag, colonial innovation, or simply language change?’ in: G. Rohdenburg and J. Schlüter (eds.) One language, two grammars: morphosyntactic differences between British and American English. Cambridge: CUP. 13-37.
Kortmann, B. and B. Szmrecsanyi (2004), ‘Global synopsis: morphological and syntactic variation in English’, in: B. Kortmann et al. (eds.) A handbook of varieties of English. Vol II: Morphology and syntax. Berlin: Mouton de Gruyter. 1142-1202.
Machin, D. and T. van Leeuwen (2003), ‘Global schemas and local discourses in Cosmopolitan’, Journal of Sociolinguistics, 7: 493-512.
Mair, C. (2002), ‘Creolisms in an emerging standard: written English in Jamaica’, English World-Wide, 23: 31-58.
Mair, C. (2006), Twentieth-century English: history, variation, standardization. Cambridge: CUP.
Mair, C. and G. Leech (2006), ‘Current changes’, in: B. Aarts and A. McMahon (eds.) The handbook of English linguistics. Oxford: Blackwell. 318-342.
Meyer, C. (2004), ‘Can you really study language variation in linguistic corpora?’ American Speech, 79: 339-355.
Meyerhoff, M. and N. Niedzielski (2003), ‘The globalization of vernacular variation’, Journal of Sociolinguistics, 7: 534-555.
Nevalainen, T. and H. Raumolin-Brunberg (2003), Historical sociolinguistics: language change in Tudor and Stuart England. London: Longman.
Nevalainen, T. (ed.) (2006), Types of variation: diachronic, dialectal and typological interfaces. Amsterdam: Benjamins.
Patrick, P.L. (1999), Urban Jamaican Creole: variation in the mesolect. Amsterdam: Benjamins.
Pennycook, A. (2003), ‘Global Englishes, Rip Slyme, and performativity’, Journal of Sociolinguistics, 7: 513-533.
Peters, P. (2001), ‘Corpus evidence on Australian style and usage’, in: D. Blair and P. Collins (eds.) English in Australia. Amsterdam: Benjamins. 163-178.
Romaine, S. and D. Lange (1991), ‘The use of like as a marker of reported speech and thought: a case of grammaticalization in progress’, American Speech, 66: 227-279.
Sand, A. (2004), ‘Shared morpho-syntactic features of contact varieties: article use’, World Englishes, 23: 281-298.
Sand, A. (forthcoming), ‘Angloversals? Shared morpho-syntactic features in contact varieties of English’, unpublished “habilitation” thesis, University of Freiburg.
Shields, K. (1989), ‘Standard English in Jamaica: A case of competing models’, English World-Wide, 10: 41-53.
Shields-Brodber, K. (1996), ‘‘Old skeleton, new skin’: the relationship between open syllable structure and consonant clusters in Jamaican English’, in: P. Christie (ed.) Caribbean Language Issues: Old and New. Kingston: UWI Press. 4-11.
Shields-Brodber, K. (1997), ‘Requiem for English in an ‘English-Speaking’ Community’, in: E. Schneider (ed.) Englishes around the World II: Caribbean, Africa, Asia, Australasia – Studies in Honour of Manfred Görlach. Amsterdam: Benjamins. 57-67.
Thurlow, C. and A. Jaworski (2003), ‘Communicating a global reach: inflight magazines as a globalizing genre in tourism’, Journal of Sociolinguistics, 7: 579-606.
Yaeger-Dror, M., L. Hall-Lew and S. Deckert (2002), ‘It’s not or isn’t it? Using large corpora to determine the influences on contraction strategies’, Language Variation and Change, 14: 79-118.

Appendix

A: Be-contractions in five ICE corpora (conversations only) – raw data

Form                    ICE-GB  ICE-NZ  ICE-IE  ICE-JA  ICE-India
I am                        25       4      39      73        275
I am not                     2       1       2      12         41
I’m                        678     505     620     732        397
I’m not                    135      88      95     150         66
I amn’t                      0       0       0       0          1

you are                     30       6      29      86        321
you are not                  1       2       1       9         27
you’re                     388     272     346     462         91
you’re not                  63      33      56      66          2
you aren’t                   2       1       0       2          0

he/she/it is               114      51     244     229        905
he/she/it is not             7       3       5      19         81
he’s/she’s/it’s           2087    2099    2234    1201        881
he’s/she’s/it’s not        208     216     231     169         87
he/she/it isn’t             26       9      13       9          2

we are                      16       9      12      60        185
we are not                   1       0       0       2         28
we’re                      147     152     113     158         22
we’re not                   11      11      14      12          2
we aren’t                    1       0       0       0          0

they are                    32      14      28      77        390
they are not                 4       0       0      15         44
they’re                    258     387     340     227         32
they’re not                 32      34      27      25          5
they aren’t                  0       0       3       0          0
B: Be-contractions in five ICE corpora (conversations only) – summary

Corpus      not contracted/   not contracted/   subject-verb   negative
            not negated       negated           contraction    contraction
ICE-GB      217               15                4007           29
ICE-NZ      84                6                 3797           12*
ICE-IE      352               8                 4076           16
ICE-JA      525               57                3202           12*
ICE-India   2076              221               1585           3

* Figure contains two and one instance of ain’t respectively.
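As a reading aid for the summary counts in Appendix B, the overall share of contracted forms per corpus can be derived as follows. This is a minimal sketch in Python: the counts are copied from the summary table, while the function name and the decision to pool subject-verb and negative contraction into a single share are assumptions made here for illustration, not a measure taken from the chapter.

```python
# Summary counts from Appendix B, in the order:
# (not contracted/not negated, not contracted/negated,
#  subject-verb contraction, negative contraction)
SUMMARY = {
    "ICE-GB": (217, 15, 4007, 29),
    "ICE-NZ": (84, 6, 3797, 12),
    "ICE-IE": (352, 8, 4076, 16),
    "ICE-JA": (525, 57, 3202, 12),
    "ICE-India": (2076, 221, 1585, 3),
}


def contracted_share(counts):
    """Per cent of all counted be-forms that are contracted in either way."""
    plain, negated, sv_contraction, neg_contraction = counts
    total = plain + negated + sv_contraction + neg_contraction
    return (sv_contraction + neg_contraction) * 100 / total


shares = {corpus: contracted_share(counts) for corpus, counts in SUMMARY.items()}

# Print the corpora from the most to the least contraction-friendly.
for corpus, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{corpus:<10}{share:6.1f}")
```

On these counts the pooled share runs from about 41 per cent in ICE-India to about 98 per cent in ICE-NZ, with ICE-JA at about 85 per cent. These percentages are computed here from the appendix and are not quoted from the chapter.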
Creating corpora from spoken legacy materials: variation and change meet corpus linguistics

Joan C. Beal
University of Sheffield

Abstract

Contrasting the aims and methodologies of corpus linguists and variationists, Charles Meyer writes that the latter ‘have been more interested in spoken language’ and ‘have tended to collect data for private use and have not generally made public their data sets’ (2006: 169). Since the advent of sociolinguistics in the 1960’s, individual scholars and research teams have been amassing recordings of spoken data, often for the purpose of investigating variation across a limited number of linguistic features. Surprisingly little of this material has, however, been made accessible to the wider community of scholars. As John Widdowson points out, ‘much of this data remains hidden and inaccessible, scattered in numerous, often obscure, repositories’ (2003: 81). What is more, these valuable legacy materials are often kept in inadequate storage facilities, and in obsolescent media, leading to the danger of them being lost forever. The Newcastle Electronic Corpus of Tyneside English (NECTE) was created with the aid of a Resource Enhancement Grant from the then AHRB with the primary objective of ‘rescuing’ legacy materials from the Tyneside Linguistic Survey collected c.1969 and creating an accessible corpus by combining these with more recently-collected data from the Phonological Variation and Change project, collected c.1994. More specifically, the resultant corpus was designed to be of use to as wide a range of end-users as possible and therefore available in a number of formats: sound, phonetic transcription, orthographic transcription and grammatical mark-up.
The challenges posed by this project, and the ways in which the project team overcame them, will be the main focus of this paper, and should provide useful pointers to anybody intending to embark on creating a corpus of spoken language, whether from legacy materials or from newly-collected data. The topics to be covered are: (i) ethical and legal issues surrounding the making accessible of data collected in an era before ethics review or the UK’s 1998 Data Protection Act; (ii) the challenges involved in gathering metadata and digitising ‘old’ audio material; (iii) standards of transcription and mark-up. Finally, there will be some discussion of plans to process other ‘legacy’ materials, and progress made towards developing common standards, as set out in Kretzschmar et al. (2006).
1. Introduction: Corpus Linguists and Sociolinguists
In his introduction to the special volume of Journal of English Linguistics devoted to papers from ICAME 2005, Charles Meyer notes that ‘although corpus linguists and variationists…have always had a shared interest in the analysis of empirical data, they have approached the analysis of variation in different ways’ (2006: 169). He goes on to contrast the approaches of corpus linguists and variationists in the following ways:
1. Whilst corpus linguists have tended to study both spoken and written language, variationists have concentrated on spoken data;
2. Corpus linguists create public corpora, whilst sociolinguists mainly collect data for private use;
3. Corpus linguists have concentrated on standard varieties, whereas sociolinguists have paid more attention to non-standard accents and dialects.
The title of the conference session whose papers appear in this issue of JEL, ‘Corpora and the Study of Regional and Social Variation’, itself indicates that there is increasing convergence between Variationists and Corpus Linguists on point 3. The availability of corpora of different national varieties of English, most notably the ICE corpora, and of regional varieties, such as the Freiburg Corpus of English Dialects (FRED), has allowed corpus linguists to turn their attention to variation and variationists to have access to large amounts of comparable data. At Sociolinguistics Symposium 15 in 2004, a workshop entitled ‘Models and Methods in the Handling of Unconventional Digital Corpora’ included fourteen contributions from a diverse group of scholars, some of whom would consider themselves corpus linguists, some socio- or historical linguists, but all of whom had developed or were developing corpora which incorporated historical, regional or social variation. The very fact that such a wide range of scholars participated in this workshop bears witness to this convergence between the disciplines.1 Point 1 is true to some extent, though some variationists have looked at corpora of written language: for instance, Sali Tagliamonte (2007) has compiled a corpus of data from instant messaging in order to analyse adolescents’ use of language online. What I would like to concentrate on in this paper is point 2: is it still true that variationists collect data for private use, and, if so, what are the obstacles to making this public?

2. ‘Hidden and Inaccessible’: the legacy of sociolinguistics
In a paper first delivered at the first UK Language Variation and Change conference in Reading (1997), but published in 2003, John Widdowson called for a corpus to be created from all the material collected by variationists during the 20th century, or at least as much as survives. He laments the fact that: much remains in often widely dispersed and inaccessible locations in departmental collections, or, we must admit to our shame, kept in inadequate storage conditions in our own offices, or even at home, gathering dust, wow and flutter, print-through and meltdown, silently shedding the hard-won sounds of twentieth-century speech in the
constantly dispersing particles of ferric oxide of an obsolescent recording system. (Widdowson 2003: 84)
Widdowson’s description of the vast amount of linguistic data languishing unloved and undiscovered would melt the hardest heart, but the idea of gathering all these into a national repository is impractical, to say the least. Issues of copyright, ownership and data-protection alone would strangle such a project at birth. Any scholar with a box of audio-tapes in the attic, perhaps recorded for a student project, needs to ask questions such as: who owns the intellectual property in them, the researcher or the university at which he or she was studying at the time? Was informed consent obtained from the speakers recorded, and is there a record of this? Did this consent include the recording being made available to other researchers? Did the World Wide Web even exist when the recordings were made? Is there a record of the speakers’ names and addresses from which they could be contacted in order to obtain consent retrospectively? As I hope to demonstrate, these problems are not insuperable, and for recently-collected data the requirement of the major research councils that data from funded projects be deposited with Qualidata or AHDS will protect the legacy for future researchers2, but when dealing with legacy materials I would argue that it is better to start from the bottom up, dealing with individual collections whose provenance is known, rather than attempting the mass rescue advocated by Widdowson.

3. A Case Study: The Newcastle Electronic Corpus of Tyneside English

3.1 Overview
The Newcastle Electronic Corpus of Tyneside English (NECTE)3 can be described as a legacy corpus in that it brings together materials that had been collected for two sociolinguistic projects conducted in Tyneside, North-east England, at the beginning and the end of the second half of the 20th century. These were (i) the Tyneside Linguistic Survey (TLS), collected in 1969 in Gateshead on the South bank of the River Tyne and Newcastle on the North bank, and (ii) the Phonological Variation and Change (PVC) project, collected in 1994 in Newcastle. The aim of the NECTE project team was to create an accessible database which would make the materials available to as wide a range of users as possible, and which would be, as far as this is possible, ‘future proof’. NECTE is in no sense a ‘balanced’ corpus like the BNC: it simply preserves and makes available the data that we inherited. In the case of the more recent of the two sub-corpora, this is less problematic, in that the research design of the ESRC-funded PVC project required a balanced sample, and the data, already digitised and properly stored, did not need to be rescued. The TLS materials are another story, exemplifying Widdowson’s notion of ‘hidden and inaccessible’ data.
The aims and methodology of the TLS project are outlined in Strang (1968) and Pellowe, Nixon, Strang & McNeany (1972). The plan was to conduct loosely-structured interviews with 150 informants drawn from a stratified random sample of Gateshead. A grid was drawn over a map of Gateshead and equal numbers of informants were contacted from within each square on the grid. We are lucky in that a single individual conducted all the interviews - Vincent McNeany - as we have learned within sociolinguistics that the kind of data produced often depends very much on who is collecting the data. Different interviewers can potentially produce different kinds of data and this is not what you would want if you are trying to compare speakers with one another. McNeany was a postgraduate student at the time, but had been born and raised in the community from which he was collecting the data, and still lived there at the time of the project. He had a local accent and was able to put participants at their ease, often referring to shared experiences. The interviews were recorded onto reel-to-reel tapes, 103 of which remain, of which 3 are badly damaged. The whereabouts of the remaining tapes are, at the time of writing, unknown. The TLS team also set out to interview a matching number of informants from Newcastle, but, sadly, none of these recordings have ever been found. John Pellowe, the Principal Investigator on the TLS project, left Newcastle in 1980. Thereafter, the only published work based on the TLS material was Jones-Sargent (1983), though the data was occasionally used by individual researchers. I remained aware of its existence and whereabouts because, during the period between 1977 and 2001 when I was employed by the University of Newcastle, I was frequently asked for samples of ‘traditional’ Tyneside speech, and, with a small legacy from an alumnus, had one recording transcribed and transferred to audiocassette for this purpose. 
The majority of the recordings and other materials remained in storage in what is now the School of English Literature, Language and Linguistics at Newcastle University. By ‘storage’, I am not referring to controlled archival conditions, but to boxes in cupboards, not exactly ‘hidden and inaccessible’ but at the very least in danger of deterioration. Some came to light only after our project began: John Local, who had worked on the TLS project as a graduate student, but subsequently took up a post at the University of York, brought in a number of recordings which he had taken with him, and alerted us to the fact that others had been deposited with the British Library. There may, for all we know, be others ‘out there’. In 1994, I began the resurrection of the project with a small grant from the Catherine Cookson Foundation, a charitable trust financed by the eponymous author of historical romances. This involved transferring the original reel-to-reel materials onto audio-cassettes: without this intervention, much of the corpus would today be unusable. As it happens, this transfer to what has now become an obsolescent medium happened not a moment too soon. By the time we were able to digitise the recordings, some of the reel-to-reel tapes had deteriorated so much that we had to digitise from the audio-cassette copies. We subsequently learned that the shelf-life of reel-to-reel analogue tapes is estimated at about 25 years.
Having thus ‘rescued’ the data, the NECTE project team faced a number of challenges. The following sections will outline the nature of these challenges and provide an account of the NECTE team’s response to each of them in turn.
3.2 Challenge 1: Ethics and ‘informed consent’
To comply with the ethical review procedures of the AHRC, and of our own universities, the NECTE project team had to be able to demonstrate that the subjects of both the TLS and PVC had given their informed consent to be recorded, and, more importantly given that the whole purpose of the NECTE project was to make this data more widely available, that they agreed to the recordings being accessed by other researchers. In the case of the PVC project, this was unproblematic, since it was conducted under the auspices of the ESRC, and in compliance with the 1984 Data Protection Act. However, the TLS researchers in 1969 had no Data Protection Act to comply with, and there were, of course, no university ethics committees. However, the SSRC, precursor to the ESRC, even at this early stage, had an ethics policy in place, and we were fortunate enough to be able to recover evidence that the subjects had indeed given informed consent to being recorded, and to the recordings being made available to future researchers. A letter to subjects was found which stated ‘The results of the survey will in due course be published, but no resident who has helped by talking in this way will be referred to in such a way that they could be identified’ and which was signed by Barbara Strang, Professor of English Language and General Linguistics, University of Newcastle upon Tyne. Of course, these subjects could have had no idea that there would one day be such a thing as the World Wide Web, and that the recordings might be available to anybody in the world at the click of a mouse. This creates something of a grey area: the 1969 agreement guarantees anonymity, but is a voice ever truly anonymous? 
From the outset of the project, we were aware of the importance of taking advice from the Arts and Humanities Data Service, but it also became apparent that we were breaking new ground, and we were subsequently invited to give a paper on the legal and ethical issues involved at an AHDS one-day course on copyright and data-protection issues in 2003. We also took advice from Newcastle University’s Data Protection Officer. Although compliance with the Data Protection Act (DPA) is essential, where the material is older and the subjects no longer alive, it may be necessary to take a more pragmatic view. The ‘Sounds Familiar’ website at the British Library which, in connection with the BBC Voices project, has made some of the recordings from the Survey of English Dialects (SED) available, could not have got off the ground had a strict view of data protection been taken. There is no official record of consent for publishing the recordings from the SED informants themselves and any attempt at securing these retrospectively would have been impossible given that none of the speakers is still alive. It was felt that sufficient time had elapsed to consider making the recordings more widely available. Much consideration was given to the close relationships that the fieldworkers developed with the informants - and there is a great deal of reference in the SED peripheral literature
to the pride the informants felt in being asked to take part in the survey. It was felt that using extracts (sympathetically selected so that no individual would be compromised in any way) would be appropriate. In any case, the informants at the time were all aware of, and comfortable with, the idea that their responses would be published (e.g. in Orton & Halliday (1962)), used in lectures, talks etc. and even occasionally broadcast on the BBC. The only condition was that the recordings should be streamed and therefore not downloadable. In the light of the numerous responses that the BL have had from descendants of the original informants they feel it was indeed the right decision – they have had contact with a number of people and been able to supply copies of the recordings for their family archives, for instance.4
The TLS subjects had been promised anonymity. To achieve this, the NECTE project removed all names from recordings and transcripts. A table with names and ID codes was created which could only be accessed by the project team, and this was securely stored. The original audio data are now stored in a safe, on two password-restricted computers, and on a computer in a locked archive with access restricted to the NECTE research team and legitimate associated scholars. Because the free-wheeling nature of the TLS interviews meant that subjects spoke about matters considered ‘sensitive’ under the 1998 Data Protection Act (health, religion, politics, trade union membership), and because some were minors at the time of recording, it would not be acceptable to make the recordings freely available on the web. For this reason, researchers wishing to access the NECTE corpus must complete and sign a form, stating their credentials and reasons for wanting to use the material and agreeing to comply with the DPA.
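The anonymisation step described above (names removed from transcripts, with a separately stored table linking names to ID codes) can be sketched in a few lines. This is an illustration under stated assumptions rather than the NECTE team's actual procedure: the function names, the `[NAME]` placeholder, the invented speaker names and the way ID codes are numbered are all hypothetical.

```python
import re


def build_id_table(names, prefix="TLS/G"):
    """Map each speaker name to an ID code; this table would be stored separately."""
    return {name: f"{prefix}{number:03d}"
            for number, name in enumerate(sorted(names), start=1)}


def redact(text, id_table, placeholder="[NAME]"):
    """Replace every listed name in a transcript line with a neutral placeholder."""
    # Longest names first, so that a full name is replaced before any shorter
    # name that happens to be contained in it.
    for name in sorted(id_table, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, placeholder, text, flags=re.IGNORECASE)
    return text


# Invented names, for illustration only.
id_table = build_id_table(["Mary Smith", "John Brown"])
line = redact("I used to see Mary Smith down the road", id_table)
```

A list-based approach of this kind only catches names that are already in the table, so manual checking of transcripts and recordings would still be needed.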
Projects such as the SCOTS corpus (www.scottishcorpus.ac.uk), for which spoken data is deliberately collected rather than ‘rescued’, can build informed consent into the design from the beginning, and thus make their material much more widely available, but with legacy materials this is not possible, unless a difficult and lengthy process of contacting subjects is undertaken. Where subjects have died, we were advised that we would be in a ‘Catch 22’ situation: to gain the informed consent of their family would mean breaching the confidentiality of the subject. Compliance with the Data Protection Act (1998) has thus involved putting in place a number of safeguards which restrict immediate access to the NECTE materials. However, these safeguards have not made NECTE inaccessible. To access the corpus, one has to be serious and put in some effort, so it is not likely to be accessed by the casual ‘surfer’. Nevertheless it has proved useful for research and pedagogy at various levels: it has been used for research on phonology, discourse, morphology and syntax; for teaching at high school (GCE AS and A2), undergraduate and Masters levels; and by scholars in the UK, Europe, North America and China.
3.3 Challenge 2: gathering the materials
In a recent account of the NECTE corpus, Allen et al. admit that ‘as restoration and digitization efforts progressed, it became evident that only a fragment of the projected TLS corpus had survived’ (2007: 20). The information in unpublished TLS project documentation (as well as that in the public domain) did not allow the NECTE team to decide with any certainty how large the corpus originally was. We are not sure, for example, how many interviews were conducted, and the literature gives conflicting reports of 150 and 200. It is also unknown how many of the original interviews were orthographically and phonetically transcribed. Jones-Sargent (1983) used 52 (digitally-encoded) phonetic transcriptions in her computational analysis, but the TLS material includes seven electronic files that we recovered from the Oxford Text Archive, but that she did not use. As such, there were clearly more than 52 phonetic transcriptions, but was the ultimate figure 59, or were further files digitized but never passed to the OTA? The ‘legacy’ of the TLS project currently held by the NECTE project is as follows:
• 103 audio recordings, of which 3 are badly damaged. For the remaining interviews, the corresponding analogue tape is either blank or simply missing.
• 57 index card sets, all of which are complete.
• 61 digital phonetic transcription files.
• 64 digital social data files.
This is still a lot of data, but mystery surrounds the missing materials: were there ever 200 or even 150 recordings, and if so, where are the others? The TLS was innovative and ground-breaking, in many ways ahead of its time. It is difficult to get anyone under the age of 30 to understand the concept of a reel-to-reel analogue tape, but when I start talking about the fact that data for the TLS had to be input to a vast computer in the form of cards punched by a team of data processors, people of this age are astonished.
The TLS team pioneered multivariate analysis, using an early version of the cluster analysis programme, CLUSTAN5. Rather than transcribe the data into IPA, they developed a hierarchical coding system, and the research associate Vincent McNeany became so familiar with this that he transcribed straight into the code. Figure 1 shows an extract from the TLS coding system, which was preserved both in a manual and on a chart made out of old wallpaper. We were able to digitize this historical artefact for posterity. It shows the meticulous phonetic detail of the TLS transcriptions and coding.
Figure 1: The TLS coding system
The coding system involves three levels: the symbols in the boxes at the top of Figure 1 represent Overall Units (OUs), equivalent to the lexical sets used by Wells (1982) to enable comparison of different accents. The next level is that of the Putative Diasystemic Variant (PDV): these are represented by the IPA symbols in the left-hand column under each OU, and are roughly equivalent to the phonemic level of transcription. The symbols which appear to the right of each PDV are ‘states’, each representing a different phonetic variant. Each of these has a number, such that the code for any output indicates not only its precise phonetic nature but the phoneme of which it is an allophone and the lexical set in which it was used. The TLS transcriptions were hand-written on index cards like the one that appears in Figure 2.
Figure 2: TLS transcription card
Initially, from NECTE’s perspective, these electronic files appeared to be a labour- and time-saving alternative to keying in the numerical codes from the index cards. However, a peculiarity that stems from the original electronic data
entry system used by the computing staff who had input the data from the TLS team’s original index cards meant that the resulting files had to be extensively edited by members of the NECTE team when they were returned to us from the OTA. The problem arose from the way in which the five-digit codes were laid out by the TLS researchers on the index cards, as you can see in Figure 2. For reasons that are no longer clear, all the consonant codes (beginning (0294(1)) in line 4) were written on one line, and all of the vowel codes appear on the line below ((0134(1)) on line 5). When the TLS gave these index cards to the University of Newcastle data entry service, the typists entered the codes line by line, with the result that, in any given electronic line, all the consonant codes come first, followed by the vowel codes. This difficulty pervades the TLS electronic phonetic transcription files. While it had no impact on the output of the TLS team (given that they were examining codes in isolation and that phonetic environment had already been captured by their hierarchical scheme), it was highly problematic for the NECTE enhancement of the original materials. Simply to have kept this ordering would have made the phonetic representation difficult to relate to the other types of representation planned for the NECTE enhancement scheme. The TLS files were therefore edited with reference to the index cards so as to restore the correct code sequencing, and the result was proof-read for accuracy. The example in Figure 3 shows the intermediate (PDV) TLS phonetic representation, equivalent to a broad segmental phonetic IPA representation. In the corpus, each PDV segment is, however, indexed into up to 10 state variants, equivalent to a (very) detailed phonetic IPA representation.
Orthographic                Down by Clark Chapman’s
Segmental Phonetic (PDV)    dũƘn baŸ klşk Ƶæpmԥnz
Figure 3: Example of NECTE transcriptions
As already indicated in 3.1, the TLS recordings were, in the event, digitised just in time. Some of them had deteriorated considerably, and even where the sound quality was still acceptable, there were problems. The interviewer had carried an UHER portable recorder to subjects’ houses. These machines allow recording and playing at different speeds. If he thought the tape was going to run out before the end of the interview, he would simply increase the speed. This meant that the digitised recordings would change speed at random points, and the speakers would sound like the cartoon characters The Chipmunks. This had to be put right at a later stage. The original analogue recordings, both reel-to-reel and cassette versions, were first digitized at a high sampling rate; a graphic equalisation process was then applied to clarify the sound; a hiss reduction filter and a click eliminator were applied; and variations in tape recording speed were eliminated.6 Other consequences of recording in subjects’ houses include traffic noise, interruptions, and in one case a rather loud budgie in the background. Nevertheless, the recordings available on the NECTE website, whilst perhaps not
suitable for acoustic analysis, are clear enough to be comprehensible, and to bring the voices of late 1960s Gateshead to life.
3.4 Challenge 3: transcription
A more detailed account of the principles and methods we used for transcription of the NECTE corpus can be found in Beal, Corrigan, Smith and Rayson (2007). The audio content of the TLS and PVC corpora has been transcribed into British English orthographic representation, and this, too, is included in its entirety in the NECTE corpus. Two problems were encountered and, we hope, resolved in creating this representation: (i) application of English orthography to non-standard spoken English and (ii) transcription accuracy. Since NECTE makes sound files and some phonetic transcriptions available, and since the practice of representing non-standard phonology by semi-phonetic spelling has been discredited by e.g. Preston (1985, 2000), we took the principled decision to use Standard British English spelling in our orthographic transcriptions, except where the item was lexically or morphologically distinct. Thus, for example, the characteristic Tyneside pronunciation of /na:/ for SE know would be given a semi-phonetic spelling in popular representations of the dialect7, but it is transcribed as ‘know’ in NECTE. Transcribers adhered to a strict protocol, which can be found on the NECTE website. Any large-scale textual transcription project will be subject to human error so, to maximize accuracy, we conducted two correction passes on our primary transcription. These were carried out by two different members of the NECTE team who were themselves not involved in the primary transcription; the decision criterion was majority agreement.
[TLS/G052] [TLS/01] [TLS/G052] [TLS/01] [TLS/G052] [TLS/01] [TLS/G052] [TLS/01] [TLS/G052]
and eh I I lived in with my mother for not quite two year but varnigh aye and I went to lobley-hill that was my first house ah yes yes and I shifted I got an exchange to be near my mother you-know yeah {xx} in the flat oh aye well I lived in there for about oh .. eighteen or nineteen year, maybe a little bit longer I divven’t know but eh then I come over here because they were modernising the flats you see
Figure 4: Extract from a TLS transcription
Figure 4 is an example of the kind of transcription file that was produced by the NECTE transcribers. Notice that /na:/ is spelt ‘know’, but ‘divven’t’ is not
represented as Standard English ‘don’t’. This is because, in this case, the difference is morphological rather than just phonetic. In fact Heike Pichler of the University of Aberdeen has accessed the corpus to provide comparable material for her (2008) study of ‘divven’t’ in Berwick upon Tweed. Had we not decided to represent morphological alternations like this in the transcription, her task would have been much more difficult. Varnigh is a rather archaic word meaning ‘very nearly’ and, as such, is transcribed according to an agreed protocol recorded in the NECTE appendix 2, which is a lexicon of dialect terms used in the corpus and can be found at http://www.ncl.ac.uk/necte/appendix2.htm.
3.5 Challenge 4: Tagging
With regard to tagging, the challenge presented to the NECTE team was that existing tagging software had to be used and the tools in question had to encode non-standard English reliably, that is, without the need for considerable human intervention in the tagging process and/or for extensive subsequent proofreading. As was the case with transcription, I do not intend to go into too much detail here concerning the tagging of NECTE, because Nick Smith and Paul Rayson have covered this in the paper from the 2006 ICAME conference which is published as Beal, Corrigan, Smith and Rayson (2007). What I can say is that both the NECTE team and our colleagues at UCREL learned a great deal from our successful attempt to modify the CLAWS tagger for use with non-standard English. The additions to CLAWS include the following:
• pronouns: wor = ‘our’ (= possessive form of personal pronoun); tagged APPGE;
• mesel, hisself, theirself, theirselves, etc. (= reflexive personal pronoun); tagged PPX1 or PPX2;
• auxiliaries: div = a regional variant of the auxiliary do, non-3rd singular present tense; tagged VAD0.
Some of the more idiosyncratic usages in Tyneside English could simply be added to the lexicon, even though I might prefer to classify them as morphological variants.
Tyneside English is distinguished from Standard English, or at least the kind of English found in standard corpora like the BNC, by its diversity of discourse markers. In Tyneside English you can get strings of discourse markers like ‘way ye bugger man’ which together simply express surprise. Examples of discourse markers found in the NECTE corpus are wey, like, aye, well, uhhuh, huh, ah, you know, and I mean. CLAWS did not have a specific tag for these, but it proved a satisfactory solution to use the existing CLAWS tag for an interjection, UH, for these.
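The idea of adding dialect forms to a tagger's lexicon can be illustrated with a small lookup table. The sketch below is not how CLAWS works internally, and `DIALECT_LEXICON` and `pretag` are hypothetical names; it only shows the principle, using the forms and tags listed above. The assignment of PPX1 to singular and PPX2 to plural reflexives is an assumption, and ambiguous or multi-word discourse markers (like, well, you know, I mean) are deliberately left out because a single-word lookup cannot handle them.

```python
# Dialect forms and tags taken from the list above; the split between PPX1
# (singular) and PPX2 (plural) reflexives is an assumption of this sketch.
DIALECT_LEXICON = {
    "wor": "APPGE",        # possessive 'our'
    "mesel": "PPX1",       # reflexive, singular
    "hisself": "PPX1",     # reflexive, singular
    "theirselves": "PPX2",  # reflexive, plural
    "div": "VAD0",         # auxiliary 'do', non-3rd singular present
    # Discourse markers treated as interjections:
    "wey": "UH",
    "aye": "UH",
    "uhhuh": "UH",
    "huh": "UH",
    "ah": "UH",
}


def pretag(tokens, lexicon=DIALECT_LEXICON):
    """Return (token, tag) pairs; the tag is None when the lexicon has no entry."""
    return [(token, lexicon.get(token.lower())) for token in tokens]


tagged = pretag("wey I div like wor house".split())
```

Tokens that come back with `None` would be left to the general-purpose tagger, so a table of this kind supplements rather than replaces it.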
Certain forms still proved difficult to tag automatically, especially where forms in Tyneside English have different functions to the same surface form in Standard English. Examples of this are: went as past participle, as in ‘If I’d went’; give, come, seen and done as preterites; and we as first person plural object pronoun, as in ‘She sent we’. The use of forms identical to the Standard English
preterite as past participle, such as ‘If I’d went’, could in principle be caught if an auxiliary were detected before it, and preterite ‘give’ could be identified as such if a 3rd person singular pronoun preceded it, but in practice these forms proved impossible to tag automatically. However, we were pleasantly surprised by the extent to which the CLAWS tagger could be adapted to deal with this non-standard variety, and, in practice, any researcher investigating morphological or morpho-syntactic variation in Tyneside would be aware of these forms and search for them in context.
3.6 Challenge 5: ‘Future-proofing’
One of the principal aims of NECTE was to ‘future proof’ this important resource. Since I became involved in the world of archives, I have encountered a great deal of scepticism about the longevity of digital materials. When a similar collection of recordings made in Sheffield in the early 1980s, the Survey of Sheffield Usage, was digitized and made available on CD, questions were asked about the relative shelf-lives of CD versus archival audiotape. The truth is, we do not know, but digitising these audio collections gives them their best chance of survival. By depositing the digitised materials with the AHDS as well as on a secure server at the University of Newcastle, we have done the best we can to future proof them. Of course, we do not envisage these materials being left in a cupboard, virtual or real, again, and the many users to whom DVD copies of the corpus have been distributed provide further safeguards against loss. We keep a record of all these requests and so, in the event of catastrophe, could ask them to return the favour. In order to ensure that the corpus would work on all platforms and with all software applications, we encoded NECTE using Text Encoding Initiative (TEI)-conformant Extensible Markup Language (XML) syntax. XML (http://www.w3.org/XML/) aims to encourage the creation of information resources that are independent both of the specific characteristics of the computer platforms on which they reside (Macintosh versus Windows, for example), and of the software applications used to interpret them. To this end, XML provides a standard for structuring documents and document collections. TEI defines an extensive range of XML constructs as a standard for the creation of textual corpora in particular. Together, these are emerging as world standards for the encoding of digital information, and it is for this reason that NECTE adopted them.
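The platform independence claimed for XML can be illustrated with a few lines of code that run unchanged wherever a standard XML parser is available. The fragment below is a simplified, TEI-style sketch and not NECTE's actual encoding: the element layout, the speaker code and the `utterances` helper are assumptions made for illustration, and only the use of a `u` element with a `who` attribute for an utterance and a `pause` element follows general TEI practice for transcribed speech.

```python
import xml.etree.ElementTree as ET

# A simplified TEI-style fragment; the structure and the speaker code are
# illustrative assumptions, not the NECTE schema.
SAMPLE = """<text>
  <body>
    <u who="tlsg052">and I went to lobley-hill that was my first house</u>
    <u who="tlsg052">I lived in there for about oh <pause/> eighteen or nineteen year</u>
  </body>
</text>"""


def utterances(xml_string):
    """Yield (speaker, plain_text) pairs from a TEI-style transcript."""
    root = ET.fromstring(xml_string)
    for u in root.iter("u"):
        # itertext() collects the text around child elements, so inline
        # markup such as <pause/> is dropped; split/join tidies the spacing.
        text = " ".join("".join(u.itertext()).split())
        yield u.get("who"), text


plain = list(utterances(SAMPLE))
```

Because the markup is explicit, the same file can feed a concordancer, a markup-aware search engine or, as here, a routine that recovers plain text for readers who do not need the tags.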
The AHRC in fact strongly recommends that XML is used, but we were surprised to find that NECTE was the first AHRC-funded linguistic corpus to use XML. The reason for this is probably the perceived lack of ‘user-friendliness’ of XML: as we state elsewhere ‘users not familiar with these standards may find the pervasive markup tags in the NECTE files a distracting encumbrance and yearn for the good old days of plain text files’ (Allen et al. 2007: 36). Complaints about the lack of user-friendliness are perhaps not entirely unjustified. XML is a markup language that provides a standard for the structuring of documents and document collections, and, although XML-encoded documents are plain text files that can be read by humans, in general they should not be. For an XML document to be readily legible, software that can represent the structural markup in a visually-accessible way is required. XML-aware
software visualization and analysis tools are gradually becoming available. The Oxford University Computing Service’s Xaira system, for instance, is ‘a general purpose XML search engine, which will operate on any corpus of well-formed XML documents. It is, however, best used with TEI-conformant documents’ (http://www.oucs.ox.ac.uk/rts/xaira/). Nicolas Ballier of the University of Paris 13 has successfully used Xaira with NECTE. Mike Scott has reported to us that, with minimal adaptation, he has been able to use NECTE with WordSmith, and Anita Auer has been able to remove the mark-up to present the files to MA students as more user-friendly files for small-scale analysis projects. NECTE is thus fulfilling our aim of making available a corpus which can be used on a variety of platforms and with a variety of analysis tools.
4. Next Steps
I hope that this paper has demonstrated that, whilst the mass rescue envisaged by John Widdowson (2003) may not be feasible, we should not give up hope of creating useful corpora from legacy materials. The NECTE team learned a great deal from colleagues in both sociolinguistics and corpus linguistics in the course of the project, and we hope that our corpus will provide a model for future ‘rescue’ operations which would likewise be informed by corpus linguistics. The Survey of Sheffield Usage, held in the Archives of Cultural Tradition at the University of Sheffield, has been partially digitised and transcribed according to the principles outlined in 3.4 (note 8), and I hope to produce an accessible Corpus of Sheffield Usage in due course. The networking opportunities offered by events such as the ICAME conferences have led to a group of researchers working towards agreement on common methods for producing corpora for regional and social analysis of languages and varieties (Kretzschmar et al. 2006). Perhaps the bleak future predicted by Widdowson can be avoided, after all.
Notes
1
The papers from this workshop, along with invited contributions from scholars who were not able to attend but had developed similar corpora, have been published in a two volume collection: Creating and Digitizing Language Corpora (eds. Beal, Corrigan and Moisl 2007). For details of the workshop see http://www.ncl.ac.uk/ss15/panels/
2
This information was correct at the time of the ICAME conference, but, shortly afterwards, the AHRC released the news that they were no longer able to finance AHDS.
3
This project was financed by Resource Enhancement Grant AHRB RE11776 from what was then the Arts and Humanities Research Board
(now AHRC), Principal Investigator K.P. Corrigan. The project website is at www.ncl.ac.uk/necte
4
Thanks to Jonnie Robinson, Lead Content Specialist: Sociolinguistics and Education, Social Sciences Collections and Research at the British Library, for this information. The websites can be viewed at http://www.bl.uk/learning/langlit/sounds and http://www.bbc.co.uk/voices.
5
Updated versions of CLUSTAN have since been successfully applied in a wide range of disciplines: see http://www.clustan.com/
6
Thanks to Jonathan Marshall, now at the University of Gloucester, for carrying out this essential restoration work.
7
See Beal (2000) for further discussion of orthographic representation of Tyneside speech in popular literature.
8
I acknowledge the assistance of the British Academy in providing a Small Grant to finance transcription.
References
Allen, W., J.C. Beal, K.P. Corrigan, W. Maguire and H.L. Moisl (2007), ‘A linguistic time capsule: the Newcastle Electronic Corpus of Tyneside English’, in: Beal, J.C., K.P. Corrigan and H.L. Moisl (eds.), Creating and Digitizing Language Corpora, volume 2: Diachronic Databases, Basingstoke: Palgrave Macmillan. 16-48.
Beal, J.C. (2000), ‘From Geordie Ridley to Viz: Popular Literature in Tyneside English’, Language and Literature, 9, 4: 343-359.
Beal, J.C., K.P. Corrigan and H.L. Moisl (eds.) (2007), Creating and Digitizing Language Corpora, volume 1: Synchronic Databases, volume 2: Diachronic Databases, Basingstoke: Palgrave Macmillan.
Beal, J.C., K.P. Corrigan, N. Smith and P. Rayson (2007), ‘Writing the vernacular: Transcribing and tagging the Newcastle Electronic Corpus of Tyneside English’, Studies in Variation, Contact and Change, 1. http://www.helsinki.fi/varieng/journal/volumes/01/beal_et_al
Jones-Sargent, V. (1983), Tyne Bytes. A computerised sociolinguistic study of Tyneside, Frankfurt am Main: Peter Lang.
Kretzschmar, W.A., J.C. Beal, J. Anderson, K.P. Corrigan, L. Opas-Hänninen and B. Plichta (2006), ‘Collaboration on Corpora for Regional and Social Analysis’, Journal of English Linguistics, 34, 3: 172-205.
Meyer, C.F. (2006), ‘Editor’s Note’, Journal of English Linguistics, 34, 3: 169-171.
Orton, H. and W.J. Halliday (eds.) (1962), Survey of English Dialects by Harold Orton and Eugen Dieth. B, The Basic Material, Vol. 1, The Six Northern Counties and the Isle of Man, Leeds: E.J. Arnold for the University of Leeds.
Pichler, H. (2008), A qualitative-quantitative analysis of negative auxiliaries in a northern English dialect: I DON'T KNOW and I DON'T THINK, innit?, University of Aberdeen PhD Thesis.
Preston, D.R. (1985), ‘The Li’l Abner syndrome: Written representations of speech’, American Speech, 60(4): 328-336.
Preston, D.R. (2000), ‘Mowr and mowr bayud spellin: Confessions of a sociolinguist’, Journal of Sociolinguistics, 4(4): 614-621.
Tagliamonte, S. (2007), ‘Corpora from the virtual world: teenagers, instant messaging and language change’, paper presented at ICAME 28, Stratford upon Avon.
Widdowson, J.D.A. (2003), ‘Hidden depths: Exploiting archival resources of spoken English’, Lore and Language, 17(1&2): 81-92.
Discourse linguistics meets corpus linguistics: theoretical and methodological issues in the troubled relationship
Tuija Virtanen
Åbo Akademi University, Finland
Abstract
Discourse linguistics and corpus linguistics have an uneasy relationship because of their inherent ontological and epistemological differences. Yet it is a steady relationship going back well into corpus-linguistic history, and one that both fields are highly motivated to maintain despite its many hazards and challenges. Singling out five complementary dimensions of discourse, understood here in a broad sense, this paper shows that not all of them will be equally accessible to users of corpus methods. Two fundamental aspects of discourse are identified as particularly challenging to corpus-linguistic enquiry, i.e. the distinction between product- and process-oriented approaches, and the status of the primary notion of context. The latter raises the issue of authenticity, suggesting a need to rethink what we mean by the notion. The important methodological distinction between a corpus-based and a corpus-driven approach to discourse serves to highlight key issues in the joint history of discourse linguistics and corpus linguistics. The paper is rounded off with a discussion of the benefits to be gained by a combination of discourse linguistic and corpus linguistic approaches and methods: each party can complement the other in constructive ways, to uncover new aspects of discourse that may suggest a reconsideration of our present understanding, and disclose our tacit assumptions about it.
1. Introduction
Discourse linguistics and corpus linguistics have an uneasy relationship because of their inherent ontological and epistemological differences. Yet, it is an established relationship, going back well into corpus-linguistic history, and one that both parties are highly motivated to keep up and develop, despite its many hazards and challenges. The aim of this paper is to contemplate some of the major stumbling blocks in this relationship. I set out to identify similarities and differences between the two approaches to the study of text and discourse, with reference to concrete research projects, in order to consider the ‘added value’ to be gained from combining methods. Keeping the theoretical and methodological discussion as general as possible, the label ‘discourse linguistics’ is here used as an umbrella term for discourse analysis, discourse studies, text linguistics, pragmatics, conversation analysis and other related approaches to the study of text and discourse. ‘Corpus linguistics’ here broadly refers to any linguistic framework which uses computer corpora as data and associated method of enquiry, irrespective of whether we are dealing with ‘linguistics’ of a particular kind (i.e. corpus ‘linguistics’, rather than
corpus ‘studies’). The focus is on the area of overlap between discourse linguistics and corpus linguistics.1
2. Major stumbling blocks to the relationship
The use of corpus data in analyses of text and discourse raises two issues: (i) the difference between a product and a process view of discourse; and (ii) the status of the textual, situational and socio-cultural context in the particular study. In discourse linguistics, the object of study is the process, rather than its outcome, the product. But it is this product that is stored in the form of a corpus. Furthermore, context is as important as the pieces of speech or writing under analysis, in investigations of discourse as process and as social action. But it is far from straightforward to figure out how linguists can best integrate this inherent aspect of discourse into studies of corpus data. While easy to identify, these two fundamental aspects of discourse, i.e. its process orientation and the interdependency within a particular context, still constitute the major stumbling blocks on the road towards ‘discourse and corpus linguistics’. Corpora are essentially static, consisting of records of spoken or written text that discourse linguists explore in the hope of being able to reconstruct the processes through which these products were shaped to serve particular communicative goals and to function as situated social action for interlocutors, readers and writers. Even though corpora increasingly code contextual information, the inherently dynamic character of context as instigating and affecting discourse, and being in turn created through discourse as social action, remains beyond the reach of corpus linguistics. An analysis of five complementary dimensions of discourse singled out in Section 5 reveals that not all of them will be equally accessible to users of corpus methods. And corpora can be of many different kinds, some more suited to investigations of discourse phenomena than others.
3. Rethinking authenticity
Discourse linguists and corpus linguists both rely on discourse data and each values authenticity, often understood in the sense of ‘real-life’ data, i.e. discourse that has been produced, used or co-constructed by people in a given communicative situation for particular purposes. Although widely used to justify the chosen method, the term ‘authenticity’ is far from straightforward, as testified by recent discussions across disciplines (see e.g. Gill 2008). Questions raised by Gill (2008), which are worth considering in any kind of study of discourse, include whether the data we are investigating are regarded as authentic because they seem, in one way or another, ‘original’, i.e. directly related to some kind of origin. But the dialogism of discourse makes such origins very difficult to define. Another question is whether we talk about ‘authenticity’ because we are, consciously or not, concerned with an object of discovery (in a corpus). What
about the values that we are, perhaps implicitly, attaching to the data at hand; are we, for instance, exploring something as ‘authentic’ in the sense of ‘desirable’ or ‘normative’, including or excluding what we will then interpret as less so? This question is all too familiar to students of EFL data, the status of the ideal native speaker being of central concern. Authenticity in linguistic enquiry may also refer to unedited, non-manipulated data, to discourse that is viewed as relatively spontaneous. It is indeed worthwhile to give these and other questions concerning the notion of authenticity due attention in studies of discourse, irrespective of whether we are using corpus data. One of the main problems is, however, that what is authentic in corpus studies need not be so in discourse studies, because of the status of context in the investigation. Linguists are repeatedly confronted with ethical issues connected to the procedures of collecting data and the extent to which they are at liberty to use such materials. This is especially acute in studies of impromptu speech. Choices have to be made between optimally ‘natural’ data and materials which bear a trace of metalinguistic awareness on the part of the interlocutors who are engaged in the particular discourse practices. Such decisions are bound to affect the degree of authenticity of our data. There is also the classic issue of ‘transcription as theory’ (Ochs 1979) in recontextualizing data for research purposes, highly relevant in both corpus linguistics and discourse linguistics. But students of writing are also confronted with problems of authenticity: corpora are the outcome of the processes of decontextualization and recontextualization of discourse. Our data are not the ‘original’ or ‘authentic’ pieces of writing that they represent, nor are we studying them in a communicative situation matching those of their writers or the expected readership. 
Even linguists vouching for unedited, non-manipulated discourse are still aware of the recontextualization processes that have taken place for the data to end up on their desks and screens. The dynamism of discourse is irretrievably lost in concordances, lists and samples of various kinds. Authenticity is also called into question when we make use of publicly available Internet data, unless we happen to occupy the dual role of discourse participant (‘user’ rather than ‘lurker’) and (external) observer of the discourse under construction. But the user role inevitably influences the discourse that we as linguists are hoping to investigate, which is a problem familiar to anthropological linguists and sociolinguists engaging in participant observation of discourse in particular situational and socio-cultural contexts. The status of collections of Internet data as corpora has recently been debated by corpus linguists wishing to benefit from the easy access to huge quantities of publicly available materials (for discussions of Internet data as corpora, see e.g. Baker 2006; Hoffmann 2007; contributions to Hundt et al. (eds.) 2006; Kehoe & Gee this volume; Yates 2001). The main problems include attempts to analyse computer-mediated conversation in lieu of offline discourse, rather than in its own right, and of course, the central issue of the lack of representativeness of the sample, which corpus linguists have to weigh up when considering any quantification of their data (see the discussion in Section 4). Discourse linguists
investigating Internet data will appreciate programs that register (i) the (lack of) simultaneity of interaction, and (ii) what appears on the screens of each discourse participant at any given stage of the interaction. It is also essential to have access to relevant information concerning other discourse activities, online and offline, in which users are engaged in parallel or between their individual attention spans (for discussions, see e.g. contributions to Herring et al. (eds.) forthcoming). Questions of authenticity come to the fore in historical linguistics, where studies of language change frequently suffer from a lack of (appropriate) data. Historical linguists, irrespective of whether they work with corpora or individual texts, are used to assessing the relative authenticity, in one or several of the senses referred to at the outset of this section, of the body of data that has survived through time, its internal and external comparability, and hence, their premises for conclusions. Judgements of the relative authenticity of historical data are based on what is known of their origins, relevant situational and socio-cultural contexts, and the extent to which such written records are deemed appropriate for analysis of reflections of spoken discourse (see the discussions in Kytö 2000; Wårvik 1990, 2003). In the following section, the concern is with the methodological differences between discourse linguistics and corpus linguistics, which again raise the issue of the uneasy balance between representativeness and availability of data.
4. Methodological differences: two kinds of discourse
A good place to start exploring the similarities and differences between discourse linguistics and corpus linguistics is with the section on ‘methods and materials’ typically found in studies of concrete linguistic phenomena. The conspicuous differences between discourse linguistics and corpus linguistics concerning the ways in which the methods and materials of the particular study are presented remind the reader of the two main scholarly paradigms prototypically associated with the natural sciences and anthropology. Linguistics, the study of language, is a very broad field indeed, encompassing both ‘hard’ and ‘soft’ scientific approaches. In corpus linguistics, the key notion is ‘frequency’. Even though linguists of other orientations also set out to quantify their data, there will often be decisive differences between their goals and methods and those of corpus linguists which will have a bearing on the results (see, for instance, Mair’s discussion in this volume of corpus linguistics and sociolinguistics). In contrast, discourse linguistics has not traditionally had quantification as its primary method. As the terminology needed to refer to non-quantitative research methods which try to account for text in context and the reflexivity of the contextualization processes is, however, largely missing, such studies are often misleadingly called ‘qualitative’. Both discourse linguists and corpus linguists do, of course, strive for qualitative analyses of their data; the difference lies in the fact that discourse linguists tend to prefer situated analyses of the particular, while corpus linguists
do so through quantification. What are therefore of interest to corpus linguists are the most frequent items in the data – and occasionally also the least frequent ones, in studies of absence, rather than presence, of linguistic elements – while discourse linguists may be able to learn from any instances that are relevant to their study. Hence, the size and the kinds of data necessary for the two different methodologies can be expected to vary considerably. (For a book-length discussion of the use of corpora in discourse analysis, see Baker 2006.) Discussions of data sampling and search procedures help readers of corpus studies to interpret the particular findings accordingly. Discussions of methods and materials in discourse-linguistic studies may similarly form sections in their own right in published work. Not infrequently, however, this information is integrated in the scholarly discussion of the phenomena at hand, as it is usually far less clear-cut and straightforward than the ‘methods and materials’ of corpus linguistic studies. While human language cannot, of course, constitute an object of study on a par with those typically found in the ‘hard’ sciences, where the analyst can be clearly separated from the data, it is still this paradigm that is reflected in the discourse of corpus studies. The discourse of discourse studies is different, as might be expected in light of the focus on the dynamic nature of the data and the theoretical and methodological choices made in delimiting and approaching the object of study. Discourse linguists have to come to terms with a high degree of causal indeterminacy in their studies. 
As a result of their expertise in analysing text and discourse in depth and their continued attempts to get to grips with the dynamism of text-context reflexivity, discourse linguists are highly aware of a fact which is relevant to studies of all orientations: that linguists are indeed constructing discourse through discourse, even when they are writing up the study itself. They therefore attempt to make this aspect of study explicit. Discourse linguists also tend to be very much aware of the status of introspection in their work, present in some form and at some stage in all linguistic enquiry, and they therefore make every effort to signal clearly a separation of speculative elements from findings, in the construction of the argument. The two discourses, those of corpus linguistics and discourse linguistics, constitute a source of possible misunderstandings between the practitioners of the two strands of language study. One of the decisions to make, in view of the purpose of a particular study, concerns the relative balance between representativeness and availability of data, already touched upon in Section 3. Because of the choices concerning quantification, corpus linguists and discourse linguists are likely to provide very different answers if confronted with the question ‘representative of what?’ Both know that their data can never be representative of ‘language as a whole’ but, in view of the need for quantification, corpus linguists rightly put a great deal of effort into ensuring that their materials are representative of some aspect of a particular construct. The representativeness of even very large corpora will, however, always be more problematic in view of the goals of discourse linguistics. In the rare instances where discourse linguists are able to conclude that their data are representative of what they want to study, they may not need
the data at all; usually, however, they cannot be sure that their materials are representative enough to warrant a great deal of generalization (see e.g. the discussion in Mair 1990: 14). And they know that one single text is likely to provide them with more insight into the use and structuring of language than they can ever hope to expose through their analyses of the particular. Problems of availability for them tend to be related to restrictions based on ethical issues, especially prevalent in studies of spontaneous spoken interaction, computer-mediated conversation, and (chains of spoken and written) institutional discourse in many societies. The availability of data may also be reduced on other grounds, such as copyright restrictions and legal constraints of various kinds, the (semi)private nature of much business communication and organizational discourse, or simply because the necessary materials have not survived through time. These problems will, however, affect corpus linguists and discourse linguists alike.
5. Possible points of convergence
This section explores possible points of convergence between corpus linguistics and discourse linguistics in terms of: (i) five different dimensions of discourse (see Enkvist 1984; Virtanen 1997), and (ii) two methodologically different approaches to corpora (see Sinclair 2004). Discourse linguistics has, over the years, undergone a remarkable expansion of focus. With the discursive turn in social sciences, the relative weight of the reflexive text-context pair of notions has shifted towards its second member. The context to be taken into account in studies of text and discourse has expanded enormously, from co-text (linguistic context) and a particular situational context, to society and culture at large, to the extent that the latter are now judged to be relevant to the study. Yet all dimensions of discourse are still with us and equally relevant, irrespective of their chronological order of appearance on the discourse-linguistic scene - simply because they serve to accomplish different analytical tasks. Situated analyses of discourse practices in text and talk rely on contextualization cues exhibited in the linguistic signals that are present or absent in a piece of discourse. Starting from (i) a ‘structural’ dimension, present in much work on textuality, we can proceed to (ii) a ‘content-based’ dimension, typically opted for in rhetorically-oriented studies. The ‘cognitive’ dimension (iii) is omnipresent in studies of text and discourse, and it can thus be specifically foregrounded where expedient. The ‘interactional’ dimension (iv), originating in studies of spontaneous speech, cuts across much of the current discussion of discourse phenomena, highlighting the dynamism of discourse practices in both speech and writing. And the ‘socio-cultural’ dimension (v), too, demands consideration of the reflexivity of text and discourse. 
In (v), the focus is on the situational and socio-cultural contexts in which people jointly engage and re-engage in social action through discourse, and in performances through which discourse takes shape; the concern is with ways of
(co-)constructing such contexts and adapting to them, and of maintaining or altering them through discourse. It is obvious that these five dimensions of discourse are not all equally accessible to users of corpus-linguistic methods. In view of the discussion of the status of context in such investigations, corpus-linguistic approaches can be expected to focus predominantly on the structural aspects of discourse and the various content-based phenomena apparent in text and talk. In contrast, the interactional and socio-cultural dimensions of discourse lend themselves less well to corpus studies because what is examined here is the dynamism of discourse as social action. The study of discourse processes and other cognitive issues increasingly has recourse to corpus data but often to ends that are not of primary concern to the corpus linguist. Sinclair’s (2004) distinction between ‘corpus-based’ and ‘corpus-driven’ approaches constitutes another relevant starting point for the discussion of corpus and discourse linguistics. The ‘corpus-driven’ approach is reminiscent of that of conversation analysts, while the ‘corpus-based’ approach is more in line with much work in text and discourse linguistics and pragmatics.
5.1 Fields of mutual interest
Corpus linguistic and discourse linguistic studies have benefited from one another in a number of fields of mutual interest. These include (i) variation across texts and discourses, (ii) textual and pragmatic collocation, and (iii) the intricacies of spoken interaction. The first of these, the discovery of distributional patterns, is the domain of corpus linguistics par excellence. Investigations of linguistic variation place high demands on corpus design. But variation is also of central importance in the study of text and discourse. Discourse linguists have benefited from corpus-linguistic methods to study variation across texts and discourses, including variation across time in historical linguistics. The usual text classifications include text/discourse types, genres, registers, styles and modes, while fictionality can also constitute a dividing line between text categories (for corpus studies of various kinds of variation across texts and discourses, see e.g. Biber 1988; Dorgeloh 2004; Granger (ed.) 1998; Semino and Short 2004; Stubbs 1996; Taavitsainen 1997). The notions employed in text and discourse categorization are not straightforward, however, and linguists of both orientations should continue to give full attention to decisions in this regard. Some divisions have long been standard in corpus design. Thus, it is only recently that speech and writing have started to appear in the same corpus, and multimodal corpora are likely to grow in importance, along with the current interest in Internet data. Both corpus-based and corpus-driven methods are used in discourse-oriented studies of linguistic variation. In historical corpus linguistics, the models tend to come from our understanding of present-day discourse phenomena, the combination of which has renewed the field of historical linguistics over the past thirty years. 
Corpus-driven approaches, again, invite linguists to explore historical data in their own right, which may facilitate the interpretation of the
findings. Variation is also an important issue in studies of ongoing language change, as can be witnessed, for instance, in data from online contexts (but for corpus-methodological concerns, see the discussion in Section 3). The pros and cons of the two approaches, corpus-driven and corpus-based, to variation across texts and discourses are crystallized in the following two quotations from the relevant literature. The first one serves as an argument for the adoption of corpus-driven methods; the second emphasizes the risk of misinterpretation in approaches that do not take into account fundamental distinctions between categories of text and discourse based on text-internal criteria. “…despite theoretical frameworks that are general enough, descriptions are too dependent on the text and discourse type.” (Sinclair 2004: 67) “So determinative of detail is the general design of a discourse type that the linguist who ignores discourse typology can only come to grief.” (Longacre 1996: 7) If we are interested in the inherent hybridity of discourse and the processes of hybridization (Fairclough 1992), the point of departure must be some kind of categorization of discourse. If, in contrast, we start from large amounts of “uncontaminated text” (Sinclair 2004: 191), we cannot study hybridization per se, at least not until we have identified categories that emerge from the data. Longacre’s point about linguists running the risk of comparing apples and oranges if discourse typology is not taken into account has proven to be crucial in studies of text and discourse, irrespective of the kind of text or discourse categorization we are working with (for a discussion of variation across texts and discourses in the light of text type and genre, see Virtanen, forthcoming). 
Corpus-driven studies promise to uncover categorizations of text and discourse which differ from those in focus in corpus-based studies; though both methods are likely to point to some of the most basic distinctions such as the difference between narrative and non-narrative text. Other distinctions likely to emerge even when using corpus-driven methods include that between ‘evocative’ and ‘operational’ discourse (cf. Enkvist 1985), and between common, and at times adjacent, genres of everyday life (such as news and reviews, or gossip and jokes). The second area in which corpus linguists and discourse linguists happily meet is in collocational patterns. Access to very large corpora and the Internet has resulted in something of a renaissance in the study of collocation. Texts and discourses exhibit collocation in the very concrete sense of words that like each other’s company. The default definition of collocation as the “co-occurrence of words with no more than four intervening words” (Sinclair 2004: 141; 1991) allows us to contemplate them in novel ways, starting from what is present in texts and discourses of various kinds and ignoring for a moment the constraints of grammar. Contextual issues come to the fore when we note that collocational
patterns vary according to discourse type, genre, register and style. But new categorizations are also likely to emerge through the study of collocation in large bodies of data. Firth’s early interest in matters of context invites us to study collocation in relation to the context-of-situation and the cultural context. Extending the scope of ‘collocation’ and ‘colligation’ (Firth 1968) from a sentence-grammatical study of word and tag sequences in a given corpus to entire texts allows us to study ‘textual’ and ‘pragmatic’ collocation. While many linguists select relatively narrow search spans to avoid overwhelming problems of insufficient precision in the procedure, the possibility of varying the search span is of great interest in the study of text and discourse as it helps us to explore collocational phenomena which operate over sentence boundaries. In addition to the study of relatively overt textual collocation, we may be alerted to implicit relations that are not readily noticed using traditional methods. Such pragmatic collocation is of major relevance to the study of text and discourse. This is a field of study where corpus-driven approaches promise new insights into an aspect of text and discourse that is “not subject to any conventions of linguistic realizations, and so is subject to enormous variation, making it difficult for a human or a computer to find it reliably” (Sinclair 2004: 144-145, on ‘semantic prosody’; cf. also the discussion of ‘semantic preference’ in Sinclair 2004: 142; for discussions of textual and pragmatic collocation, see Virtanen 2005; Östman 2005). For the analyses to be meaningful, however, linguists need access to very large bodies of data (cf. the discussion in Sinclair 1991). 
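The effect of varying the search span can be illustrated with a minimal Python sketch. It is an illustration only, not the procedure of Sinclair, WebCorp or any other tool cited here; the function names `tokenize` and `collocates`, the parameter `max_intervening` and the three-sentence sample are all hypothetical. The tokenizer is deliberately crude and discards sentence boundaries, so a wider span reaches collocates in neighbouring sentences, which is the behaviour relevant to textual collocation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, max_intervening=4):
    """Count words co-occurring with `node` with at most `max_intervening`
    words between them, on either side, ignoring sentence boundaries."""
    span = max_intervening + 1  # number of positions inspected on each side
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


# Hypothetical sample text, invented for this illustration.
sample = ("The bank extended credit to the firm. "
          "Give credit where credit is due. "
          "The card company checks your credit rating.")
tokens = tokenize(sample)

# Adjacent words only, versus the four-intervening-word default.
narrow = collocates(tokens, "credit", max_intervening=0)
wide = collocates(tokens, "credit", max_intervening=4)
print(narrow.most_common(3))
print(wide.most_common(3))
```

With the narrow span only immediate neighbours are counted, whereas the wider span also picks up words from adjacent sentences, at the cost of more noise. Raw counts of this kind would still need an association measure and a far larger body of data before any claim about collocation could be made.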
In this light, the opportunities are now very different from the times of early monitor corpora: huge quantities of text on the Internet can be subjected to investigations of regular co-occurrences of words, also in terms of the two extended senses of collocation, textual and pragmatic. This endeavour is facilitated by tools such as WebCorp (see Renouf et al. 2007). It is, however, crucial to verify the nature of the reliance of such interfaces on existing search engines, so that the results can be interpreted accordingly. Search engines may, for instance, retrieve particular kinds of web data while excluding others, such as discussion boards, blogs or chat rooms, which has important implications for the results of the study. Corpus-driven analyses of collocation have been suggested as a point of departure for cognitive text linguistics (de Beaugrande 2004: 24-26). The hypothesis is that a meaning which is conspicuous in a particular co-text reflects processes of multiple activations in networks with other meanings. Collocation is thus assumed to constitute the ‘missing link’ between language and discourse, explaining why people know what a word of a given language potentially means out of context, while still using and interpreting it in a specific sense in a particular discourse context. Equally interesting for discourse-linguistic purposes would be the prospect of extending the recent corpus-linguistic notion of lexical ‘repulsion’ between word pairs (Renouf & Banerjee 2007) to cover potential ‘textual and pragmatic repulsion’, while still trying to eliminate, in appropriate ways, the all too numerous search results that such an expansion would inevitably involve. In the identification of potential repulsion manifest in texts, added precision might come
through the consideration of the contextual notions of genre and register. Findings about linguistic repulsion are also likely to disclose important aspects of textual silence, not least if related to discourse-linguistic insights into discourse types and styles. Hence, pairs of connectors occurring across units of text of various sizes might be hypothesized to show ‘textual repulsion’ in relation to discourse type or genre. Applications would thus seem to include new ways of narrowing the scope of lexical searches on the web. Investigations of ‘pragmatic repulsion’, again, might take into account sets of lexical items that manifest highly implicit patterns of repulsion vis-à-vis particular function words (such as signals of negation or wh-items). Studies of pragmatic repulsion would necessitate very large bodies of data, and as with explorations of pragmatic collocation, they only seem possible using corpus linguistic methods. A third field of mutual interest, impromptu speech (as well as less unplanned face-to-face interaction), is an area where corpus-based studies have been successful. It is a paradox that this is also the area where corpus compilation is especially cumbersome, and problems of authenticity are foregrounded in the transcription process; not to mention ethical issues that accompany the process of recording spontaneous speech. However small-scale, such corpora still offer linguists a rich source of insight into the workings of planned and unplanned speech. Linguistic elements that have been identified as serving discursive or pragmatic functions of various kinds have been explored in corpus data. This strand of research has given particles and routine expressions a central status in linguistic enquiry, thus extending their study beyond the ground-breaking work by the early enthusiasts of discourse markers and pragmatic particles. 
The starting point has often been a set of predetermined lexical items, selected on the basis of earlier work in discourse linguistics. Important corpus-based studies in this area include those originating in the Lund circle directed by Jan Svartvik, who computerized and analysed the LLC (see e.g. Svartvik 1979; Aijmer 1996; Stenström 1994). Several of its members have subsequently extended this strand of corpus analysis to other corpora and compiled corpora of their own. Brinton (2008), Culpeper and Kytö (1999), and Wårvik (1990) investigate discourse markers and pragmatic particles in historical data, in written records of various kinds that are assumed to reflect some degree of spokenness or orality. Instead of starting from predetermined lexical items, which may have the disadvantage of severely delimiting potential findings, corpus-based studies of spoken interaction have at times chosen as a point of departure a particular discourse-organizing function, such as topic management or conversational openings and closings, or a communicative function, such as disagreeing or making requests (cf. Holmes and Stubbe 2003 on power and politeness manifest in a corpus of workplace discourse). Studies of interaction focusing on politeness and (inter)subjectivity are, however, predominantly grounded in situated discourse analysis because, as Hunston (2004:186) points out concerning evaluation, “reliable automatic identification and quantification can be carried out on only a limited set of realizations”. Situated socio-cultural performance of
politeness and affect through discourse seems beyond the reach of corpus linguistics. Corpora of spoken discourse have offered new insight into the study of overlapping speech, prosody and intonation. But searches over relatively large quantities of data, where possible and expedient, still involve a high risk of misinterpretation, while close-up, context-related analyses of individual occurrences are of less interest to the corpus linguist preferring to rely on large bodies of data. For instance, it is important to keep in mind that not all overlaps are necessarily recognized as interruptions by interlocutors in a given speech situation. The hazards of interrupting and being interrupted constitute a fundamental aspect of face-to-face interaction but their investigation necessitates situated in-depth analyses. Wichmann’s work on discourse intonation (e.g. 2004) shows how demanding a corpus-based study of spoken discourse is and how important it is to connect the findings to close, context-related observations of particular occurrences in the data. It can be expected that corpus studies of spoken interaction will continue to be conducted along with manual analyses of the particular. Despite fundamental methodological differences, corpus linguistics and discourse linguistics manifest a good number of shared interests and concerns, thus potentially contributing to one another in important ways. Let us therefore turn to some of the most problematic areas in attempts to combine the two approaches.
5.2 Areas of unease
It is in the core areas of the study of text and discourse that corpus-based and corpus-driven analyses have little to offer, simply because it may not be possible to find what the discourse linguist wishes to explore or because the findings point to what we already know. Such areas have to do with (i) text structure or discourse organization, (ii) text-context reflexivity and (iii) situated analyses of ‘doing genre’. The most or least frequent instances are not the primary concern of the discourse linguist trying to determine how coherence works for interlocutors, as individuals and members of groups of various kinds; how words link to worlds and worlds to words simultaneously through discourse; or what kinds of action, or discourse practices people in various interlocutor roles set out to perform and adapt to through discourse in particular situational and socio-cultural contexts. And the processes of co-constructing discourse communities and various communities of practice, or those of (re-)engaging, face-to-face or online, in the ‘discursive struggle’ that is formative of our identities – all of these phenomena are of less value to the corpus linguist trying to get to grips with linguistic variation across established or emergent genres, or with distributional patterns of other aspects of the use of language in as large a sample as possible of representative computerized data. Text structure and discourse organization constitute a shared interest between corpus linguists and discourse linguists. But it is difficult to come up with quantitative findings which respect the inherent dynamism of discourse unless methods are combined so that an in-depth analysis of discourse is also
conducted, and typically a large part of the counting will have to be manual. Small but specialized corpora are easier to handle here but generalizability, essential in corpus linguistics, is then not possible and corpus-driven analyses are not applicable. Studies of corpus data have a lexical focus, which highlights explicitness in the signalling of discourse organization. Yet there are many other cues to discourse phenomena that need to be accounted for if we are to model ways in which people construct coherence, context and culture, through discourse that is at the same time affected by context and culture. The obstacle in the relationship between corpus linguistics and discourse linguistics is the issue of text-context reflexivity, which does not readily lend itself to static analyses of decontextualized data in the form of the linguistic output of situated discourse events which have been recontextualized as a corpus. This fundamental aspect of discourse as process and as social action is a familiar issue to the context-sensitive discourse linguist planning how best to approach the object of study. Central to the study of discourse are people’s intertextual and interdiscursive repertoires, which are constructed, recycled and altered through discourse, in always new and unique communicative situations. The communicative and social contributions of discourse type and genre construction can be accounted for in terms of such repertoires as well as intertextual and interdiscursive chains appearing across texts and discourses. It is through discourse that genres emerge and evolve, as interlocutors keep mediating them in particular communication situations in which they co-construct and make use of them, for and through social action. And it is in discourse that a small number of types or modes are exhibited which facilitate discourse processing and serve the communicative goals of its interlocutors. 
Corpus linguists investigate explicit signals, or the lack thereof, of established or evolving conventions; but the issue of what people set out to do, with and through genres and discourse types in particular situational and socio-cultural contexts, perforce lies beyond the reach of corpus-linguistic analyses. Hence, the development and change of genre conventions is a popular corpus-linguistic topic, while the social action of ‘doing genre’ is more likely to be adopted for study in situated analyses of discourse data.
5.3 A Happy Ending
Corpus-based studies of discourse phenomena may help us to get to grips with cohesion, rather than coherence. Also, aspects of positionally-defined thematic structure will be easier to examine than the intricate interplay of given and new information. Vocabulary-based analyses can help single out rhetorical units pertaining to structure and content. Interactional signals, and to some extent relevant socio-cultural cues, are typically approached through predetermined sets of lexical items. What all of this suggests is a focus on textuality, rather than the dynamic, situated nature of discourse. Corpus-driven studies of collocation and other semantic relations in text, too, disclose co-textual, rather than contextual, information. Even though discourse linguists will be able to make informed guesses on the basis of the outcome of corpus-driven studies, this process is not,
strictly speaking, concomitant with the idea of ‘uncontaminated’ text, guiding such approaches. Practitioners of corpus-based and corpus-driven methods differ in their views of the status, scope and nature of context in the investigation. Similar differences also exist in discourse linguistics. A good deal of context is inferable from the text; yet, corpus-based and corpus-driven analyses might not give access to such information in the way a situated analysis of ongoing discursive struggle in a particular instance of interaction does. But interaction is not only a characteristic of spoken language; writing, too, can be overtly interactive. Corpus linguists can gain insight into interaction, for instance, by analysing corpora consisting of text-based computer-mediated discussions. Yet here too, approaches to interaction that are informed by dialogism are likely to benefit less from corpus study than the monologistic frameworks traditionally adopted in corpus linguistics. Discourse studies tend to require compilation of specialized corpora, which run the risk of being too small to be of interest to corpus linguists. But small-scale corpora may also occasionally provide discourse linguists with findings that are all too familiar to them for the corpus-linguistic methods to be of relevance in the enquiry. Further, small corpora are of no use in corpus-driven studies, which instead demand very large bodies of data to be able to show the existence of systematic lexical and grammatical patterns, which, it is hoped, might serve to ground analyses of (inter)textual relations and contextualization cues. Ultimately, the size and kinds of corpus data will have to be thoroughly (re-)assessed according to the discourse-linguistic goals. The relationship between corpus linguistics and discourse linguistics is thus destined to continue to be a troubled one. 
While not yet necessarily pointing towards a ‘happy ending’ of any kind, there has, however, recently been an increase in the number of corpus-linguistic investigations of discourse structure. Textbooks in corpus linguistics have hitherto included an odd page on discourse or pragmatics, introducing a few studies of explicit, not infrequently predetermined, lexical signals that have been shown to serve pragmatic functions. Similarly, edited volumes of corpus-linguistic enquiry have at times included a chapter or two on discourse organization, usually oriented towards lexical relations identified in or across texts. Recent volumes clearly attempt to remedy the scarcity of corpus-linguistic studies of discourse phenomena. In addition to a larger number of investigations based on lexical elements, we now also find more focus on prosody and discourse intonation (cf. the contributions to Ädel and Reppen (eds.) 2008, Flowerdew and Mahlberg (eds.) 2009, and Partington et al. (eds.) 2004; Baker 2006). There is often a decisive element of manual, in-depth analysis of text and talk in corpus-based studies of discourse phenomena, while appropriate parts of the study are carried out by computer (see Biber et al. 2004; Biber et al. 2007; Du Bois 2007; Reppen et al. (eds.) 2002; Thomas and Wilson 1996; Wichmann 2004). This avenue remains an option in terms of added value of results or in the potential for developing and testing software for the purposes of discourse-linguistic enquiry.
6. Concluding remarks
The main differences between corpus linguistics and discourse linguistics are ontological and epistemological. Corpus linguists and discourse linguists set out to describe and explain very different realities, sustain very different views of what constitutes evidence, and have different views of the kinds of claims that can be made. There is not much to be done about these differences; they are intrinsic. But linguists working within one or the other framework would do well to give thought to these basic differences given the goals of their studies and the concrete decisions that they are making during the research process. With reference to the five dimensions of text and discourse singled out in this paper, it is obvious that not all are equally accessible to practitioners of corpus linguistics. And what can be operationalized in view of a meaningful corpus study is not necessarily news to discourse linguists. Despite attractive solutions ranging from discourse-sensitive tagging to the compilation of focussed corpora, consisting of entire texts where possible, the main problem on the road from discourse to corpora and back again remains the lack of contextual dynamism. It is only through due attention to discourse as process and social action that investigations succeed in truly taking into account the bidirectional relation between actual texts and pieces of discourse, and their situational and socio-cultural contexts. Yet there is a benefit in attempting to combine the two approaches, and developments in software motivate linguists of various orientations increasingly to opt for new avenues in their chosen field of study. In principle, combining methods from corpus linguistics and discourse linguistics allows us to explore the workings of discourse in novel ways. 
In practice, this would seem to involve inclusion in one and the same study of two kinds of analyses: an in-depth context-sensitive analysis of text and discourse, and a corpus-based and/or corpus-driven investigation of some identifiable linguistic elements (or the lack thereof), suggested by the preceding discourse analysis as worthwhile candidates for quantification in a given body of data. Alternatively, a corpus-driven study can greatly benefit from subsequent enrichment by a close analysis of some of its results in a particular discourse context. Complementary or conflicting findings are both welcome: they offer new insights, disclose tacit assumptions and suggest reconsideration of our present knowledge of discourse. An understanding of the premises and goals of both fields will, however, be crucial for a harmonious and happy relationship between corpus linguistics and discourse linguistics.
Note
1 This paper is based on an extensive discussion in my chapter entitled ‘Corpora and discourse analysis’ in Corpus Linguistics: An International Handbook, edited by Anke Lüdeling and Merja Kytö, to be published by Mouton de Gruyter.
References
Ädel, A. and R. Reppen (eds.) (2008), Corpora and discourse: the challenges of different settings. Amsterdam: Benjamins.
Aijmer, K. (1996), Conversational routines in English: convention and creativity. London: Longman.
Baker, P. (2006), Using corpora in discourse analysis. London: Continuum.
De Beaugrande, R. (2004), ‘Language, discourse, and cognition: retrospects and prospects’, in: T. Virtanen (ed.), Approaches to cognition through text and discourse. Berlin: Mouton de Gruyter. 17-31.
Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D., E. Csomay, J.K. Jones and C. Keck (2004), ‘Vocabulary-based discourse units in university registers’, in: Partington et al. 23-40.
Biber, D., U. Connor and T.A. Upton (2007), Discourse on the move: using corpus analysis to describe discourse structure. Amsterdam: Benjamins.
Brinton, L.J. (2008), The comment clause in English: syntactic origins and pragmatic development. Cambridge: Cambridge University Press.
Culpeper, J. and M. Kytö (1999), ‘Modifying pragmatic force: hedges in Early Modern English dialogues’, in: A.H. Jucker, G. Fritz and F. Lebsanft (eds.), Historical dialogue analysis. Amsterdam: Benjamins. 293-312.
Dorgeloh, H. (2004), ‘Conjunction in sentence and discourse: sentence-initial And and discourse structure’, Journal of Pragmatics 36: 1761-1779.
Du Bois, J.W. (2007), ‘The stance triangle’, in: R. Englebretson (ed.), Stancetaking in discourse: subjectivity, evaluation, interaction. Amsterdam: Benjamins. 139-182.
Enkvist, N.E. (1984), ‘Contrastive linguistics and text linguistics’, in: J. Fisiak (ed.), Contrastive linguistics, prospects and problems. Berlin: Mouton de Gruyter, 45-67.
Enkvist, N.E. (1985), ‘A parametric view of word order’, in: E. Sözer (ed.) Text connexity, text coherence: aspects, methods, results. Hamburg: Helmut Buske. 320-336.
Fairclough, N. (1992), Discourse and social change. Cambridge: Polity Press.
Firth, J.R. (1968), Selected papers 1952-1959. Ed. by F.R. Palmer. London: Longman.
Flowerdew, J. and M. Mahlberg (eds.) (2009), Lexical cohesion and corpus linguistics. Amsterdam: Benjamins.
Gill, M. (2008), ‘Authenticity’, in: J-O. Östman and J. Verschueren (eds.), Handbook of Pragmatics. Amsterdam: Benjamins. Available also in J-O. Östman and J. Verschueren (eds.) (2005-), Handbook of pragmatics online. Amsterdam: Benjamins, at http://www.benjamins.com/online/hop
Granger, S. (ed.) (1998), Learner English on computer. London: Longman.
Halmari, H. and T. Virtanen (eds.) (2005), Persuasion across genres: a linguistic approach. Amsterdam: Benjamins.
Herring, S.C., D. Stein and T. Virtanen (eds.) (forthcoming), Handbook of the pragmatics of computer-mediated communication. Berlin: Mouton de Gruyter.
Hoffmann, S. (2007), ‘Processing Internet-derived text: creating a corpus of Usenet messages’, Literary and Linguistic Computing, 22 (2): 151-165.
Holmes, J. and M. Stubbe (2003), Power and politeness in the workplace. London: Longman.
Hundt, M., N. Nesselhauf and C. Biewer (eds.) (2007), Corpus linguistics and the web. Amsterdam: Rodopi.
Hunston, S. (2004), ‘Counting the uncountable: problems of identifying evaluation in a text and in a corpus’, in: Partington et al. 157-188.
Kytö, M. (2000), ‘Robert Keayne’s Notebooks: a verbatim record of spoken English in early Boston?’ in: S.C. Herring, P. Van Reenen and L. Schøsler (eds.), Textual parameters in older languages. Amsterdam: Benjamins, 273-308.
Longacre, R.E. (1996), The grammar of discourse. 2nd ed. New York: Plenum Press.
Mair, C. (1990), Infinitival complement clauses in English: a study of syntax in discourse. Cambridge: Cambridge University Press.
Ochs, E. (1979), ‘Transcription as theory’, in: E. Ochs and B.B. Schieffelin (eds.), Developmental pragmatics. New York: Academic Press. 43-72.
Östman, J-O. (2005), ‘Persuasion as implicit anchoring: the case of collocations’, in: H. Halmari and T. Virtanen (eds.), 183-212.
Partington, A., J. Morley and L. Haarman (eds.) (2004), Corpora and discourse. Bern: Peter Lang.
Renouf, A. and J. Banerjee (2007), ‘The search for repulsion: a new corpus analytical approach’, in: P. Pahta, I. Taavitsainen, T. Nevalainen and J. Tyrkkö (eds.), Studies in variation, contacts and change in English. VARIENG, University of Helsinki. Accessed 22 September 2008 at http://www.helsinki.fi/varieng/journal/volumes/02/renouf_banerjee/
Renouf, A., A. Kehoe and J. Banerjee (2007), ‘WebCorp: an integrated system for web text search’, in Hundt et al. (eds.), 47-68.
Reppen, R., S.M. Fitzmaurice and D. Biber (eds.) (2002), Using corpora to explore linguistic variation. Amsterdam: Benjamins.
Semino, E. and M. Short (2004), Corpus stylistics: speech, writing and thought presentation in a corpus of English writing. London: Routledge.
Sinclair, J. (1991), Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J. (2004), Trust the text: language, corpus and discourse. London: Routledge.
Stenström, A-B. (1994), An introduction to spoken interaction. London: Longman.
Stubbs, M. (1996), Text and corpus analysis. Oxford: Blackwell.
Svartvik, J. (1979), ‘Well in conversation’, in: S. Greenbaum, G. Leech and J. Svartvik (eds.), Studies in English Linguistics for Randolph Quirk. London: Longman, 167-177.
Taavitsainen, I. (1997), ‘Genre conventions: personal affect in fiction and nonfiction in Early Modern English’, in: M. Rissanen, M. Kytö and K. Heikkonen (eds.), English in transition: corpus-based studies in linguistic variation and genre styles. Berlin: Walter de Gruyter. 185-266.
Thomas, J. and A. Wilson (1996), ‘Methodologies for studying a corpus of doctor-patient interaction’, in: J. Thomas and M. Short (eds.), Using corpora for language research: studies in the honour of Geoffrey Leech. London: Longman. 92-109.
Virtanen, T. (1997), ‘Text structure’, in: J. Verschueren, J-O. Östman, J. Blommaert and C. Bulcaen (eds.), Handbook of pragmatics. Amsterdam: Benjamins. Available also in J-O. Östman and J. Verschueren (eds.) (2005-), Handbook of pragmatics online. Amsterdam: Benjamins, at http://www.benjamins.com/online/hop
Virtanen, T. (2005), ‘Polls and surveys show: public opinion as a persuasive device in editorial discourse’, in: Halmari and Virtanen (eds.), 153-180.
Virtanen, T. (in press), ‘Corpora and discourse analysis’, in: A. Lüdeling and M. Kytö (eds.), Corpus linguistics: an international handbook. Berlin: Mouton de Gruyter.
Virtanen, T. (forthcoming), ‘Variation across texts and discourses: theoretical and methodological perspectives on text type and genre’, in: H. Dorgeloh and A. Wanner (eds.), Approaches to syntactic variation and genre. Berlin: Mouton de Gruyter.
Wårvik, B. (1990), ‘On the history of grounding markers in English narrative: style or typology?’ in: H. Andersen and K. Koerner (eds.), Historical linguistics 1987: papers from the 8th international conference on historical linguistics. Amsterdam: Benjamins. 531-542.
Wårvik, B. (2003), ‘When you read or hear this story read: issues of orality and literacy in Old English texts’, in: R. Hiltunen and J. Skaffari (eds.), Discourse perspectives on English: medieval to modern. Amsterdam: Benjamins. 13-55.
Wichmann, A. (2004), ‘The intonation of please-requests: a corpus-based study’, Journal of Pragmatics 36: 1521-1549.
Yates, S.J. (2001), ‘Researching Internet interaction: sociolinguistics and corpus analysis’, in: M. Wetherell, S. Taylor and S.J. Yates (eds.), Discourse as data: a guide for analysis. Milton Keynes: The Open University. 93-146.
'Tis well known to barbers and laundresses: Overt references to knowledge in English medical writing from the Middle Ages to the Present Day
Turo Hiltunen and Jukka Tyrkkö
Research Unit for Variation, Contacts, and Change in English (VARIENG)
University of Helsinki
Abstract
The discursive representation of knowledge, the fundamental objective of scientific inquiry, reflects underlying epistemic conditions of scientific thought (Bates 1995). Knowledge is communicated in scientific writing by means of lexical choice, discourse conventions and the organization of information. Over the long history of vernacular medicine, the writers of each era – from scholasticism and empiricism to evidence based medicine – have had their own perspectives on knowledge, revealed by the discursive practices they employed. Lexical items referring to the concept of knowledge (e.g. knowledge, information, doctrine) are investigated from the late Middle English period to Present-day English. We analyze variation and change in the lexicon of knowledge and analyze the discursive contexts in which the terms appear, showing how these have changed over time in different subgenres within learned medicine. The study makes use of several medical corpora with a total word count of roughly one million words: the MEMT is used for the Middle English period, and a selection of texts from the EMEMT corpus (articles from the Philosophical Transactions and other contemporary medical texts) represent the Early Modern English period. For the PDE period, we use a selection of research articles from academic journals and texts from the Medicor.1
1. Introduction and background
From the very beginning of organized scholarship, knowledge has been the primary objective of learned activity. While the understanding of what constitutes knowledge and how one should go about gaining it has changed over the centuries, knowledge has remained the yardstick by which the learned judge one another. Medicine, the oldest field of learning with a continuous written history in the vernacular (Taavitsainen and Pahta 2004), has always had a characteristically dichotomous relationship to knowledge. On the one hand, medicine has always been studied theoretically; on the other, medical knowledge has always had a practical application in the healing of the sick. According to the Canon of Avicenna,2 the most important collection of medical texts in the Middle Ages, “Medicine is the science by which the dispositions of the human body are known so that whatever is necessary is removed or healed by it, in order that health should be preserved or, if absent, recovered.”
This study examines how overt references to knowledge have changed in medical writing from the beginning of vernacular medicine in the late fourteenth century to the present day. Underlying the research question is the claim by French (2003) that presenting oneself to the public and professional colleagues as a “rational and learned physician” was often the main enterprise of Late Medieval and Renaissance physicians – sometimes even at the expense of actually acquiring knowledge. On this basis, it is reasonable to presume that medical writers,3 as a discourse community (see, e.g. Swales 1990: 24-27) with a vested interest in regulating references to knowledge, would always make assertions about the act of knowing or the possession of knowledge deliberately and precisely. Using a series of diachronic medical corpora to examine proportional changes in different classes of nouns and verbs in the field of knowledge, we demonstrate that knowledge references are employed differently in different historical periods. The changing styles of scientific thought, which correspond more or less with these periods, have been identified and used in scholarship under a variety of names. This study follows a popular model which distinguishes four main periods: scholasticism, identified with axiomatic and authority-based knowledge; empiricism, characterized by observation-focused knowledge; rationalism, during which reasoning and ideational constructs came to the forefront; and finally constructivism, typified by the analytical testing of hypotheses (cf. Taavitsainen and Pahta 1998). Given the scope of the paper, this scheme is naturally a generalization, and individual fields of science, let alone fields of learning, may have descriptive models specific to their particular histories. Methodologically this study combines historical discourse analysis with corpus linguistics.
Our approach starts by defining a lexical field, follows with an investigation of its attestations in a series of historical corpora, and finally interprets these as evidence of changes in the discourse of science in different periods in history.
2. Method
The main research question of this study is whether, over a long time line, the occurrences in a corpus of lexical items representing a given conceptual field can be understood to reflect underlying paradigm changes in scientific thinking. The starting point for this hypothesis is that the conceptual field in question has to be lexically attested at a reasonably high frequency and, further, that the field can be considered central to fundamental ways of thinking. In our estimation, the conceptual field of knowledge serves such a purpose in scientific writing. Because discursive features are not annotated in the corpora we use, reaching this goal requires that the phenomena under investigation be described in a way that facilitates meaningful corpus searches. Our solution is to
focus on overt references to knowledge, that is, passages explicitly evoking the concept by using a particular kind of lexical item. These passages can be retrieved from the corpora, once a list of all relevant lexical items has been established. This operationalisation comes with the caveat that the investigation is restricted to passages featuring knowledge words. Those that do not contain any of the search words are not considered, even if they point to “knowledge” by some discursive means. This in turn means that our analysis provides information about overt references to knowledge, that is, about the way in which medical writers evoke the concept of knowledge by using certain lexical items. In our view, such references are not necessarily directly linked to the amount of knowledge that the texts contain, but are rather matters of writing style and as such particularly revealing about the underlying thought style. To study references to a given conceptual field in a corpus is essentially to examine all the lexical items which can be taken to semantically belong to that field. Although this premise is in itself straightforward, it presents three challenges to be addressed before the examination of corpus evidence can begin. First, the conceptual field has to be defined clearly. This task is not easy, particularly in the case of abstract concepts which are especially prone to being approached from a variety of different theoretical perspectives, resulting in overlapping and, at times, contrasting interpretations of conceptual constructs. Once the field has been defined, its lexical composition needs to be determined. In a diachronic study, this involves paying attention to both lexical and semantic changes that occur over time. Finally, the instantiations of those lexical items in the corpora have to be retrieved, a process which involves careful examination of spelling variants, particularly for ME material, and the ruling out of homonyms. 
Although the objective of the study is to examine knowledge references in medical discourse, the lexical field cannot reasonably be limited only to items denoting the core sense of episteme or objective, stable knowledge (realized through lexical items such as know, understand, etc.). While we chose to discard references to knowledge claimed through pure belief, it was apparent that references to knowledge systems (doctrine, science, etc.) and practically oriented knowledge (cunning, craft, etc.) were not to be left out. Lexical items referring to units of itemized knowledge (data, information), a feature closely associated with modern scientific writing, were also included. On the other hand, lexical items which refer exclusively to the adjacent semantic fields of teaching and learning were excluded as we judged them to stray too far from the central issue of how medical authors have positioned themselves in relation to knowledge. Any of the lexical items included can of course be used instructively. Items belonging to the field of doxa (i.e. subjective knowledge through faith or belief) were left out altogether. The sense of each occurrence of pertinent lexical items was evaluated individually in context. Items were included in the analysis if the sense was judged knowledge-related. Finer-grained semantic differences, such as those given in the OED, were not identified for individual lexical items.
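The retrieval step described above can be illustrated with a minimal sketch. It is not the procedure used in this study, which relied on manual checking of word lists and concordances; the variant table, the function name `concordance` and the context width are all hypothetical, and the spellings are simply those that happen to occur in the examples quoted in this paper. The sketch only retrieves candidate passages: the sense of each hit would still have to be evaluated by hand, as explained above.

```python
import re

# Hypothetical variant table: the spellings are those attested in the
# examples quoted in this paper; a real table would be far longer.
VARIANTS = {
    "knowledge": ["knowledge", "knowledg", "knowlech"],
    "cunning": ["cunning", "connynge", "konnynge"],
}


def concordance(text, variants=VARIANTS, width=30):
    """Return (lemma, left context, matched form, right context) tuples."""
    form_to_lemma = {
        form.lower(): lemma
        for lemma, forms in variants.items()
        for form in forms
    }
    # Longest forms first, so that a full spelling is preferred over a prefix.
    forms = sorted(form_to_lemma, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lemma = form_to_lemma[match.group(1).lower()]
        hits.append((lemma, left, match.group(1), right))
    return hits


# Abridged fragments of two Late Middle English passages quoted in this paper.
sample = (
    "connynge to wyrke in postumes and konnynge to teche to wyrke in woundys. "
    "Thow schalt also haue knowlech"
)
for lemma, left, form, right in concordance(sample):
    print(f"{lemma:10s} {left:>30s} [{form}] {right}")
```

Grouping the variant spellings under a single headword keeps the counts comparable across periods, while the printed context lines are what an analyst would read to rule out homonyms and senses outside the field.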
In several cases, the issue of polysemy became central. Because of the way the conceptual field was delimited, senses primarily related to cognition or simple practical ability were ruled out. With some lexical items the majority of occurrences had to be discarded as belonging to a different semantic field. A good example of this phenomenon is wit, which as a verb can in most cases be classified as a lexical item of knowledge. The corresponding noun, however, predominantly falls under the semantic field of cognition, as in example 1:
(1) SLuggy & slowe, in spetynge muiche, Cold & moyst, my natur ys suche; Dull of wit, & fatt, of contnaunc strange, fflewmatyke, þis complecion may not change.
LME: Practical Verse4
To further clarify the semantic categorization, we consulted the respective sections of the Historical Thesaurus of English (hereafter HTE).5 The HTE categorization for knowledge appeared to largely coincide with ours, with the exception that some lexical items of practical knowledge were not to be found under relevant section headings. However, our reading of the primary material clearly confirmed that lexical items such as cunning were frequently used to mean practical skill arising from learned knowledge (see example 2). We, therefore, included such items in the study:
(2) But in specyall ther ar v þat ys to say connynge to wyrke in postumes and konnynge to teche to wyrke in woundys and konnynge to wyrke in vlceres and festurys and old sorys and cankyrs and connynge to restore flesch agayne and awoyd place with medycyns.
LME: Book of Surgery
At the same time, many of the lexical items listed under the relevant sections in the HTE either were not words of knowledge in the way we use the term, or were not attested in our data, and were therefore excluded from further analysis. Following these criteria, references to the semantic field of knowledge, as defined above, are realized in the corpora using 17 nouns and 3 verbs. Spelling variants of each were discovered through consulting the Oxford English Dictionary and cross-referencing with the full word lists of all pre-PDE corpora, and all occurrences were retrieved (see section 4).
3. The Data
Our approach treats lexical items denoting knowledge and knowing as correlates of the scientific thought style, and we expect to find variation in their frequency
and distribution in medical texts on a par with changes in the thought style. The investigation of this hypothesis is based on a series of corpora that represent different periods in the history of medical writing in English. A major factor in choosing a suitable corpus for the analysis was availability: we wanted to make use of existing corpora to the extent it was possible. Some of the available corpora met these requirements: the MEMT corpus for the late Middle English period, the ARCHER corpus for the 19th century, and the Medicor for Present-day English. No finalized corpus of medical writing is presently available for the Early Modern period, but to examine the full time line of vernacular medical writing we filled in the gap between MEMT and ARCHER with a selection of 17th century texts from the forthcoming Early Modern English Medical Texts (EMEMT) corpus. Our study focuses on the learned end of medical writing. In the LME and EModE corpora this includes both texts written by university educated physicians and practitioners without institutional credentials (see e.g. Wear 1998 and Siraisi 1997), while 19C, PDE1 and PDE2 corpora represent university-based medicine exclusively. Within this category, journal articles and other scholarly writing were considered separately, as the available corpora enabled such a distinction for two periods. The corpora used in this study consist of learned medical texts from the Late Middle English period to the present day. The material comes from six different samples, which together cover four periods, as shown in Table 1. The aggregate size of the corpora is ca. 1.1 million words.
Table 1: Corpora used in this study
Corpus    LME        EModE1     EModE2     19C        PDE1       PDE2
Timeline  1375-1500  1650-1700  1665-1713  1820-1905  1983-1997  2001-2005
Texts     39         36         153        40         63         64
Words     221,646    245,839    195,226    83,970     197,010    252,685
The Late Middle English subcorpus (LME) is a sample from the Middle English Medical Texts corpus (Taavitsainen et al. 2005), containing all the texts in the categories Surgical texts and Specialized treatises. The Early Modern English subcorpus consists of two parts. The first part (EModE1) contains texts from two categories in the forthcoming Early Modern English Medical Texts corpus, General treatises and Surgical treatises. The second part (EModE2) contains articles on medical topics from the Philosophical Transactions of the Royal Society, also to be included in the EMEMT corpus. The nineteenth century subcorpus (19C) consists of all texts included in the category Medicine in the ARCHER corpus. All the texts in this sample come from the Edinburgh Medical Journal (see Biber et al. 1994).
The Present-day English data is again divided into two subcorpora. The first sample (PDE1) contains all texts in three categories of the Medicor corpus: Handbooks, Textbooks, and Editorial articles (Vihla 1998). The second sample (PDE2) contains 64 medical research articles from eight different medical journals representing the specialisms of surgery and orthopaedics. The subcorpora are not of equal size, and the 19th century in particular is represented by a smaller dataset than the other periods under investigation. This is because we did not have access to corpora representing medical writing of the period other than the ARCHER, and time constraints did not permit us to collect supplemental material. We take this into account in our analysis, by using normalized frequency counts per 1,000 words. All searches were carried out using WordSmith Tools 4.
4. Results
The uses of nouns and verbs of knowledge were analyzed separately.6 To provide a more accurate description of the use of relevant lexical items, one further level of categorization was introduced in each group. Data on nouns with different semantic characteristics were considered separately, and verbs are discussed in relation to their actors. Results of corpus searches in each category are provided, and the most interesting developments are discussed and illustrated with examples.
4.1 Nouns
Nouns in the lexical field of knowledge can be divided into several distinct groupings on the basis of their semantic properties. While some, like knowledge and understanding, refer to the underlying concept on a general level, others have more specific ranges of reference. For the purposes of detailed analysis, we distinguish four groups of nouns:7
General knowledge nouns: knowledge, understanding, wit, wisdom, reason
Nouns denoting knowledge as a learned ability: cunning, craft, skill, mastery
Nouns denoting knowledge as a system: art, mastery, science, practice, doctrine, model, theory
Nouns of itemized knowledge: data, information
Tables 2-5 show the frequencies of groups of nouns in different corpora. The first line shows the raw frequency, and the second the frequency normalized to 1,000 words of running text. Considering each of the noun groups separately, we can observe important changes in their frequencies over time. Taking general nouns under investigation first, we can see in Table 2 that, from the Late Middle English period onwards, there is a gradual decrease in the frequency of these words continuing all the way to the Present-day English corpora.
Table 2: Frequency of general nouns
                 LME    EModE1  EModE2  19C    PDE1   PDE2
Raw frequency    308    165     80      18     51     52
Per 1,000 words  1.39   0.67    0.41    0.21   0.26   0.20
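The normalization behind the second row of each table can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function names are hypothetical, and because the subcorpus word counts are not repeated here, the second function only back-calculates an approximate size from the rounded figures in Table 2.

```python
def per_thousand(raw_count, corpus_words):
    """Normalized frequency: occurrences per 1,000 words of running text."""
    return 1000 * raw_count / corpus_words


def implied_corpus_size(raw_count, rate_per_thousand):
    """Approximate word count implied by a raw count and its rounded rate."""
    return 1000 * raw_count / rate_per_thousand


# Table 2 (general nouns): (raw frequency, frequency per 1,000 words)
table2 = {
    "LME": (308, 1.39),
    "EModE1": (165, 0.67),
    "EModE2": (80, 0.41),
    "19C": (18, 0.21),
    "PDE1": (51, 0.26),
    "PDE2": (52, 0.20),
}

for period, (raw, rate) in table2.items():
    size = implied_corpus_size(raw, rate)
    print(f"{period}: roughly {size:,.0f} words implied")
```

Because the published rates are rounded to two decimals, the implied sizes are only rough, especially for the small 19C sample.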
In the late medieval period, general nouns denoting knowledge typically occur in passages where some piece of knowledge is explicitly indicated to be useful or necessary to the reader, as in example (3).
(3) Thow schalt also haue knowlech þat he þat is wunt to ete twyis on þe day, and aftyr chongyth þat dyete and takyth hym to o mele, it is very certeyn þat it schal turne hym to noyauns.
LME: Þe Priuyte Of Priuyteis
In the Early Modern English data, passages of this kind are no longer common. Instead, we find general nouns in first-person narrative accounts, where the writer of the text speaks of his own knowledge (example 4).
(4) There, Sir, are all the Observations I have been able to collect yet: if any thing else material shall hereafter come to my knowledg about these matters, I shall not fail to impart them, God permitting.
EModE2: Glanvill (1669) ‘Observations concerning the Bath-Springs’ The Philosophical Transactions, 4, 49, p. 982
In our PDE data, general nouns occur predominantly in passages indicating a gap in the present state of knowledge, which the research article intends to fill (5):
(5) To our knowledge, no studies of PMF effects on in vivo contusive spinal cord injury (SCI) models have been reported.
PDE2: Crowe et al. (2003) ‘Exposure to Pulsed Magnetic Fields Enhances Motor Recovery in Cats After Spinal Cord Injury’. Spine 28, 24, p. 2660-6.
Nouns in the second group, which denote knowledge as a learned ability, are few in the Late Middle English corpora, and in later periods they are all but absent, except for a few sporadic occurrences (Table 3).
Table 3: Frequency of skill nouns
                 LME    EModE1  EModE2  19C    PDE1   PDE2
Raw frequency    83     10      11      2      4      0
Per 1,000 words  0.37   0.04    0.06    0.02   0.02   0.00
This suggests that while practical knowledge was a relevant part of the lexicon of knowledge in the late medieval period (as in example 6), it no longer appears as such in our data from later periods.
(6) Þerfor þe significaciouns ar to be taken of þe beyng or essencion of þe sekenes which þof all þai be þe bigynnyng and grounde of al þe arte and crafte of medycyne and a parte þer of.
LME: De Ingenio Sanitatis
The picture is more varied for nouns in the third group, nouns denoting knowledge systems (Table 4). It seems that there is a small decrease starting in the Early Modern English period and continuing to the 19th century, but the frequency of these nouns in the Present-day English data is again almost the same as in the LME period.
Table 4: Frequency of system nouns
                 LME    EModE1  EModE2  19C    PDE1   PDE2
Raw frequency    146    141     69      37     90     162
Per 1,000 words  0.66   0.57    0.35    0.44   0.46   0.64
But even while the overall frequency of the noun group remains stable, there are changes in the relative importance of individual nouns within the group. This is particularly obvious when we compare the distributions of two individual nouns, doctrine and model. In the LME data, the noun doctrine is the most frequently attested noun in this group (55 instances, 0.25 per 1,000 words) (example 7). In later periods the frequency decreases steadily and there are no occurrences of the noun in PDE research articles.
(7) But neuerþelattere in þe þridde doctrine of þis same chapitre schal be told in partie of þe pannycles þat beþ vndir þe scolle, closinge þe brayn.
LME: Chirurgie De 1392
The noun model shows almost entirely the reverse pattern of development. Apart from two occurrences in EModE1, the noun is attested only in the PDE data, and it is by far the most common noun denoting knowledge systems in both corpora (59 instances, 0.30 per 1,000 words in PDE1; 141 instances, 0.56 per 1,000 in PDE2 (example 8)).
(8) The model of demineralized bone matrix (DBM)-induced bone formation recapitulates the cell biology of endochondral ossification seen during embryogenesis and fracture healing.
PDE2: Ciombor et al. (2002) ‘Low frequency EMF regulates chondrocyte differentiation and expression of matrix proteins’. Journal of Orthopaedic Research 20, 1, p. 40-50.
Finally, the first instances of nouns of itemized knowledge are found in the Early Modern English corpora, after which there is a dramatic increase in their frequency in the later periods (Table 5).
Table 5: Frequency of nouns of itemized knowledge
                 LME    EModE1  EModE2  19C    PDE1   PDE2
Raw frequency    0      1       16      12     183    497
Per 1,000 words  0.00   0.00    0.08    0.14   0.93   1.96
This increase coincides with important changes in the dominant research paradigm of medical science, and probably reflects the development towards modern clinical medicine, where the focus is increasingly on results and measurements (example 9), as well as on the implications that they may have for clinical practice and further research (example 10). As the table shows, these nouns are particularly common in Present-day research articles.
(9) The data from our series demonstrate the paramount importance of the extent of the neurological injury for the prediction of the functional outcome.
PDE2: Zelle et al. (2004) ‘Functional Outcome Following Scapulothoracic Dissociation’ Journal of Bone & Joint Surgery, 86, 1, p. 9-16.
(10) Identifying the immediate operative-related risks of instrumented interbody fusion can provide useful information for approach selection.
PDE2: Scaduto et al. (2003) ‘Perioperative Complications of Threaded Cylindrical Lumbar Interbody Fusion Devices: Anterior Versus Posterior Approach’ Journal of Spinal Disorders & Techniques, 16, 6, p. 502-507.
The results show that an overall change takes place in the discourse of knowledge over the centuries. Significantly, this phenomenon is not only a matter of overall frequency change, but can be attributed more specifically to developments within medical discourse, as shown by the comparison of data from the four groupings of nouns.
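The point that the change is not only one of overall frequency can be made concrete by computing, for each period, the share that each noun group contributes to the summed normalized frequency. The sketch below is illustrative only: it uses the per-1,000-word rows of the tables captioned Table 2 to Table 5, and the short group labels and the function name are hypothetical.

```python
# Normalized frequencies (per 1,000 words) copied from Tables 2-5.
PERIODS = ["LME", "EModE1", "EModE2", "19C", "PDE1", "PDE2"]
RATES = {
    "general":  [1.39, 0.67, 0.41, 0.21, 0.26, 0.20],
    "skill":    [0.37, 0.04, 0.06, 0.02, 0.02, 0.00],
    "system":   [0.66, 0.57, 0.35, 0.44, 0.46, 0.64],
    "itemized": [0.00, 0.00, 0.08, 0.14, 0.93, 1.96],
}


def group_shares(period):
    """Share of each noun group in the summed rate for one period."""
    i = PERIODS.index(period)
    total = sum(values[i] for values in RATES.values())
    return {group: values[i] / total for group, values in RATES.items()}


for period in PERIODS:
    shares = group_shares(period)
    summary = ", ".join(f"{g} {s:.0%}" for g, s in shares.items())
    print(f"{period}: {summary}")
```

On these figures, general nouns make up a little over half of the LME total, whereas nouns of itemized knowledge make up about 70% of the PDE2 total.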
4.2 Verbs
Next, we move to the significantly more limited lexical field of knowledge verbs. From the ME period onward, the corpora attest only three verbs in this lexical field: know, understand, and wit. The last of these, wit, is only found in the ME and EModE periods, predominantly in formulaic constructions (“it is to wit”, etc.).8 The three verbs are treated as a single lexical field and not subdivided. To study the use of overt verbal references as a reflection of changes in the underlying thought style, we focused on two indicators: overall usage of knowledge verbs and semantic changes in the actor or agent of knowledge verbs.
Over the timeline, the usage of knowledge verbs shows a steadily declining trend until the mid 18th century, after which the frequency appears to level off (see figure 1). The overall decline in the use of knowledge verbs roughly coincides with the timeline associated with the changing of scientific paradigms. Although the observation is partly explained by the overall increase of nominalization particularly in scientific writing from the late seventeenth century onward (see Halliday 2004; Banks 2003),9 the specific nature of the lexical field of knowledge may have contributed to the steep decline.
[Figure 1: frequency axis from 0 to 2.5; periods LME, EModE1, EModE2, 19C, PDE1, PDE2]
Figure 1: Frequency of knowledge verbs across the corpora (1/1000 words)
The scholastic tradition, which persisted in medical writing until the middle of the Early Modern period, is noted for the high level of didactic and author centred discourse (see Wallis 1995, Taavitsainen and Pahta 1998). Our findings support this view, showing frequent use of deontic modal constructions involving know or understand, as well as the formulaic constructions “it is to wit” or “it is to know”.
(11) It is to wete þat in flebotomie 4 þyngis are principalli attendid: sc., custome, tyme, age, & vertue.
LME: Phlebotomy.
(12) When þu hast ete þi mete, be ware þu ete not eftsonis, vn-til þi mete bifore receiuid be perfitely digestid. And when þat is, þu shalt knowe by .ij. tokenis. One is when þine appetite cummith to þe ayene after þi mete which þu hast receyuid. Anoþir tokin: if þi spettel be sotel, and li3tly will destende in to þi mouth.
LME: Regimen sanitatis.
The declining use of knowledge verbs in the Early Modern period can be interpreted as a reflection of the gradual replacement of the gnostic tradition of knowledge with the epistemic (see Bates 1995), the first major shift of scientific paradigm. As the primary discursive purpose of references to knowledge changes from reinforcing established authorities to evaluating knowledge in light of observations and methodology, the need for verbs explicitly denoting the act of knowing can be expected to decline, a view our corpus evidence appears to support.
The second major shift in scientific thought styles, from Empiricism to Rationalism, comes through in the data. The discovery of new clinical methods, coinciding with the 19C part of the corpus, appears to have changed the way knowledge was discussed in medical writing. As the focus of medical writers shifted from natural philosophy to knowledge derived from increasingly accurate clinical data, the occasions for using knowledge verbs decreased notably.
It is worth keeping in mind that the development of the academic register of writing is a somewhat separate issue from the changing underlying scientific paradigms. While modern scientific practice owes most to Empiricism and subsequent styles of thought, at least some of the stylistic features associated with modern science writing appear to have been established at a slightly earlier date. The gradual stabilization of academic writing came about not only as a result of ideational developments, but also of social and technological developments. From the 17th century onward, the ever-strengthening role of learned societies and universities, the establishment of academic printing in the vernacular and the wider circulation of learned titles all resulted in the development of relatively uniform, genre-specific stylistic features that we today associate with academic writing.
Our findings, showing a clear decline in the use of knowledge verbs until the EModE2 period followed by a relatively steady level thereafter, support the view that at least some stylistic discourse features may have begun to stabilize by the seventeenth century (see Halliday 2004).
4.3 Knowers
In order to take a closer look at the overall patterns of knowing, we were interested in examining whom medical writers of different periods have seen fit to associate with the act of knowing; in other words, whose knowledge has been considered worth mentioning, whether in the positive or negative. To facilitate a systematic analysis, we categorized the semantic role of actors of knowledge verbs – i.e. knowers – into six groups according to the approximate level of knowledge they appeared to represent (see table 6). At the top of the system we placed references to God as the infallible knower, at the bottom references to the layman. In between, we ranked ancient authorities, the author himself, the community of professional medicos, and the reader, in descending order. No distinction was made according to the specific training or background of the medical practitioner; accordingly, class four includes university trained doctors, surgeons, barber-surgeons, and apothecaries.
Table 6: Classification of types of knower
Class  Label              Lexical attestations
1      Divine             Direct reference to God or Christ.
2      Authority          In general (e.g. “auctores”, “the ancients”) or by name, such as Galen, Hippocrates, Avicenna, etc.
3      Author             First person singular
4      Medical community  Direct reference to medical or scientific community or to a specific subsection, such as physicians, surgeons, etc. Can be indicated through the use of first person plural, passive voice, etc.
5      Reader             Second person singular or direct reference to reader, or more specifically, as in 'young physicians'
6      Laymen             By direct reference to a non-medical profession such as “laundresse” or “fishmonger”, or to a generic actor (“boy”, “any man”, etc.)
Under this model, knower classes do not imply a qualitative assessment about the factual correctness of the actor’s knowledge. For example, if the actor of a knowledge verb is an ancient authority it does not necessarily follow that the sentence presents that authority figure as someone who knows (see example 16). Using this system of classification, we examined all knowledge verbs in the corpora (figure 2 and table 7).
[Figure 2: percentage axis from 0% to 100%; periods LME, EModE1, EModE2, 19C, PDE1, PDE2; legend All/lay, Addressee, Prof. Comm., Author(s), Authority, Divinity]
Figure 2: Knowledge verbs classified by type of actor
Table 7: Knowledge verbs classified by type of actor
Subjects  Divinity  Authority  Author(s)  Prof. Comm.  Addressee  All/lay
LME       3         7          13         52           351        19
EModE1    3         6          77         53           111        71
EModE2    0         12         47         74           13         17
19C       0         2          10         19           0          5
PDE1      0         0          1          95           1          4
PDE2      0         0          12         75           0          2
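The proportions plotted in Figure 2 can presumably be recovered from the raw counts in Table 7 by dividing each cell by its row total. The following sketch shows that arithmetic; the variable and function names are hypothetical, and the assumption that the figure was derived from exactly these counts is ours.

```python
# Raw counts from Table 7, one row per subcorpus.
ACTORS = ["Divinity", "Authority", "Author(s)", "Prof. Comm.", "Addressee", "All/lay"]
COUNTS = {
    "LME":    [3, 7, 13, 52, 351, 19],
    "EModE1": [3, 6, 77, 53, 111, 71],
    "EModE2": [0, 12, 47, 74, 13, 17],
    "19C":    [0, 2, 10, 19, 0, 5],
    "PDE1":   [0, 0, 1, 95, 1, 4],
    "PDE2":   [0, 0, 12, 75, 0, 2],
}


def actor_percentages(period):
    """Percentage of knowledge verbs per actor type within one subcorpus."""
    row = COUNTS[period]
    total = sum(row)
    return {actor: 100 * n / total for actor, n in zip(ACTORS, row)}


for period in COUNTS:
    shares = actor_percentages(period)
    top = max(shares, key=shares.get)
    print(f"{period}: most frequent actor is {top} ({shares[top]:.1f}%)")
```

On these counts, the addressee accounts for roughly 79% of the subjects of knowledge verbs in LME, and the professional community for roughly 84% in PDE2.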
The vast majority of knowledge verbs in the Late Medieval subcorpus are found to occur with deontic modals, indicating a didactic preoccupation. In such instances the subject of the verb is usually the intended reader, whom the author, positioning himself as a teacher, instructs. Another common strategy is to list the things a member of a particular professional community (physicians, surgeons, apothecaries) is expected to know or be able to do. In these instances, we class the subject under ‘addressee’ if the context makes it clear the nominal reference is used didactically (as in example 13) and not as an assertion of shared understanding about a medical issue.
(13) A surgian muste knowe þat alle bodies þat ben medlid vndir þe sercle of þe moone, ben engendrid of foure symple bodies, her lijknes ech in oþere medlyng.
LME: Lanfranc, Chirurgia Magna 1
The second most common actor of knowledge verbs is the professional community, usually manifest syntactically through passive constructions. Here the tone of the discourse is less imperative, and the function of the reference is usually to indicate that a given piece of knowledge is held by all members of the community as a fact. A typical attestation of the type is found in descriptions of illnesses and their signs:
(14) If þe discrasie be hote, which is knowen bi redne3 & vesicacioun; make colde þe place no3t bi iusquiamy ne bi mandrake, as seiþ G, for þai colde tomych. bot with rosis, plantage & vnguento albo, which infrigideþ moderately driand.
LME: Chauliac, Wounds
One of the more interesting findings concerns the discursive strategy employed in ME references to ancient authorities. Against expectations, corpus evidence shows that the collocative relationship between the names of authority figures and knowledge verbs is relatively weak, and that instead the knowing of such authorities is expressed much more frequently through speech act verbs, particularly say – a practice Taavitsainen (2001: 45-46) ascribes to the virtually infallible status of such authorities’ knowledge, which needs no reinforcement with a knowledge verb (example 15).
(15) Avicenna seiþ þat membres beþ bodyes imaade of þe firste mellinge of humours; oþir, as it is iseide super Iohannitium, a membre is a stedfast and a sad partye of a beest icompouned of þinges þat ben liche oþir vnliche, and is i-ordeynede to somme special office.
LME: Trevisa, On the properties of things
An analogy can be drawn to biblical language, where the word of God is generally expressed through speech act verbs. The actual use of the divine subject (e.g. “God only knows” etc.) is extremely rare in medical writing, showing only three attestations during both the ME and EModE periods and none thereafter. From the beginning of the Early Modern period, scholasticism began to steadily lose ground to the new and frequently iconoclastic paradigm of empiricism. Somewhat surprisingly, changes in the style of scientific thought appear to be reflected in medical writing by an increase, rather than a decrease, of references to ancient authorities as knowers. Significantly, however, the increase comes with a change in polarity, whereby passages referring to an ancient authority as a knower are increasingly used to point out their mistakes and lack of knowledge (cf. McMullin 1985: 17):
(16) And as for Campher, Galen knew it not. Avicen saith expressely of Campher, that although it bee odorata, yet it is frigida.
EModE1: Jorden, A Discovrse of Natvrall Bathes and Minerall Waters (1631), p. 27.
First person singular subjects appear significantly more frequently than in the ME period. Often the discursive function is to assert the personality of the author, and to use his personal authority to make a point.
(17) I know, and am well assured, that Physicians would frequently advise their Patients to stoving and bathing, had they them in their own houses.
EModE1: Cock, Miscaelanea Medica (1675), p. 37.
Another explanation for the increasing use of the first person singular subject is the empirical paradigm of the personal observation, which often took the form of narrative. In the Philosophical Transactions, for example, many accounts of firsthand medical observations are presented as first-person narratives.
(18) Antimony will recover a Pig of the Measles; by which it appears to be a great purifyer of the Blood. I knew a Horse, that was very lean and scabbid, and could not be fatted by any keeping, to whom Antimony was given for two Moneths together every morning, and that upon the same keeping he became exceeding fat.
EModE2: ‘A Letter lately written by an observing person to a Friend of the Publisher, concerning the vertue of Antimony’ (1668) The Philosophical Transactions 3, 39, p. 774
In the light of our data, Early Modern medical writing (EModE1 and EModE2) also reflected the empirical mindset by representing laymen as people who could be seen as possibly possessing knowledge valuable to the medical community. This practice continues in the 19C subcorpus, but appears only infrequently in PDE1 and PDE2.
(19) ‘Tis commonly known to Barbers and Laundresses, that the same Pump-Water will not so well and uniformly or without little Curdlings, dissolve Wash-balls and Soap, as Rain-Water, and some running Waters usually will.
EModE2: ‘An Account of the Honourable Robert Boyle’s way of examining Waters as to Freshness and Saltness’ (1693) The Philosophical Transactions 17, 196, p. 631
Another major shift can be seen in the 19C subcorpus. Overt references to knowing declined considerably and were increasingly expressed through passive constructions. The frequency of references to the medical community as knower increased, reflecting the increasingly organized and institutional nature of the medical profession.
(20) Strange as it may read, cases are known where the illness merely leads to indisposition, with headache, giddiness, and a bubo in the neck, groin, or armpit.
19C: Robertson, ‘Notes on an outbreak of plague’ (1905)
When it comes to verbal knowledge references, modern medical writing largely follows the trend set in the 19th century. In some respects, modern articles also appear similar in style to the early research articles of the late 17th century. When knowing is mentioned, it is often presented in terms of explaining things which are not yet known and realized through negative polarity (example 21). By doing so, modern medical authors contextualize their findings in terms of the broader field of learning, thereby adding credibility to their own findings by showing areas which are yet to be examined.
(21) Neuroglial cells seem also to be an important mediator for the normal metabolism of neurons, although little is known in this respect.
PDE1: Angevine, The nervous tissue (1986)
As with the analysis of nouns, the closer examination uncovered domain-specific discursive practices which help explain the more general frequency changes over the timeline. The decreasing use of knowledge verbs hides a significant transformation in discursive strategy, from the reader-oriented style of the Late Middle Ages to the community-oriented discourse of the Present Day.
5. Conclusions
This study provides compelling evidence for the changing patterns of overt references to knowledge over a long period of time. The overall trends are clear: the frequencies of both nouns and verbs of knowledge are the highest in the late Middle English data, and considerably lower in later periods, with the exception of nouns denoting itemized knowledge. In part, the decrease can be explained by the waning of the influence of the scholastic thought style on medical writing. In the late Middle English data, overt references to knowledge are mostly encountered in didactic passages which are aimed at the reader of the text. Such passages are characteristic of late Middle English medical writing, but they are no longer common in the Early Modern period and virtually disappear thereafter. The drop in the frequency of knowledge words may therefore be partly attributed to the fact that from Empiricism onward there are fewer contexts in which these words may be used. At the same time, new openness to novel ideas opened the door to seeing even the layman as someone with valuable knowledge.
However, there are other issues that come into play apart from the decline of scholasticism in explaining trends that are observed in frequency data. Partly as a result of the general cultural outlook of the Renaissance and partly in consequence of the growth of printing which identified individual authors more closely than before with their works, the position of the contemporary author as an original and authoritative knower strengthened markedly. Here, in particular, the commercial aspects of medicine highlighted by French (2003) tie in with the business of publishing (cf. Furdell 2002), for the sharp increase in references to the author’s personal knowledge can be seen not only in light of a change in scientific paradigm, but also as a deliberate attempt to assert personal authority for financial reasons. Over the following two centuries, the authority of the individual gradually shifted over to the professional community. The developing register of modern scientific writing began to favour an increasingly nominal style largely devoid of expressions of personal opinion (see Banks 2003). In Present Day medical writing, knowledge is overwhelmingly discussed from this perspective. Additionally, the use of nouns of itemized knowledge increases sharply, a development that can be attributed to the nature of modern clinical medicine and particularly the associated advances in measuring technology. The number of overt references to knowledge is not directly related to how much information a text contains. Rather, we see them primarily as an aspect of writing style, which is contingent on the context in which texts are produced. Therefore the fact that we have observed a decline in the use of verbs and most noun categories does not directly tell us anything about the information content, but it gives us some insight into how that information is expressed. 
In fact, these results could be interpreted as evidence for the increasing certainty about the propositions that are made in the texts. As pointed out already by Lyons (1977: 809), categorical assertions are epistemologically the strongest kind of statements, and Biber’s (2004: 126) study suggests that reliance on such statements has indeed increased in medical prose in the last two centuries. Therefore, it makes sense that modern research articles (whose information content is unquestionably high) only refer to knowing when something is common knowledge in the field, or when something is not known, but not in making a claim for new knowledge. The results of this exploratory study suggest that the approach to discourse analysis we have adopted, based on the analysis of a clearly delineated conceptual field and the investigation of the associated lexical items in corpora, is a viable model with potential future applications in the diachronic study of the expression of ideas. The findings are made particularly interesting by the fact that while they agree with the major results of earlier research, the corpus-driven nature of the method sheds light on unexplored discourse features. Moreover, our study is able to suggest new hypotheses which could account for changes taking place between individual periods, as well as in the relative importance of individual words.
Notes
1 This study was conducted with funding by the Research Unit for Variation, Contacts, and Change in English at the University of Helsinki, funded by the Academy of Finland.
2 Avicenna, Liber Canonis, Book 1, chap 1, F. 1r (Venice, 1507; facsimile, Hildesheim, 1964).
3 As reflected in the composition of the corpus (section 3), we focus on the learned end of medical writing. Although the spectrum of the medical profession was wide and varied until the Enlightenment, the more learned writers can be reasonably approximated as a discourse community.
4 Text labels of the LME corpus refer to the short titles used in the MEMT corpus (Taavitsainen et al. 2005).
5 The Historical Thesaurus of English is available online at http://libra.englang.arts.gla.ac.uk/historicalthesaurus/. We are grateful to Prof. Christian Kay and Dr. Irene Wotherspoon for giving us advance access to the section ‘Knowledge’ of the Historical Thesaurus of English.
6 Adjectives and adverbs denoting knowledge were not included in this study.
7 Mastery is to be found classified as both nouns of learned ability and system nouns. The individual occurrences were evaluated on a case-by-case basis. Lexical items denoting medical signs (e.g. sign, symptom, accident) were not considered units of itemized knowledge in this study. On the use of sign terminology in ME and EModE medical writing, see Tyrkkö (2006).
8 See Taavitsainen and Pahta (1997). Notably, by the ME period English no longer lexically marked the semantic difference between “knowing of” and “knowing about”, attested in Germanic languages (e.g. German kennen and wissen) and in Romance languages (e.g. French connaître and savoir).
9 For a study of nominalization specifically in Early Modern medical writing, see also Tyrkkö and Hiltunen (forthcoming).
References
Banks, D. (2003), ‘The evolution of grammatical metaphor in scientific writing’, in: L. Ravelli, A-M. Simon-Vandenbergen and M. Taverniers (eds.) Grammatical metaphor: views from systemic functional linguistics. Amsterdam: Benjamins. 127–148.
Bates, D. (1995), ‘Scholarly ways of knowing: An introduction’, in: D. Bates (ed.) Knowledge and the Scholarly Medical Traditions. Cambridge: Cambridge University Press. 1–22.
Biber, D., E. Finegan and D. Atkinson (1994), ‘ARCHER and its challenges: Compiling and exploring a representative corpus of historical English registers’, in: U. Fries, G. Tottie and P. Schneider (eds.) Creating and using English language corpora. Amsterdam: Rodopi. 1–14.
Biber, D. (2004), ‘Historical patterns for the grammatical marking of stance’, Journal of Historical Pragmatics, 5, 1: 107–136.
EMEMT = Early Modern English Medical Texts. In preparation.
French, R. (2003), Medicine before Science. The Business of Medicine from the Middle Ages to the Enlightenment. Cambridge: Cambridge University Press.
Furdell, E.L. (2002), Publishing and Medicine in Early Modern England. Rochester: University of Rochester Press.
Halliday, M.A.K. (2004) [1988], ‘The Language of Physical Science’, in: J.J. Webster (ed.) The Language of Science. London: Continuum. 140–158.
HTE = Historical Thesaurus of English (forthcoming). Available online at http://libra.englang.arts.gla.ac.uk/historicalthesaurus.
Lyons, J. (1977), Semantics. Volume 2. Cambridge: Cambridge University Press.
McMullin, E. (1985), ‘Openness and Secrecy in Science: Some Notes on Early History’, Science, Technology, & Human Values, 10, 2: 14–22.
MEMT = Middle English Medical Texts. 2005. Compiled by I. Taavitsainen, P. Pahta and M. Mäkinen. CD-ROM. Amsterdam: Benjamins.
Oxford English Dictionary. 2004-. Online. J. Simpson (ed.). Available at http://www.oed.com/
Siraisi, N. (1997), Medieval & Early Renaissance Medicine. An Introduction to Knowledge and Practice. Chicago and London: University of Chicago Press.
Swales, J. (1990), Genre Analysis. English in academic and research settings. Cambridge: Cambridge University Press.
Taavitsainen, I. (2001), ‘Language History and the Scientific Register’, in: H-J. Diller and M. Görlach (eds.) Towards a History of English as a History of Genres. Heidelberg: C. Winter. 185–202.
Taavitsainen, I. and P. Pahta (1997), ‘The Corpus of Early English Medical Writing: Linguistic Variation and Prescriptive Collocations in Scholastic Style’, in: T. Nevalainen and L. Kahlas-Tarkka (eds.) To Explain the Present: Studies in the Changing English Language in Honour of Matti Rissanen. Helsinki: Société Néophilologique. 209–225.
Taavitsainen, I. and P. Pahta (1998), ‘Vernacularization of Medical Writing in English: A Corpus-Based Study of Scholasticism’, Early Science and Medicine 3. 157–185.
Taavitsainen, I. and P. Pahta (2004), ‘Vernacularization in Scientific and Medical Writing’, in: I. Taavitsainen and P. Pahta (eds.) Medical and Scientific Writing in Late Medieval English. Cambridge: Cambridge University Press. 1–22.
Tyrkkö, J. (2006), ‘From tokens to symptoms: 300 years of developing discourse on medical diagnosis in English medical writing’, in: M. Dossena and I. Taavitsainen (eds.) Diachronic Perspectives on Domain-Specific English. Bern: Peter Lang. 229–255.
Tyrkkö, J. and T. Hiltunen (forthcoming), ‘Frequency of nominalization in Early Modern English medical writing’, in: A. Jucker, M. Hundt and D. Schreier (eds.) Corpora: Pragmatics and Discourse. Papers from the 29th International Conference on English Language Research on Computerized Corpora. Amsterdam: Rodopi. 293–316.
Vihla, M. (1998), ‘Medicor: A corpus of contemporary American medical texts’, ICAME Journal, 22: 73–80.
Wallis, F. (1995), ‘The experience of the book: manuscripts, texts, and the role of epistemology in early medieval medicine’, in: D. Bates (ed.) Knowledge and the Scholarly Medical Traditions. Cambridge: Cambridge University Press. 101–126.
Wear, A. (1998), Health and Healing in Early Modern England. Aldershot: Ashgate.
Comparing type counts: The case of women, men and -ity in early English letters
Tanja Säily (a) and Jukka Suomela (b)
(a) Research Unit for Variation, Contacts and Change in English (VARIENG), Department of English, University of Helsinki
(b) Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki
Abstract
This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic questions related to morphological productivity and type richness. In particular, we study the use of the suffixes -ity and -ness in the 17th-century part of the Corpus of Early English Correspondence within the framework of historical sociolinguistics. Our hypothesis is that the productivity of -ity, as measured by type counts, is significantly low in letters written by women. To test such hypotheses, and to facilitate exploratory data analysis, we take the approach of computing accumulation curves for types and hapax legomena. We have developed an open source computer program which uses Monte Carlo sampling to compute the upper and lower bounds of these curves for one or more levels of statistical significance. By comparing the type accumulation from women’s letters with the bounds, we are able to confirm our hypothesis.
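The general idea of comparing a group's type count with randomly sampled sets of texts can be illustrated with a toy sketch. This is not the authors' program: the function names and the miniature data are hypothetical, and the sketch simplifies by sampling the same number of letters as the group, rather than comparing accumulation curves at matching word counts.

```python
import random


def type_accumulation(texts):
    """Running (token count, type count) after each text, in the given order."""
    seen, words, curve = set(), 0, []
    for tokens in texts:
        words += len(tokens)
        seen.update(tokens)
        curve.append((words, len(seen)))
    return curve


def count_types(texts, indices):
    """Number of distinct word types in the texts selected by index."""
    types = set()
    for i in indices:
        types.update(texts[i])
    return len(types)


def lower_tail_p_value(texts, group_indices, n_samples=2000, seed=1):
    """Monte Carlo estimate of how often a random set of texts of the same
    number has at most as many types as the observed group."""
    rng = random.Random(seed)
    observed = count_types(texts, group_indices)
    k = len(group_indices)
    hits = sum(
        count_types(texts, rng.sample(range(len(texts)), k)) <= observed
        for _ in range(n_samples)
    )
    return observed, hits / n_samples


# Hypothetical data: each letter is the list of -ity tokens found in it.
letters = [
    ["curiosity", "necessity"],
    ["necessity"],
    ["civility", "necessity", "ability"],
    ["vanity"],
    ["necessity"],
    ["necessity"],
]
group = [1, 4, 5]  # indices of the letters in the group of interest

observed, p = lower_tail_p_value(letters, group)
print(type_accumulation(letters)[-1], observed, p)
```

A small estimated proportion would suggest that the group uses fewer types than expected by chance; with real data the comparison would need to control for the amount of running text, as the accumulation-curve approach does.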
1. Introduction

The linguistic case we study is as follows. We have two roughly synonymous suffixes, -ness and -ity, which are typically used for forming abstract nouns from adjectives, as in example (1).

(1) a. generous [ ] + -ness → generousness [ ]
    b. generous [ ] + -ity → generosity [ ]
The first suffix, -ness, is etymologically native, while -ity entered the language as a result of contact with French during the Middle English period, and was later reinforced by loans from Latin (Marchand 1969: 312–313). The foreignness of -ity can be readily discerned from the above example: it changes the form of its base from [ ] to [ ], whereas with -ness there is no change (but see Section 2.1). In addition, the meaning of words in -ity is often not entirely compositional, i.e., not deducible from the meanings of the base and the suffix. Thus, -ity is both (morpho)phonologically and semantically more opaque than -ness (cf. Riddle 1985: 443–444; Aronoff and Anshen 1998: 246).
We wish to compare the morphological productivity of the two suffixes, a concept famously defined by Bolinger (1948: 18) as “the statistically determinable readiness with which an element enters into new combinations”. More specifically, we wish to examine whether the productivity of each suffix varies between different sociolinguistic groups, as defined by Labovian sociolinguistic categories such as age, gender and social status. Many linguistic features show sociolinguistic variation, but to date this has been studied little in the case of morphological productivity, and not at all with the otherwise closely scrutinised pair of -ness and -ity.

Our data come from the 17th-century part of the Corpus of Early English Correspondence (1998; henceforth known as the CEEC). We have chosen personal letters as our material because they are one of the closest genres to speech, which is the primary medium of language and the most fertile ground for linguistic change (Nevalainen and Raumolin-Brunberg 2003: 28). This time period is interesting because it is to be expected that -ity would by this time have spread to wider use from the more literary genres in which it entered the language. Furthermore, a pilot study by Säily (2005) using the smaller Corpus of Early English Correspondence Sampler (1998) showed a gender difference in the use of -ity in letters of the 17th century.

We believe that -ity, as a learned and etymologically foreign suffix, is less productive with poorly educated social groups, such as women and the lower ranks, than with well-educated groups, such as men and the higher ranks. As to the productivity of -ness, we do not expect to find significant differences between social groups.

1.1 Objectives
The main measure of morphological productivity used in this study is that of type counts, i.e., how many different words in -ity and -ness are used by the different social groups. We seek to study the productivity of the suffixes -ity and -ness in our material by two complementary means:

1. Statistical hypothesis testing. We aim to formulate and test a hypothesis which captures our belief that gender is significant in the case of -ity.

2. Exploratory data analysis. Regardless of whether gender proves to be significant or not, we are interested in studying the correlation between productivity and a number of other variables, such as the age, domicile or social rank of the writers.

We present a unified approach which enables us to tackle both of these tasks.

1.2 Contributions
This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic
questions related to productivity and type richness. The basic techniques are standard but not widely used in the study of these questions – our hands-on report aims at promoting the use of these powerful tools. With this goal in mind, we have chosen to describe in detail one particular application of these techniques. The emphasis is on depth, not breadth: instead of side-tracking and discussing a number of alternative techniques at each point, we make particular choices and go through all the subtleties that need to be taken into account. We assume a basic knowledge of statistical hypothesis testing, but we have included an informal introduction to permutation tests. We take the approach of computing accumulation curves for types and hapax legomena (i.e., types that occur only once). In particular, we use Monte Carlo sampling to compute the upper and lower bounds of these curves for some predetermined levels of statistical significance. Once we have computed an accumulation curve, we can test a hypothesis by simply plotting a data point on the curve. Exploratory data analysis is equally straightforward, and we can also qualitatively study the shape of the accumulation curves. One of the main technical contributions is described in Section 5: we have developed a computer program which can be used to compute the curves. This is the only part of the method described here which is computationally intensive. In the implementation, the emphasis is on computational efficiency. The program is freely available under an open source licence. The results achieved by using these methods on our data are reported in Section 6. As we shall see, we can conclude that our hypothesis is true: the type richness of -ity is indeed significantly low in the subcorpus which consists of women’s texts. 
Exploratory data analysis reveals an unanticipated feature of the data: the type richness of -ity is also significantly low in the subcorpus which consists of the letters written in 1600–1639.

2. Background and related work

In this section, we justify the use of type counts for measuring morphological productivity, place the study in the framework of historical sociolinguistics, and review related work on using similar methods.

2.1 Type counts as a measure of morphological productivity
According to Dalton-Puffer (1996: 217), there is an obvious correlation between productivity and type counts: “a productive morphological rule produces many different words (types), and it is therefore likely that in a given corpus a productive suffix will occur more often than an unproductive one”. Type counts are by no means a perfect measure of productivity, however. As Cowie and Dalton-Puffer (2002: 416) point out, the existence of a large number of types may be due to aggregation through productivity in the past rather than current productivity. Furthermore, in the case of -ity, some words have been borrowed from French or Latin as a package including the suffix, with no productivity involved at all in English. This applies to the word generosity in our example (1): according to the Oxford English Dictionary (henceforth the OED), generosity has been in the language since about 1432 and is an adaptation of the Latin word generōsitāt-em.

Nevertheless, type counts are frequently used as a measure of productivity, for example by Baayen and Lieber, who call it the extent of use (1991: 818). This measure may not give us a full picture of the productivity of a suffix, but it can certainly be useful despite the above caveats about past productivity and borrowing. In addition, the impact of these caveats could be reduced by restricting the kinds of words that are counted. One possible restriction would be that the suffixed word must have had an extant base at the time when the material was written; another could be that the word must not have been in the language for, say, more than a century, as evidenced by its first attestation date in a major dictionary such as the OED (Cowie and Dalton-Puffer 2002: 419). These restrictions would increase the probability that the word in question was formed productively from suffix and base rather than retrieved as a whole from the mental lexicon of the writer.

For this study, however, we have elected to omit the above restrictions and count all words that etymologically contain the suffix in question – as noted by Plag (1999: 29), dropping out “non-productive formations” could mean prejudging the issue of whether the suffix is productive. The latter of the above restrictions at least would certainly be too limiting: To an individual user of the language, a word can be new even if it has been around in the language community for hundreds of years (cf. Baayen and Renouf 1996: 77), and thus even established words can be formed productively by users from the base and the affix.
In fact, even if an affixed word exists in the mental lexicon of the user, he or she may still end up forming it from its constituents, depending on how frequent the affixed word is compared with its base – Hay (2001) claims this is true for processing (e.g., when reading), but we think it holds for producing words as well. As for words with no extant base, they too may contribute to keeping the suffix productive, as they contain its form and meaning, and there is often an adjective related to the missing base that could be seen as the base by the user; see (2).¹ Various restrictions on type counts are explored in Säily (2008: 87–95).

(2) ambiguity ~ ambiguous + -ity

2.2 Historical sociolinguistics and morphology
The application of sociolinguistics to historical material is a fairly new approach: according to Nevalainen and Raumolin-Brunberg (2003: 2), the first systematic attempt at this was made by Suzanne Romaine in 1982. Nevalainen and Raumolin-Brunberg themselves are pioneers in this field, which is now called historical sociolinguistics. While morphology has been studied within this framework, research has so far concentrated on inflectional morphology such as the use of third-person -s vs. -th (Nevalainen and Raumolin-Brunberg 2003). Březina (2005) is a rare example of a study on the productivity of derivational prefixation from the perspective of historical sociolinguistics. To our knowledge, there have been no studies on suffixation from a similar perspective.

2.3 Methodology
Previous work on comparing the productivity of an affix between subcorpora often relies on the subcorpora being approximately the same size, so that for instance type counts obtained from each subcorpus can be compared directly. Then, if the type counts differ by an order of magnitude, it may be possible to draw conclusions without paying attention to statistical significance (e.g., Dalton-Puffer 1996: 106).

Empirically validated assumptions on modelling productivity have been made by, e.g., Baayen (1992, 1993). For example, the growth rate of the type accumulation curve has been approximated as the ratio between the number of hapax legomena and the total number of tokens with the affix (Baayen 1992: 115). Baayen (2001) studies both parametric and nonparametric models for the class of LNRE (large number of rare events) distributions, such as lexical frequency distributions. These models are based on the assumption that individual words appear randomly in texts; such modelling assumptions make it possible to extrapolate beyond observed sample size. For a recent study on the statistical models for the accumulation of types and hapax legomena, see Evert and Baroni (2005), and for related statistical software, see Evert and Baroni (2007).

Nonparametric methods similar to ours – in particular, Monte Carlo sampling of permutations – have been used in corpus linguistics to some extent. For example, Baayen (2001: 6–7, 24–32) computes Monte Carlo confidence intervals for the accumulation curves of some lexical characteristics. Permutations are generated at the level of individual words, which is consistent with the assumption that individual words appear randomly in texts. However, in many cases the observed values lie outside the confidence intervals (Baayen 2001: 6–7, 24–32; Tweedie and Baayen 1998: 335), indicating that the assumption of randomness causes bias in the results. Tweedie and Baayen (1998) address the bias by permuting words within a randomisation window.
Our approach is to leave the original discourse structure intact and permute only large parts of the corpus. Analogous research questions arise and similar methods can be used in studies of biodiversity in the field of ecology, to enable comparisons of species richness in different areas (see, e.g., Gotelli and Colwell 2001). Our text length corresponds to their number of individual animals; our number of types to their number of observed animal species; our two subcorpora of men and women to their different areas; and our type accumulation curves to their species accumulation curves.
3. Material
Our material in this study comes from the 17th-century part of the 2.7-million-word Corpus of Early English Correspondence (1998 version). The CEEC is an electronic collection of 6,039 letters composed by 778 writers between the years 1410?–1681. It was compiled by Terttu Nevalainen (team leader), Jukka Keränen, Minna Nevala (née Aunio), Arja Nurmi, Minna Palander-Collin and Helena Raumolin-Brunberg. Due to a lack of resources for transcribing and editing, the corpus is based on published editions of letters; however, some of the material has been checked against the originals by members of the CEEC team. The CEEC is designed for studying the English language – more specifically, English English – in its socio-historical context. To this end, the writers have been carefully selected to give as balanced a representation of different social categories as possible. Nevertheless, the dominance of men from the upper ranks has been unavoidable: they were the most literate group, they were considered important enough that their letters were preserved, and their letters were later considered important enough to be published.
Figure 1: Running words written by men vs. women in the CEEC, 1600–1681 (bar chart; y axis: running words; bars for men and women in the periods 1600–1639 and 1640–1681)

The 17th-century part of the CEEC consists of 1.4 million words covering the years 1600–1681. Unfortunately, only about a quarter of this material was written by women, as can be seen from Figure 1. The situation between different ranks, regions, etc. is similarly imbalanced. Example (3), from a letter written in 1654 by Dorothy Osborne, illustrates the raw material in the corpus (emphases added).

(3) … to Visett a place you are soe much concern’d in, and to bee a wittnesse your selfe of the probabillity of your hopes though I will beleive you need noe other inducement to this Voyage then … (A 1654 FN DOSBORNE 130:Heading)
For the purposes of this work, we have divided the corpus into samples, each consisting of one person’s letters from a 20-year period in the corpus: 1600–1619, 1620–1639, 1640–1659, and 1660–1681. As an example, all letters in the corpus that were written by Dorothy Osborne in 1640–1659 form a sample called DOSBORNE-1640.

3.1 Input data
The instances of -ity and -ness were extracted from the corpus using the WordCruncher program. Since the corpus was unlemmatised, and the grammatically tagged version was not yet available, this had to be done by searching for all word-forms which had a suitable ending. Different spelling variants of the suffixes were collected from the OED, the Middle English Dictionary (MED) and by browsing the corpus itself, after which they were used one by one in WordCruncher searches. Some of these variants, such as -nes, yielded a vast number of erroneous results, because many other words besides those having the suffix ended in that way, such as plurals of words ending in -n. These had to be weeded out by hand.

A combination of manual work and Perl scripts was used to produce a computer-readable list enumerating all instances of the suffixed words in a normalised form for each sample. The word probabillity in example (3) counts as one instance of the normalised form probability in the sample DOSBORNE-1640. There was a total of 94 occurrences of -ity in this sample, and they were instances of 31 different normalised forms, shown in example (4). Thus, we say that the number of -ity tokens is 94 and the number of -ity types is 31 for the sample DOSBORNE-1640.

(4) antiquity authority calamity charity civility commodity conformity contrariety curiosity equality extremity formality gravity importunity impossibility infirmity insensibility necessity nobility opportunity piety possibility probability quality quantity reality severity society university vanity variety
The information extracted from the corpus can be summarised as two incidence matrices, one for -ity and another for -ness. Each row of a matrix corresponds to one sample and each column corresponds to one type. The element at row i and column j indicates the number of occurrences of type j in sample i. The sum of the elements on row i equals the number of tokens in sample i, and the number of nonzero elements on row i equals the number of types in sample i. This is exemplified for -ity in Table 1.
Table 1: Part of the matrix representation of -ity

                   …  contrariety  credulity  curiosity  …  probability  …
…
ASTUART-1600               0           0          1              0
DOSBORNE-1640              1           0          4              1
SPEPYS-1660                0           1          0              1
…
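As a minimal sketch of this representation, the following Python fragment builds an incidence matrix from a list of (sample, normalised form) pairs, one pair per occurrence. The function name and the input list are hypothetical and are not taken from the authors' scripts; the counts reproduce only the four columns excerpted in Table 1, not the full data for these samples.

```python
from collections import Counter

# Hypothetical input: one (sample, normalised form) pair per occurrence.
# The counts mirror the excerpt in Table 1 only, not the whole corpus.
instances = (
    [("ASTUART-1600", "curiosity")]
    + [("DOSBORNE-1640", "contrariety")]
    + [("DOSBORNE-1640", "curiosity")] * 4
    + [("DOSBORNE-1640", "probability")]
    + [("SPEPYS-1660", "credulity"), ("SPEPYS-1660", "probability")]
)


def build_incidence_matrix(pairs):
    """Return (samples, types, matrix); matrix[i][j] counts type j in sample i."""
    counts = Counter(pairs)
    samples = sorted({sample for sample, _ in pairs})
    types = sorted({form for _, form in pairs})
    matrix = [[counts[(s, t)] for t in types] for s in samples]
    return samples, types, matrix


samples, types, matrix = build_incidence_matrix(instances)

# Row sums give the tokens per sample; nonzero cells give the types per sample.
tokens_per_sample = [sum(row) for row in matrix]
types_per_sample = [sum(1 for c in row if c > 0) for row in matrix]
```

In this excerpt the row for DOSBORNE-1640 sums to 6 tokens over 3 types; in the full matrix the same row would sum to the 94 tokens and 31 types reported above.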
The number of running words was counted for each sample; for DOSBORNE-1640, the number of running words is 71,299 – the number of distinct words in the sample is not needed in our study. Sociolinguistic information on each person was retrieved from an auxiliary database; this included gender, domicile and social rank. For DOSBORNE-1640, the gender is ‘female’, the domicile is ‘other’ and the social rank is ‘gentry upper’. Our incidence matrices for -ness and -ity are freely available for download (Säily and Suomela 2007).

3.2 Characteristics of the input data
The total number of samples in the corpus is 412, of which 112 consist of letters written by women. The total number of different types of -ity in the corpus is 192 and the total number of different types of -ness is 312. The relative sizes of the samples are illustrated in Figures 2 and 3. In the figures, samples from men are represented by white boxes, while samples from women are grey diamonds. The size of the symbol is in proportion to the number of running words in the sample. The largest samples are labelled, including DOSBORNE-1640 with 71,299 running words, and ASTUART-1600, Arabella Stuart’s letters written in 1600–1619, with 30,472 running words. Figure 2 presents the samples ordered by the number of -ity types they contain per -ity tokens. As noted above, there are 31 -ity types and 94 -ity tokens in the sample DOSBORNE-1640. Figure 3 presents the same information for -ness types. For example, there are 46 -ness types and 188 -ness tokens in DOSBORNE-1640. As can be seen from the figures, the size of the samples varies widely; there are many samples with very few tokens and types, and a few samples with very many tokens and types. From these figures we may observe, e.g., that while DOSBORNE-1640 includes more -ness types and tokens than any other sample, there are many samples from men which have a larger number of -ity types than this sample.
Figure 2: Samples ordered by the number of -ity types per -ity tokens (scatter plot; x axis: suffix tokens; y axis: types; labelled samples: JCHAMBERLAIN-1600, HMORE-1660, ASTUART-1600, JJONES-1640, JHOLLES-1600, WPETTY-1660, HOXINDEN-1640, DOSBORNE-1640, JHOLLES-1620, SPEPYS-1660, TWENTWORTH-1620, TKNYVETT-1640, CLOWTHER-1620, BELIZABETH-1640, AANTONIE-1600)

Figure 3: Samples ordered by the number of -ness types per -ness tokens (scatter plot; x axis: suffix tokens; y axis: types; the same samples are labelled as in Figure 2)
4. Methods
We are interested in comparing the productivity of a suffix between different subcorpora which consist of several samples, for example, all letters written by women. Our primary measure of productivity is the number of types. In the previous section we defined type counts for samples; this extends naturally to a whole subcorpus. As an alternative measure of productivity, we consider the number of hapax legomena. In precise terms, the measures are as follows; here we use the case of -ity as an example.

(a) Number of types. This is the number of different types of -ity which occur in the subcorpus at least once. For example, if the subcorpus contains occurrences of the word generosity (no matter how many times, regardless of the spelling) and no other -ity words, the number of types is 1.

(b) Number of hapax legomena or hapaxes. This is the number of different types of -ity which occur in the subcorpus exactly once. For example, if the subcorpus contains only one occurrence of the word instability, one occurrence of the word capability, four occurrences of the word generosity (in various spellings) and no other -ity words, the number of hapaxes is 2.
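Both measures can be computed by adding up the per-sample counts of each type over the subcorpus. The Python sketch below illustrates this under the assumption that the subcorpus is given as a list of rows of counts, one row per sample and one column per type; the function name and the toy data are hypothetical.

```python
def type_and_hapax_counts(rows):
    """Return (types, hapaxes) for a subcorpus given as rows of per-type counts."""
    # Column sums: total occurrences of each type in the whole subcorpus.
    totals = [sum(column) for column in zip(*rows)]
    n_types = sum(1 for c in totals if c > 0)     # measure (a)
    n_hapaxes = sum(1 for c in totals if c == 1)  # measure (b)
    return n_types, n_hapaxes


# Toy subcorpus of two samples; columns: instability, capability, generosity.
toy_rows = [
    [1, 0, 3],
    [0, 1, 1],
]
n_types, n_hapaxes = type_and_hapax_counts(toy_rows)
```

With the toy data, which matches the example under measure (b), the subcorpus has three types, of which two (instability and capability) are hapaxes. Note that generosity occurs only once in the second sample but is not a hapax in the subcorpus as a whole.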
If we view the subcorpus as a matrix where the element at row i and column j indicates the number of occurrences of type j in sample i (recall Table 1), we can give the following equivalent definitions. Form a vector v by adding up all rows of the matrix. Then the number of types is the number of nonzero elements in v, and the number of hapaxes is the number of elements equal to 1 in v.

4.1 Comparing productivity between subcorpora
The measures we defined above have an obvious drawback: they are sensitive to the size of the subcorpus. In our material we have 80 types of -ity in the texts written by women and 183 types of -ity in the texts written by men; however, we cannot immediately say that the type richness of women’s texts is lower, as we have much more material from men (see Figure 1). Furthermore, the relation between the size of the subcorpus and the number of types occurring in it is not necessarily linear. Put simply, at the very beginning of the type accumulation curve, each -ity word is likely to be new, but later we are more likely to meet -ity words which have already occurred in the corpus. With hapaxes, the measure might even decrease as the size of the subcorpus increases. We shall see practical examples of the nonlinear behaviour throughout this work (e.g., Figures 4, 6 and 8). Therefore, attempts to normalise the number of types by, say, dividing by the number of running words are not justifiable (cf. Gotelli and Colwell 2001). Indeed, such attempts give completely misleading results with our data. For example, the number of -ity types per 100,000 running words is approximately 23.5
for women and 17.6 for men in our material. It would appear that the type richness is higher for women, even though the opposite is the case, as we shall see.

We might be able to tackle the problem by making further modelling assumptions on the process which generates the text; we might, for example, assume that the occurrences of the words are independent, and we could then use the input data to estimate the probabilities of each person producing a particular word; this way we could compare the productivity of different persons. However, we are reluctant to make such simplifying assumptions, as the choice of words may have subtle dependencies on the textual context (see, e.g., Baayen 2001: 163).

We take a somewhat extreme approach in assuming nothing. Instead of trying to compare subcorpora of different sizes, we only assume that we can compare subcorpora of equal sizes. We use the following alternative definitions for equal size:

(i) The same number of running words.

(ii) The same number of -ity tokens.

For most of this work we focus on definition (i) in conjunction with measure (a), i.e., the number of types. Other combinations may also be of interest, and we can experiment with them by using the same general approach and the same tools. For example, if we use measure (b) and definition (ii), we compare the number of -ity hapaxes in subcorpora with the same number of -ity tokens. Equally well, we could compare the ratios between -ity hapaxes and -ity tokens, arriving at Baayen’s (1992: 115) definition.

4.2 Statistical significance
We are not interested in merely noticing that a particular subcorpus has a lower number of types in comparison with another subcorpus. We are interested in differences which are statistically significant; informally, not likely to be mere random artefacts of the data. We now review some basics of statistical hypothesis testing and apply the ideas to our problem. Let us choose the measure of productivity (a), the number of types, and say that we are willing to compare only subcorpora which are equal by definition (i), the number of running words. The idea that women are significantly less productive than men in this material is captured as follows. Let n be the number of running words in the subcorpus which consists of the texts written by women and let t be the number of types in this subcorpus. Hypothesis. Gender is significant. For a subcorpus with n running words, t is a particularly low number of types. The null hypothesis is that there is no connection between the number of types and gender; the effect is caused by chance. More formally, the null hypothesis is
that the numbers of running words and the rows of the incidence matrices for men and women are samples from the same population.

Intuitively, the null hypothesis suggests that the subcorpus of texts written by women could be constructed through the following process. We randomly pick samples from the corpus, labelling them as having been written by women, until the subcorpus we have accumulated is of size n; the rest of the corpus is then labelled as having been written by men. We emphasise that our samples consist of complete letters. We need not assume that the words within each letter are independent of the context; we only assume that samples as a whole are interchangeable under the null hypothesis.

We can test the hypothesis by estimating how likely it is that a subcorpus constructed in this way has as few as t types (we apply one-sided testing here). If this turns out to be very unlikely, say, happening on average only once in 100 trials, we reject the null hypothesis and accept the original hypothesis, with p = 0.01.

There is a subtlety: as we work at the granularity of samples, and the sizes of the samples vary, it may be that very few labellings – maybe just the original labelling – produce a subcorpus with exactly n running words. In practice, we make a minor adjustment. Informally, we consider subcorpora with at least n running words and not many more than that; making the subcorpus longer certainly cannot have a negative bias on the number of types. The case of hapaxes is more complicated; we come back to this issue in Section 5.3.

4.3 Permutation testing
Now, we have formalised our hypothesis and we are ready to do standard hypothesis testing – all we need to do is estimate the probability p of obtaining such an extreme case as at most t types in a subcorpus with n running words. As we are dealing with type counts, we do not have a simple mathematical formula for calculating p: the probability depends not only on summary information such as the values t and n but on the full incidence matrix. Therefore, we use techniques from permutation testing (see, e.g., Good 2005).

Applied to our problem in a straightforward manner, the basic idea would be as follows. We take the intuitive idea of picking samples in a random order quite literally. The order in which we pick the samples forms a permutation (reordering) of the samples. To calculate the probability p, we need to calculate the percentage of permutations which have at most t types in the first n running words. We generate all permutations of the samples, check which of them satisfy this condition, and compute the percentage p. The next section adapts this basic idea to our needs.

5. Implementation
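The straightforward test of Section 4.3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' program: instead of generating all permutations it draws random ones, and it implements the "at least n running words" adjustment of Section 4.2 by taking the shortest prefix of each permutation that reaches n running words. The function names, parameters and toy data are hypothetical.

```python
import random


def types_in_prefix(order, sizes, rows, n):
    """Types seen in the shortest prefix of `order` with at least n running words."""
    seen = set()
    words = 0
    for i in order:
        if words >= n:
            break
        words += sizes[i]
        seen.update(j for j, count in enumerate(rows[i]) if count > 0)
    return len(seen)


def permutation_p_value(sizes, rows, n, t, trials=10000, seed=0):
    """Estimate the share of permutations with at most t types in the first n words.

    sizes[i] is the number of running words in sample i and rows[i] is the
    corresponding row of the incidence matrix.
    """
    rng = random.Random(seed)
    order = list(range(len(sizes)))
    hits = 0
    for _ in range(trials):
        rng.shuffle(order)  # uniform random permutation of the samples
        if types_in_prefix(order, sizes, rows, n) <= t:
            hits += 1
    return hits / trials


# Toy data: four samples of 10 running words each, five types in total.
toy_sizes = [10, 10, 10, 10]
toy_rows = [
    [1, 0, 0, 0, 0],
    [2, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
]
p_estimate = permutation_p_value(toy_sizes, toy_rows, n=20, t=1, trials=20000)
```

In the toy data, a subcorpus of 20 running words has at most one type only when it consists of the first two samples, so the estimate should be close to the exact value 1/6.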
Standard permutation testing would indeed suffice if all we were interested in was testing one hypothesis. However, we are also interested in exploratory data analysis. We want to consider several variables besides gender and see if they
correlate with the number of types. Ideally, we would prefer to avoid repeating extensive computations between each experiment. We also wish to gain more understanding on the accumulation of types as a function of corpus size. We can address all of these requirements by calculating type accumulation curves similar to that shown in Figure 4. This is the output generated by the computer program that we present in this section. First we describe how to interpret and use these curves; then we discuss the implementation which is used to compute the curves.
Figure 4: Bounds for -ity types as a function of the number of running words (x axis: running words, in millions; y axis: types; bounds shown for p < 0.1, p < 0.01, p < 0.001 and p < 0.0001)

Figure 4 shows upper and lower bounds for the number of -ity types. On the x axis, we have the number of running words in the subcorpus. The bounds are plotted for various levels of statistical significance. For example, the solid black curve corresponds to the level p = 0.01; the lower bound for, say, 600,000 running words at this level is 123, and the upper bound at this level is 163. This can be interpreted as follows: in all permutations of the samples that we can construct from the whole corpus, less than 1% have fewer than 123 -ity types within the first 600,000 running words, and less than 1% have more than 163 -ity types within the first 600,000 running words. The p values here refer to a one-sided test; for a two-sided test, the p values need to be doubled.

Once we have computed the curves, we can immediately use them for hypothesis testing, in a very straightforward manner: we simply plot the data point that corresponds to the subcorpus of interest on these curves and see whether the point lies, for example, below the lower bound. If so, we conclude that the number of types is significantly low for a subcorpus of this size. This is merely an (indirect) application of a permutation test.
An example of this is shown in Figure 5. In the subcorpus which consists of the letters written by women, we have 340,116 running words and only 80 -ity types. In the subcorpus which consists of the letters written by men, we have 1,038,951 running words and 183 -ity types. We have plotted both data points on top of the curves already shown in Figure 4. We note that the data point which corresponds to women’s texts lies below the lower bound with p = 0.001. We conclude that it is highly unlikely to come up with such a collection of samples by chance; our main hypothesis is true. We come back to the analysis of the results in Section 6.
Figure 5: Hypothesis testing. Women have significantly few -ity types (the data points for men and women plotted on the bounds of Figure 4; x axis: running words, in millions; y axis: types)

As we shall see, calculating the curves requires some amount of computation. However, once we have done the computation, we can use the same curves repeatedly to answer various questions. We can test other similar hypotheses easily by plotting more data points on top of the curves. Indeed, we can do exploratory data analysis by plotting data points corresponding to each possible value of each sociolinguistic category, such as gender, domicile, social rank, and time period. We shall see examples of this in Section 6. We can also analyse the curves qualitatively: the shape of Figure 4 illustrates the nonlinear relation between the size of the subcorpus and the number of types occurring in it.

Finally, we can calculate similar curves for measure (b), hapaxes, and we can also consider definition (ii), which means that the x axis shows the number of -ity tokens instead of the number of running words in the subcorpus. See Figure 6 for an example.
Figure 6: Bounds for -ity hapaxes as a function of the number of -ity tokens (y axis: -ity hapaxes; x axis: suffix tokens; bounds shown for p < 0.1, 0.01, 0.001 and 0.0001)
5.1 Basic algorithm
We proceed to present the operation of the computer program. The program performs the computations in two steps. The first step essentially tabulates for each pair (t, n) an approximation of the number of permutations such that there are exactly t types within the first n running words. The second step uses the table to find for each value of n those values of t at which we cross the significance levels of interest, such as p = 0.01 and p = 0.001. The first step is computationally more intensive. It consists of generating a large number of random permutations of the samples – typically, the number of permutations is in the range of tens of thousands to millions. For each permutation, we process the samples one by one, in the order indicated by the permutation. For each new sample, we compute the total number of types observed so far. Each permutation can be interpreted as a type accumulation curve, similar to the two examples illustrated in Figure 7; in the figure, each tick mark corresponds to one sample. Once we have a complete accumulation curve, we increment the counters in the table for each pair (t, n) through which it passes. This is repeated for each permutation, after which we can perform the second step.
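The two steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' C program: the toy samples are invented, the function names are hypothetical, the table is filled only at chosen checkpoints n rather than for every (t, n), and the types of a sample are counted only once the whole sample has been read.

```python
import random
from collections import Counter


def accumulation_curve(samples, order):
    """Return [(n, t), ...]: running words n and distinct types t after each sample."""
    seen, n, curve = set(), 0, []
    for i in order:
        words, types = samples[i]
        n += words
        seen |= types
        curve.append((n, len(seen)))
    return curve


def types_at(curve, n):
    """Types contributed by the samples that are complete after n running words."""
    t = 0
    for words_so_far, types_so_far in curve:
        if words_so_far > n:
            break
        t = types_so_far
    return t


def tabulate(samples, checkpoints, n_perm, rng):
    """Step 1: for each checkpoint n, count the permutations that yield each t."""
    table = {n: Counter() for n in checkpoints}
    order = list(range(len(samples)))
    for _ in range(n_perm):
        rng.shuffle(order)  # uniform random permutation of the samples
        curve = accumulation_curve(samples, order)
        for n in checkpoints:
            table[n][types_at(curve, n)] += 1
    return table


def bounds(counts, n_perm, p):
    """Step 2: largest lower and smallest upper bound whose outer tail is below p."""
    t_min, t_max = min(counts), max(counts)
    lower, below = t_min, 0
    for t in range(t_min, t_max + 1):
        if below / n_perm < p:  # `below` = permutations with fewer than t types
            lower = t
        below += counts[t]
    upper, above = t_max, 0
    for t in range(t_max, t_min - 1, -1):
        if above / n_perm < p:  # `above` = permutations with more than t types
            upper = t
        above += counts[t]
    return lower, upper


# Invented toy data: (running words, set of -ity types) for each sample.
SAMPLES = [
    (120, {"ability", "charity"}),
    (300, {"charity", "necessity", "quality"}),
    (80, {"quality"}),
    (200, {"ability", "curiosity", "vanity"}),
    (150, {"necessity"}),
    (250, {"dignity", "extremity"}),
]

table = tabulate(SAMPLES, checkpoints=[500], n_perm=20_000, rng=random.Random(0))
print(bounds(table[500], 20_000, 0.01))
```

With so few samples the bounds are necessarily coarse; the point of the sketch is only the division of labour between tabulating the permutations and reading the bounds off the table.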
Figure 7: Two type accumulation curves. Each tick mark represents the addition of one sample (y axis: -ity types; x axis: running words in millions)
5.2 Computational complexity
In the first step, we employ a randomised algorithm to approximate the number of permutations for each (t, n). This is an application of the Monte Carlo method (Mitzenmacher and Upfal 2005: 252), in which one picks a number of objects at random from a suitable probability distribution, checks which percentage of them satisfies the desired properties, and derives an estimate of the total number of such objects. By increasing the number of objects that we choose, we can improve the accuracy of the estimate. As is usual in an implementation of permutation testing (Good 2005: 233), we choose a particularly simple probability distribution, the uniform distribution over all permutations; therefore, we can pick a random permutation by using a simple algorithm for randomly shuffling a list.
By resorting to a randomised approximation algorithm, we have sacrificed some accuracy. This is acceptable, as we only need the first few decimals of the probability p. Approximation is in any case unavoidable, because it is not likely that there exists any efficient algorithm for, say, computing the exact number of permutations which traverse through a given point (t, n). Even determining whether the number is more than zero is hard: this is a generalisation of the SET COVER problem, which belongs to the class of NP-complete problems, and it is generally believed that no efficient algorithm exists for any problem that is NP-complete (see, e.g., Garey and Johnson 2003 [1979]).
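As a rough illustration of the two points made above, the following sketch shows a standard way of drawing a uniformly random permutation (the Fisher-Yates shuffle) and the usual binomial standard error of a Monte Carlo probability estimate. Both are textbook techniques chosen for this illustration; the text does not say which shuffling algorithm the authors' program uses, and the function names are hypothetical. The permutation counts are the two values reported in Section 5.4.

```python
import math
import random


def fisher_yates(items, rng):
    """Return a uniformly random permutation of items (Fisher-Yates shuffle)."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive at both ends
        a[i], a[j] = a[j], a[i]
    return a


def standard_error(p, n_perm):
    """Binomial standard error of a Monte Carlo estimate of a probability p."""
    return math.sqrt(p * (1 - p) / n_perm)


print(fisher_yates(range(10), random.Random(0)))
for n_perm in (20_000, 1_000_000):
    print(n_perm, standard_error(0.01, n_perm))
```

For a true tail probability of 0.01, the standard error is about 0.0007 with 20,000 permutations and about 0.0001 with 1,000,000, which is consistent with the remark that only the first few decimals of p are needed.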
5.3 Implementation details
Next we address the fact that we only have data at the granularity of entire samples. Put simply, based on our input data, we do not know whether the occurrences of the types are at the beginning or the end of the sample; if we are interested in knowing the exact value of t for some n which happens to be in the middle of a sample, we do not know whether we would have already met the new types of this sample by n running words or not. Therefore, our program adopts a safe approach: it always considers the worst case for us and the most favourable case to the null hypothesis, i.e., the case which produces the widest confidence intervals. Finding the worst cases for the number of types is straightforward. For lower bounds, we can proceed as if all types were clustered at the very end of the sample, and for upper bounds we can assume the opposite. The case of hapaxes is more involved, as we need to distinguish between several cases: (a) newly created hapaxes, i.e., types which have not occurred before this sample and which occur only once in this sample; (b) temporary hapaxes, i.e., types which have not occurred before this sample and which occur more than once in this sample; and (c) removed hapaxes, i.e., types which have occurred exactly once before this sample and which occur at least once in this sample. For lower bounds, the worst case is that the types of class (c) occur at the very beginning of the sample, cancelling previously known hapaxes. For upper bounds, the worst case is that the types of class (a) and one instance of each type of class (b) occur at the very beginning of the sample, increasing the number of hapaxes at least temporarily. To develop a program which is computationally efficient in terms of time and memory requirements, we need to address some further issues. First, while the range of possible values of t is typically moderate, the range of possible values of n can be large; in our data, we have more than one million running words. 
A table storing the number of permutations for every pair (t, n) at this resolution would be impractically large. We can significantly improve performance by dividing the n dimension into a smaller number of slots; for example, we can interpret the range from n = 0 to n = 4,999 as one slot, the following 5,000 running words as another slot, and so on. The approach of using slots is combined with the approach of finding worst-case bounds. Therefore, the slots can be used safely: they do not introduce any artefacts in the curves which would make some finding seem statistically significant if this is not the case. Naturally, using very large slots may prevent one from finding even statistically significant results. To further improve performance, the computations in the first step use a data layout in which each element requires only 1 or 2 bits of storage: for types, the single bit stands for “at least 1”; for hapaxes, one bit stands for “at least 1” and the other for “at least 2”. The input is pre-processed into an incidence matrix which is stored in this compact format, and the table containing the counts for each slot is also stored in this manner. The compact memory layout is cache-friendly and allows us to exploit bit-parallelism in the calculations.
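The combination of slots, worst-case bounds and a one-bit-per-type layout can be sketched for type counts as follows. This is one conservative reading of the description above, not the authors' implementation: each sample's types are held in a Python integer used as a bitmask, the lower bound of a slot assumes that the new types of an unfinished sample sit at its very end, and the upper bound assumes that they sit at its very beginning. The function names and the toy data are invented, and hapaxes are not covered.

```python
from bisect import bisect_left, bisect_right

SLOT = 5_000  # slot width in running words, as in the example in the text


def popcount(mask):
    """Number of set bits, i.e. number of distinct types in a bitmask."""
    return bin(mask).count("1")


def slot_bounds(samples, order, slot=SLOT):
    """Worst-case (lower, upper) type counts per slot for one permutation.

    samples: list of (running words, bitmask); bit k is set if type k occurs.
    """
    ends, counts = [0], [0]  # running words and types after each sample
    seen = 0
    for i in order:
        words, mask = samples[i]
        seen |= mask  # one OR merges a whole sample into the accumulated set
        ends.append(ends[-1] + words)
        counts.append(popcount(seen))
    total = ends[-1]
    result = []
    for first in range(0, total + 1, slot):
        last = min(first + slot - 1, total)
        # Lower bound: only samples that are complete at the start of the slot.
        lo = counts[bisect_right(ends, first) - 1]
        # Upper bound: every sample that has begun by the end of the slot.
        hi = counts[bisect_left(ends, last)]
        result.append((lo, hi))
    return result


# Invented toy data: three samples over four types (bits 0 to 3).
SAMPLES = [(3_000, 0b0011), (4_000, 0b0110), (5_000, 0b1000)]
print(slot_bounds(SAMPLES, order=[0, 1, 2]))
```

Because each slot reports the smallest and largest type count that could occur anywhere inside it, wider slots can only widen the bounds, which matches the remark that slots cannot manufacture significance but may hide it.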
The program is written in standard C (ISO/IEC 9899:1999); it should compile and run on any standard-compliant platform. The only essential limitation on the size of the input data is the amount of available memory. Parameters such as the number of iterations and the slot size can be set by using command line switches.
5.4 Performance
The following example illustrates the typical performance of the program. In our input data for the suffix -ity, we had 412 samples and 192 different types of -ity. We used slots of 5,000 running words each; this resulted in 277 slots. We ran the experiments on a desktop PC with a 2.4-GHz Pentium 4 processor, under the Linux operating system; the application was compiled using the C compiler from the GNU Compiler Collection (GCC). We experimented with two different numbers of permutations: 20,000, which is suitable for getting a quick idea of whether there are any statistically significant results in view, and 1,000,000, which is more than enough to produce publication-quality illustrations such as those presented in this work. The running time for computing the type accumulation curves was 1.7 seconds for 20,000 permutations and 82 seconds for 1,000,000 permutations. The running time for computing the hapax accumulation curves was 2.3 seconds for 20,000 permutations and 113 seconds for 1,000,000 permutations.
5.5 Using the implementation
The computer program described in this section is freely available under an open source license (GNU General Public License, version 2.0 or later). For details on obtaining and using the program, see Suomela (2007). Both the input and the output of the program are plain text files. The program accepts as input data matrices similar to those illustrated in Table 1. The input files can be prepared manually or, as we have done, by using corpus-specific tools. The output consists of the numerical data for curves similar to those in Figure 4. Tools such as statistical software packages or spreadsheets can be used to visualise the results. With our program, we provide a script which illustrates how to draw graphs similar to Figure 4 by using R, the free software environment for statistical computing (R Development Core Team 2007).
As stated above, the program is only needed for computing the upper and lower bounds for type accumulation, and such computation needs to be performed only once for a given data set. In the following section, we use the bounds for both hypothesis testing and exploratory data analysis.
6. Results and conclusions
Our hypothesis was that gender is significant in the case of -ity; as seen from Figure 5, this turned out to be the case. The richness of -ity types is significantly
low (p < 0.001) in women’s letters in the 17th-century part of the CEEC. Naturally, the 17th-century part of the CEEC is not a perfect representation of 17th-century English; neither are type counts a perfect measure of morphological productivity. Nevertheless, a result which is statistically this significant demands an explanation, and we argue that an attractive candidate can be found through examining the socio-historical situation in 17th-century England (see, e.g., Wrightson 1993). As women’s access to education was severely restricted, they would not have had the competence to use the learned and etymologically foreign suffix -ity to the same extent as men. The situation for -ness is shown in Figure 8. Here the data points for both men and women fall between the upper and lower bounds, and we cannot draw a similar conclusion on the significance of gender.
Figure 8: Bounds for -ness types as a function of the number of running words (y axis: -ness types; x axis: running words in millions; data points: men, women; bounds shown for p < 0.1, 0.01, 0.001 and 0.0001)
Finally, we explore some other sociolinguistic categories. Subcorpora based on the domiciles of the informants show no significant results. As for social rank, we might have expected to find a significantly low level of productivity for -ity in the lowest ranks, but there is simply too little data from them in the corpus. A more interesting case comes up when we divide the corpus into time periods: letters written in 1600–1639, and those written in 1640–1681. Figure 9, based on the same set of curves as Figure 5, shows that the type richness of -ity is significantly low in the earlier period. One interpretation for this could be that there is a linguistic change in progress: in the course of the 17th century, the use of -ity becomes more common in personal letters. This makes sense – not only was the use of Latinate features socially stratified (they were mostly used by learned men), but it was also register-specific, and began to spread from more
formal contexts to less formal ones during the 16th and 17th centuries (cf. Nevalainen and Tieken-Boon van Ostade 2006: 281–282; Riddle 1985: 455–456). The above examples illustrate the ease with which we can do exploratory data analysis once we have computed the bounds of the type accumulation curves. Even with a relatively small corpus, we were able to not only confirm our hypothesis but also discover unanticipated linguistically interesting results. The bounds for hapax counts turned out to be too wide for significant differences to emerge (see Figure 6). It may be that this measure requires more data to become usable. However, if the problem of wide bounds for hapax accumulation curves persists in larger corpora, this could call into question the use of hapax-based productivity measures in general.
Figure 9: Subcorpora based on time periods (y axis: -ity types; x axis: running words in millions; data points: 1600–1639 and 1640–1681; bounds shown for p < 0.1, 0.01, 0.001 and 0.0001)
In addition to testing hapax accumulation in larger corpora, future work could include a comparison between our type accumulation curves and those derived from more widely used parametric models. Another opportunity for future research would be a more fine-grained investigation of the differences between men and women in the use of the suffix -ity: as pointed out by an anonymous reviewer, part of the differences observed in this study could be due to women writing about a more restricted set of topics, which may lead to a large vocabulary overlap between women. As noted in Section 4.1, our work focuses on definition (i) of corpus size – in our type accumulation curves, the x axis is the number of running words in the corpus. Another possibility would have been to compute type accumulation as a function of suffix tokens. Further work is needed in order to better understand the
interplay between the number of running words, the number of affix tokens, and the number of affix types in the context of productivity.
Acknowledgements
We thank Harald Baayen, Terttu Nevalainen, the audience at ICAME 28 and the members of VARIENG for discussions and comments, and anonymous reviewers for their helpful feedback. The database of sociolinguistic information used in the study was compiled by Arja Nurmi. This research was supported in part by the Academy of Finland Centre of Excellence funding for the Research Unit for Variation, Contacts and Change in English (VARIENG) at the Department of English, University of Helsinki, and the Helsinki Graduate School in Computer Science and Engineering (Hecse).
Notes
1 As noted by an anonymous reviewer, this particular example could also be regarded as an instance of affix substitution. This provides an even stronger motivation for not leaving out these kinds of words.
References
Aronoff, M. and F. Anshen (1998), ‘Morphology and the lexicon: Lexicalization and productivity’, in: A. Spencer and A.M. Zwicky (eds.) The Handbook of Morphology. Cambridge, MA: Blackwell Publishers. 237–247.
Baayen, R.H. (1992), ‘Quantitative aspects of morphological productivity’, in: G. Booij and J. van Marle (eds.) Yearbook of Morphology 1991. Dordrecht: Kluwer Academic Publishers. 109–149.
Baayen, R.H. (1993), ‘On frequency, transparency and productivity’, in: G. Booij and J. van Marle (eds.) Yearbook of Morphology 1992. Dordrecht: Kluwer Academic Publishers. 181–208.
Baayen, R.H. (2001), Word Frequency Distributions. Dordrecht: Kluwer Academic Publishers.
Baayen, R.H. and R. Lieber (1991), ‘Productivity and English derivation: A corpus-based study’, Linguistics, 29: 801–843.
Baayen, R.H. and A. Renouf (1996), ‘Chronicling the Times: Productive lexical innovations in an English newspaper’, Language, 72 (1): 69–96.
Bolinger, D.L. (1948), ‘On defining the morpheme’, Word, 4: 18–23.
Březina, V. (2005), The Development of the Prefixes un- and in- in Early Modern English with Special Regard to the Sociolinguistic Background, unpublished MA thesis, Faculty of Arts, Charles University in Prague.
CEEC = Corpus of Early English Correspondence (1998), compiled by the Sociolinguistics and Language History project team (T. Nevalainen, J. Keränen, M. Nevala, A. Nurmi, M. Palander-Collin, H. Raumolin-Brunberg) at the Department of English, University of Helsinki. http://www.helsinki.fi/varieng/domains/CEEC.html.
Corpus of Early English Correspondence Sampler (1998), see above.
Cowie, C. and C. Dalton-Puffer (2002), ‘Diachronic word-formation and studying changes in productivity over time: Theoretical and methodological considerations’, in: J.E. Díaz Vera (ed.) A Changing World of Words: Studies in English Historical Lexicography, Lexicology and Semantics. Amsterdam: Rodopi. 410–437.
Dalton-Puffer, C. (1996), The French Influence on Middle English Morphology: A Corpus-Based Study of Derivation. Berlin: Mouton de Gruyter.
Evert, S. and M. Baroni (2005), ‘Testing the extrapolation quality of word frequency models’, in: P. Danielsson and M. Wagenmakers (eds.), Proceedings of Corpus Linguistics 2005. The Corpus Linguistics Conference Series 1. Available at http://www.corpus.bham.ac.uk/PCLC/.
Evert, S. and M. Baroni (2007), ‘zipfR: Word frequency distributions in R’, in: Proceedings of the ACL 2007 Demo and Poster Sessions. Stroudsburg, PA: Association for Computational Linguistics. 29–32.
Garey, M.R. and D.S. Johnson (2003) [1979], Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman and Company.
Good, P. (2005), Permutation, Parametric, and Bootstrap Tests of Hypotheses. 3rd edition. Springer Series in Statistics. Berlin: Springer-Verlag.
Gotelli, J. and R. Colwell (2001), ‘Quantifying biodiversity: Procedures and pitfalls in the measurement and comparison of species richness’, Ecology Letters, 4: 379–391.
Hay, J. (2001), ‘Lexical frequency in morphology: Is everything relative?’, Linguistics, 39 (6): 1041–1070.
Marchand, H. (1969), The Categories and Types of Present-Day English Word-Formation: A Synchronic-Diachronic Approach. 2nd edition. Munich: C.H. Beck’sche Verlagsbuchhandlung.
MED = Middle English Dictionary, 2001 edition. Electronic version. Available at http://ets.umdl.umich.edu/m/med/.
Mitzenmacher, M. and E. Upfal (2005), Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge: Cambridge University Press.
Nevalainen, T. and H. Raumolin-Brunberg (2003), Historical Sociolinguistics: Language Change in Tudor and Stuart England. London: Pearson Education.
Nevalainen, T. and I. Tieken-Boon van Ostade (2006), ‘Standardisation’, in: R.M. Hogg and D. Denison (eds.) A History of the English Language. Cambridge: Cambridge University Press. 271–311.
OED = Oxford English Dictionary, 2nd edition, 1989. OED Online. Available at http://dictionary.oed.com.
Plag, I. (1999), Morphological Productivity: Structural Constraints in English Derivation. Berlin: Mouton de Gruyter.
R Development Core Team (2007), R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org.
Riddle, E.M. (1985), ‘A historical perspective on the productivity of the suffixes -ness and -ity’, in: J. Fisiak (ed.) Historical Semantics; Historical Word-Formation. Berlin: Mouton de Gruyter. 435–461.
Säily, T. (2005), ‘Use of the suffixes -ity and -ness in early English letters: Was gender a factor?’, unpublished seminar paper, Department of English, University of Helsinki.
Säily, T. (2008), Productivity of the Suffixes -ness and -ity in 17th-century English Letters: A Sociolinguistic Approach, unpublished MA thesis, Department of English, University of Helsinki. Available at http://urn.fi/URN:NBN:fi-fe200810081995.
Säily, T. and J. Suomela (2007), ‘Incidence matrices for -ness and -ity’. Available at http://www.cs.helsinki.fi/jukka.suomela/ity-ness-data/.
Suomela, J. (2007), ‘Type and hapax accumulation curves’, computer program. Available at http://www.cs.helsinki.fi/jukka.suomela/types/.
Tweedie, F.J. and R.H. Baayen (1998), ‘How variable may a constant be? Measures of lexical richness in perspective’, Computers and the Humanities, 32: 323–352.
Wrightson, K. (1993), English Society, 1580–1680. London: Routledge.
Does English have modal particles?
Karin Aijmer
University of Gothenburg
Abstract
Modal particles are functionally closely related to discourse markers. This raises the issue of whether modal particles have a common ‘class-identifying’ function which distinguishes them from discourse markers (and adverbs) as well as questions about what we mean by modality. Of course has been treated as a discourse marker as well as a modal adverb. However it does not seem to have been discussed as a modal particle. It is argued that we should distinguish between its uses as a discourse marker and modal particle on the basis of its formal properties and its functions.
1. Defining the problem
The interest in modal and evidential particles in different languages of the world over the last decades is evidenced in works such as Chafe and Nichols (1986), Palmer (1986) and Aikhenvald (2004); as a result, we can also expect more interest in studying particles in the European languages. Modal particles are also said to be a frequent feature of some, mainly Germanic, languages (e.g. German, Swedish, Dutch, Danish, Norwegian). In Swedish, we find ju (‘as you know’), nog (‘probably’), väl (‘surely’), visst (‘evidently’), and descriptions of German regularly identify over twenty modal particles (including schon, wohl, denn, ja) (Hoye 1997: 209). Modal particles are a subclass of pragmatic markers and they share a number of properties with other pragmatic markers. They are not part of the truth-conditional content; they are optional in the sentence and they have textual and interpersonal functions. The definition and classification of (modal) particles rely on a number of formal criteria such as position in the clause, syntactic integration and the lack of stress (Waltereit 2001: 1392; Hansen 1998). Modal particles, for example, usually differ from adverbs with regard to stress and position. They do not occupy initial or final position in the clause; instead, a ‘particle’ in the relevant languages has a fixed position in the verbal complex (the middle field), a topological notion referring to the position after the initial element of a complex verbal element. The formal criteria are fairly rigid and are influenced by the German tradition of ‘Partikelforschung’ (see e.g. Weydt 1969). Formal factors may not be equally important in all languages although they are part of the definition in German and in Swedish. However, modal particles ‘do not appear to belong to a very clearly defined modal system’ such as the modal auxiliaries (Palmer 1986: 45). Modal particles are generally felt to be semantically and pragmatically elusive (Waltereit
2001: 1392) and ‘the modal functions identified are considerably different in the different languages, or at least are conceptualized in different ways’ (Traugott 2007: 142). They can have meanings which are marginally modal or not obviously modal at all. As Palmer (1986) points out (quoting Curme 1905 (1960)), modal particles in German are paraphrased by ‘modal adverbs which denote in what manner a thought is conceived by the speaker’ and they seem to be ‘essentially comments on the proposition rather than opinions about it, and so not very obviously modal’ (Palmer 1986: 46). This raises the issue of whether modal particles have a common ‘class-identifying’ function which distinguishes them from discourse markers (and adverbs), as well as questions about what we mean by modality. Modal particles are functionally closely related to discourse markers. However the relationship between modality and discourse has not been much discussed (cf. Traugott 2007). For example, in the early literature on discourse markers such as Schiffrin (1987), modality is not discussed as a source of discourse markers. In this paper, my aim is to discuss the relationship between modality and different discourse and pragmatic functions. I will discuss the modal adverb of course which has been studied earlier but not from this perspective (cf. Simon-Vandenbergen and Aijmer 2002/2003; Wichmann et al. forthcoming). The adverb is multifunctional and has a number of pragmatic and discourse functions which are obviously modal but removed from the literal meaning of the adverb. Of course has been treated as a discourse marker as well as a modal adverb. However it does not seem to have been discussed as a modal particle. I will argue that we need to distinguish between its uses as a discourse marker and modal particle on the basis of its formal properties and its functions. In addition of course can be an answer particle. This function is easy to describe in both structural and discourse terms. 
Functionally of course is for example used if the speaker’s and hearer’s assumptions converge. The use as an answer particle will not be further discussed but is of interest if we want to describe the different functions of of course in terms of polysemy and grammaticalization.
2. Modal particles and modal adverbs
Modal adverbs provide the closest approximation to modal particles and it has recently been suggested that modal adverbs in English should be regarded as modal particles in some of their senses, ‘primarily those adverbs with only faint shades of meaning’ (Hoye 1997: 209). This idea fits in well with the hypothesis proposed by several linguists (Diewald 2006, Waltereit and Detges 2007) that modal particles are derived by grammaticalization from adverbs. In traditional descriptions of English grammar there is no place for modal particles. However, according to Hoye, the distinction between adverb and particle can be said to match the classification into different types of adverbs familiar from Quirk et al’s description (1985):
the concept of ‘modal particle’ is relevant to the classification of modal adverbs in English because, … according to the degree of their integration in clause structure and the nature of their association with the modal verb head, they display various degrees of lexical redundancy and grammaticalization (Hoye 1997: 209).
Quirk et al distinguish between adjuncts, disjuncts, subjuncts and conjuncts in terms of their centrality or peripherality in the clause. For example, adjuncts are typically integrated in the sentence and contribute to the propositional content just like other sentence elements. Of course is not an adjunct (a VP adverbial) in present-day English but was used in older English to indicate ‘that something occurred as a natural process’ (Lewis 2003). Of course in present-day English has as its core meaning ‘taking for granted’, ‘definitely’, ‘obviously’. Of course as a subjunct is illustrated in (1) where it is subordinate to the subject in the clause.
(1) Many young people of course prefer hip hop to rock music.
It can also be subordinate to the whole clause:
(2) Many young people may of course prefer hip hop to rock music.
Subjuncts ‘have to a greater or lesser extent, a subordinate role in relation to one of the other clause elements or to the clause as a whole. They exhibit considerably less semantic and grammatical independence than disjuncts and are more closely integrated in clause structure and especially the verb phrase.’ (Hoye 1997: 155). As a subjunct of course is ‘concerned with expressing the semantic role of modality in particular emphasis’ (Quirk et al 1985: 587). When of course is a disjunct it is more salient in the clause. Disjuncts ‘have a superior role as compared with the sentence elements; they are syntactically more detached and in some respects ‘superordinate’, in that they seem to have a scope that extends over the sentence as a whole’ (Quirk et al 1985: 613). The semantic role of disjuncts is to express a comment as to ‘the degree of or condition for truth of content’ (Quirk et al 1985: 615). Of course is for instance a high probability adverb conveying ‘the speaker’s strength of conviction or emphasis in the truth of the adjoining proposition; by topicalizing the firmness of the speaker’s belief the effect is, of course, to emphasize it’ (Hoye 1997: 190).
(3) of course he’ll be working with overseas students
(4) of course, when the subject matter concerns very recent events it may not be easy to convey new techniques (Hoye 1997: 190 abbreviated example).
In addition of course can be a conjunct encouraging ‘a particular attitude in the addressee as well as expressing the nature of the connection between the units they conjoin’ (Hoye 1997: 154). In (5), of course is used to express a contrast to the content in the preceding utterance.
(5) A: She could be waiting at the hairdresser’s, I suppose …
    B: Of course she could but all the same I don’t think it likely.
Of course signals concession (I grant that, certainly) followed by an argument in the but-clause. According to Hoye (1997: 212) “it would not be implausible to redefine subjuncts expressing modality as ‘modal particles’, subdivided into the following categories: evidential particles (clearly, obviously); hearsay particles (apparently); reinforcement or emphasising particles (certainly, surely, well); and focus particles (only, simply)”. Of course (not mentioned by Hoye) could presumably be regarded as a modal particle similar in meaning to certainly or to obviously. Hoye’s suggestion sparks off interest in the question whether English can be said to have modal particles. However it is not easy to say what of course means and how many meanings it has. In this article I will discuss of course as a polysemic marker which has developed functions which are characteristic both of discourse markers and modal particles. It will be shown that the functions of of course can be traced to the (presuppositional) properties of of course and the larger sequences of ‘rhetorical relations’ in which of course plays an important role (Lewis 2003). Hopefully the analysis of of course can also sharpen the analysis of what we mean by discourse markers and by modal particles. Another aim is to show how translations provide a method to circumscribe the meanings and functions of multifunctional and polysemous items by looking at the translations of of course into Swedish.
3. Translations as a model to study multifunctionality
Of course has several meanings which are not always easy to distinguish from each other. Paraphrasing goes some way towards describing what of course means in different contexts. Translations are a more indirect method to arrive at the meanings of a lexical item. The method is particularly interesting when lexical elements are multifunctional since the translator has to interpret the meaning of the lexical item in its context. The translator’s analysis can thus be a complement to the linguist’s analysis based on features such as position, collocation and above all the linguistic and non-linguistic context. The translations of of course range from meanings such as certainty (Swe. naturligtvis, givetvis, förstås) to translations such as ju ‘as you know’ (see Table 1). However the translations only provide ‘raw semantic data’ which have to be evaluated and further analyzed. We need for instance to explain why of course has a certain discourse function. Moreover the translations do not tell us if a new
meaning has been conventionalised or is only implicated. The frequency of a particular translation (or meaning) may however be a sign that conventionalisation has taken place. Low-frequency meanings on the other hand are more likely to be implicatures or side-effects of more salient meanings. The examples of of course discussed in this study come with a translation taken from the English-Swedish Parallel Corpus (Altenberg & Aijmer 2000), a corpus of almost three million words of fiction and non-fiction. Table 1 shows the translations of of course from English original texts (English originals -> Swedish translations) and the Swedish sources of of course (English translations <- Swedish originals).1 The zero-expressions are also important. Omission of of course can be expected if it has a weakened literal meaning and mainly pragmatic function as a modal particle or a discourse marker.

Table 1: Swedish translations and sources of of course

Swedish item                                            Translations from English  Swedish sources  Total
naturligtvis                                            91                         74               165
givetvis (‘of course’)                                  41                         12               53
det är klart (att) (‘it is clear that’)                 18                         5                23
visst (‘certainly’, ‘by all means’)                     6                          19               25
fast det är klart (‘but it is clear’)                   4                          -                4
ju (‘as you know’)                                      4                          66               70
ja (jo) det är klart                                    2                          1                3
visserligen (‘admittedly’)                              2                          8                10
javisst (ja), jovisst (‘certainly’)                     2                          8                10
men … ju (‘but … of course’)                            1                          -                1
naturellement                                           1                          -                1
självfallet (‘of course’)                               2                          6                8
självklart (‘of course’)                                1                          1                2
genast (‘at once’)                                      1                          -                1
naturligt (nog)                                         1                          1                2
för den skull (‘because of that’)                       1                          -                1
troligtvis (‘probably’)                                 1                          -                1
förstås(s) (‘of course’)                                71                         70               141
nog (‘probably’)                                        -                          7                7
väl (‘surely’)                                          -                          4                4
så klart (‘clearly’)                                    -                          2                2
då (‘then’)                                             -                          1                1
alltså (‘consequently’)                                 -                          1                1
för all del (‘by all means’)                            -                          1                1
förvisso (‘certainly’)                                  -                          1                1
of course                                               -                          1                1
minsann (‘indeed’)                                      -                          1                1
det förstår sig (‘that can of course be understood’)    -                          1                1
nämligen (causal ‘for’)                                 -                          1                1
som bekant (‘as is well-known’)                         -                          1                1
Other                                                   -                          4                4
Zero                                                    9                          11               20
Total                                                   259                        308              567
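The frequency argument above (frequent equivalents as a possible sign of conventionalisation, rare ones as likely implicatures) can be made concrete with a small tally over Table 1. The following is only an illustrative sketch: the function names and the cut-off used for "rare" items are assumptions made here, not part of the original study, and the förstås(s) row is entered as 71 and 70 because that is the reading under which the columns add up to the printed totals of 259 and 308. Following note 1, translations and sources are added together.

```python
# Counts transcribed from Table 1 as (translations from English, Swedish sources).
# A dash in the table is entered as 0.
TABLE_1 = {
    "naturligtvis": (91, 74),
    "givetvis": (41, 12),
    "det är klart (att)": (18, 5),
    "visst": (6, 19),
    "fast det är klart": (4, 0),
    "ju": (4, 66),
    "ja (jo) det är klart": (2, 1),
    "visserligen": (2, 8),
    "javisst (ja), jovisst": (2, 8),
    "men … ju": (1, 0),
    "naturellement": (1, 0),
    "självfallet": (2, 6),
    "självklart": (1, 1),
    "genast": (1, 0),
    "naturligt (nog)": (1, 1),
    "för den skull": (1, 0),
    "troligtvis": (1, 0),
    "förstås(s)": (71, 70),  # assumed reading, see lead-in
    "nog": (0, 7),
    "väl": (0, 4),
    "så klart": (0, 2),
    "då": (0, 1),
    "alltså": (0, 1),
    "för all del": (0, 1),
    "förvisso": (0, 1),
    "of course": (0, 1),
    "minsann": (0, 1),
    "det förstår sig": (0, 1),
    "nämligen": (0, 1),
    "som bekant": (0, 1),
    "Other": (0, 4),
    "Zero": (9, 11),
}


def combined(table):
    """Add translation and source counts per item (cf. note 1)."""
    return {item: t + s for item, (t, s) in table.items()}


def shares(table):
    """Relative share of each item in the combined total."""
    totals = combined(table)
    grand_total = sum(totals.values())
    return {item: n / grand_total for item, n in totals.items()}


def rare_items(table, threshold=2):
    """Items whose combined count is at or below a hypothetical cut-off."""
    return sorted(item for item, n in combined(table).items() if n <= threshold)


if __name__ == "__main__":
    # Print the five most frequent equivalents with their shares.
    ranked = sorted(shares(TABLE_1).items(), key=lambda kv: -kv[1])
    for item, share in ranked[:5]:
        print(f"{item:<22}{share:6.1%}")
```

A tally of this kind only ranks the equivalents; whether a frequent equivalent such as ju reflects a conventionalised meaning still has to be decided by reading the examples in context.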
Of course is translated in different ways reflecting its meanings as a discourse marker or a modal particle.
4. Of course as a discourse marker
When of course is a conjunct with concessive meaning I have analysed it as a discourse marker. In the well-known definition of discourse markers by Schiffrin (1987: 31) they are ‘sequentially dependent elements which bracket units of talk’. ‘That is, they do not add so much to the propositional content of utterances as flag the sequential structure of discourse by indicating how discourse relates to other discourse’ (Rühlemann 2007: 116). In (6) of course could be analysed as a discourse marker since it functions ‘conjunctively’: (6)
Breakfast: Breakfast was your most important meal. He hooked up the percolator and the electric skillet to the clock radio on his bedroom windowsill. Of course he was asking for food poisoning, letting two raw eggs wait all night at room temperature, but once he’d changed menus there was no problem. (AT1) Frukost — frukost var dagens viktigaste måltid. Han kopplade ihop kaffebryggaren och den elektriska kastrullen med klockradion på fönsterbrädet i sovrummet. Att låta ett par okokta ägg ligga och vänta hela natten i rumstemperatur var naturligtvis att medvetet utsätta sig för matförgiftning, men när han ändrade sin matordning blev det inga problem.
It can be safely inferred (‘taken for granted’) that leaving raw eggs at room temperature will lead to food poisoning, but in this case the person changed menus and therefore avoided being poisoned. However it is not self-evident how of course should be analysed in similar examples. It functions ‘conjunctively’ but it also expresses the speaker’s attitude. In (7) of course introduces an argument as given information which is later dismissed in the but-clause: (7)
Of course you haven’t been here long, but you’ll have heard of Davina Flory?” (RR1) Visserligen har ni inte varit här så länge, men nog måste ni ha hört talas om Davina Flory?”
Disjuncts express the speaker’s comments on the message and therefore seem to be less clearly discourse markers. However Thompson and Zhou (2000) have shown that disjuncts can be weakly connective although it is more difficult to label or explicate the relation to the preceding utterance in this case (Thompson and Zhou 2000: 137). For example, of course can typically combine with but or be replaced by but. In (8) but of course has the function to close off a topic (Topic A ‘there might be an interaction’) and presents or shifts to a new one (Topic B ‘but he would of course dominate it’) (Lewis 2003). (8)
Then there might possibly be an interaction, but all the time, of course, he’d dominate it with his grasp of the thing and if we were able to come up with anything, if he took hold of it, then he’d elaborate it in his own particular way. (CE1T) Då kunde det möjligtvis ske en växelverkan, men det är ju klart att det var hela tiden han som dominerade med sitt grepp på det och om vi då kunde komma med bidrag, om han högg tag i dem, så vidareutvecklade han ju dem på sitt speciella sätt.
Of course often collocates with but as in the example above where the speaker introduces an argument in order to dismiss it as irrelevant. In the following example the translator has added men ‘but’ in the Swedish translation thus signalling that the adversative meaning is implicit in the use of of course. Of course is a discourse marker since the speaker not only implies that something is absolutely certain but uses of course to achieve better coherence or to repair a potential coherence gap in the discourse (Waltereit and Detges 2007:65). Of course marks the transition to a new topic which is dismissed or ‘removed from the centre stage’ (Lewis 2003). (9)
That left him and Blake, the old man thought. In a way he envied Blake, completely assimilated, utterly content, who had invited him and Erita round for New Year’s Eve. Of course, Blake had had a cosmopolitan background, Dutch father, Jewish mother. (FF1) Nu var det han och Blake kvar, tänkte den gamle mannen. På ett sätt avundades han Blake, som var helt assimilerad och fullständigt belåten och hade bjudit hem honom och Erita på nyårsafton. Men Blake hade förstås också en kosmopolitisk bakgrund — holländsk mor och judisk far.
In (10) of course signals a change of direction in the speaker’s thought (‘on the other hand’). Of course introduces a counterargument which is dismissed before the speaker continues: (10)
…and he thought perhaps she wasn’t joking. A little later she said: “Of course, you can’t blame Harry Harris too much, considering what his wife’s like.” (FW1) Kanske ligger det i alla fall något bakom det hon säger, tänkte han. Lite senare sa hon: “Fast det är klart, man kan inte bara skylla på Harry Harris, eftersom man vet hurdan hans hustru är.”
The reason for using of course is to suggest that the following proposition is an established truth and therefore ‘unimportant’ in the larger argumentative context. (11) is another example of the argumentative function of of course: (11)
“You are a fool!” Hilary had banged on a kitchen cupboard as she spoke and the cups and plates inside trembled. “Of course he’s not coming back. The petty cash is empty. (FW1) “Ni är en idiot!” Hilary hade slagit näven i köksskåpen när hon talade, så att kopparna och tallrikarna därinne skakade. “Det är klart att han inte kommer tillbaka. Handkassan är borta, och jag ringde banken.
Of course as a discourse marker can also be used to add a point in an argument as in (12): (12)
He was impressed; it was from the General Secretary of the CPSU personally, handwritten in the Soviet leader’s neat, clerkish script and, of course, in Russian. (FF1) Han blev imponerad; det var från kommunistpartiets generalsekreterare personligen, handskrivet med den sovjetiske ledarens prydliga, bokhållaraktiga stil, och givetvis på ryska.
The fact that the letter was in Russian is not simply inferred from what has been said earlier (the letter was from the secretary of the Communist party personally). It provides ‘a new final idea on a particular topic’ and is used in persuasive discourse to clinch a point in an argument (cf. Lewis 2003). The document received was written by the Soviet leader personally in his own handwriting and most importantly it was in Russian. As shown by the translations, of course does not express certainty or emphasis only but it has developed discourse-marking functions e.g. to shift the topic or to add a point to the argument. Because of its close relationship with adversative markers like but and with additive markers (and) I have regarded of course as an emergent discourse
marker with the function of achieving interpersonal and textual coherence while concealing any disagreement between the participants.
5. Of course — a modal particle?
Modal particles fulfil basic communicative functions in language which differ from those of discourse markers. According to Waltereit (2001), modal particles have the common function of modifying the preparatory conditions of the speech act ‘at minimal linguistic expense’. For example, the speaker can say both ‘the great tradition in Cádiz is both dance and song’ and ‘the great tradition in Cádiz is of course both dance and song’. The unmarked form without of course is the preferred one since it respects conversational maxims or heuristics associated with the assertion. The insertion of of course forces the hearer to find a motivation for the speaker’s flouting of the non-obviousness maxim (don’t say something which is obvious to the hearer). Waltereit comments on the German example Die Malerei war ja schon immer sein Hobby (painting has always been his hobby): ‘the effect of ja [a particle with the literal meaning of affirmation] seems to be that the preparatory condition on assertions concerning non-obviousness of p is cancelled, i.e., that the assertion counts as a relevant contribution to conversation even if the propositional content is obvious to the addressee’ (Waltereit 2001: 1398). The reference to the justification for the speech act introduced by of course (or by German ja) can also be achieved by explicit means. The German sentence discussed by Waltereit can be paraphrased: You certainly know that painting has always been his hobby. I’m only saying this because I need this fact for my argumentation (Waltereit 2001: 1399). As suggested by the paraphrase, the motive for using ja can be rhetorical or argumentative. What makes of course special in the example given above for illustration is that it does not only focus on what the speaker knows but signals that the hearer knows (or should know) as well. The meaning of the modal particle can be understood from the presuppositional properties of the adverb.
Because of its evidential or modal meaning (something is self-evident or certain) of course presupposes that something is given or known information. By means of pragmatic accommodation (Lambrecht 1994) new presuppositions can come into existence and ultimately become conventionalized. Pragmatic accommodation is described as follows: ‘if the presupposition evoked by some expression does not correspond to the presuppositional situation in the discourse it is normally automatically supplied by the speech participants’ (Lambrecht 1994: 67). For example, the speaker can exploit the presuppositional structure of of course in order to mildly draw the hearer into sharing an opinion or to signal a step in the argumentation (‘let us assume that this is shared knowledge; it follows that …’). In previous work (Wichmann et al., forthcoming) we described the function of of course as heteroglossic since it opens up the dialogic space for alternative voices (White 2003). Speakers engage in interaction and use of course and other modal adverbs rhetorically or strategically to take up a position of
alignment or disalignment to assumptions or beliefs generated by the preceding discourse (White 2003; Wichmann et al., forthcoming). The heteroglossic function explains why we can use of course as a modal particle with the interpersonal function of taking up a stance challenging what is said or the expectations arising from the preceding text: (13)
But the great tradition in Cádiz is, of course, as mentioned earlier, dance and song. (BTC1T) Men den stora traditionen i Cádiz är ju, som tidigare nämnts, dansen och sången.
The speaker provides justification for the statement by referring explicitly to the fact that it has been mentioned earlier (‘as mentioned earlier’). Ju in the translation can be paraphrased by ‘as you know’. The example illustrates what I mean by the use of of course as a modal particle. The modal particle has the function to comment on or make adjustments to the interactants’ common ground in order to avoid misunderstandings. Vaskó and Fretheim refer to the context-adjusting function of modal particles added at strategic points in the discourse, for example ‘to check whether the speaker’s and hearer’s contextual assumptions converge or diverge’ (Vaskó and Fretheim 1997: 245). The examples I will discuss as modal particles are those where of course has been translated as ju, i.e. the translator has interpreted of course as having the meaning ‘as you know’, ‘as everyone knows’. We also need to look at the factors which explain the translator’s choice. Example (13) where the meaning of of course is also signalled by ‘as mentioned earlier’ should be compared with (14) where the clause introduced by of course represents a bridging context in the terminology of Evans and Wilkins (2000: 550): ‘In these contexts… speech participants do not detect any problem of different assignments of meaning to the form because both speaker and addressee interpretations of the utterance in context are functionally equivalent, even if the relative contributions of lexical content and pragmatic enrichment differ.’ When of course co-exists with must the meanings certainty (emphasis) and ‘evidence’ or justification are present simultaneously and cannot be teased apart. (14)
If global decisions are to have legitimacy, then of course they must be representative. (EISC1T) För att globala beslut skall ha någon legitimitet, måste de ju vara representativa.
In (15) of course must be interpreted as the modal particle although the knowledge status of the participants is not referred to explicitly.
(15)
Pasqual Pinon’s two heads are shown on a series of photographs from the 1920s and 1930s; the last was taken only a few days before his death. He had of course acquired a certain international fame by then, and had been the subject of a biography, which was published after his death: this was written by the impresario John Shideler, and called A Monster’s Life. There are pictures enough. They all express sadness and dignity; as if the two heads always looked into the camera conscious that they would never be understood, that those seeing the pictures would never understand. (PE1T) Pasqual Pinons två huvuden finns återgivna på en rad fotografier från 20- och 30-talen; det sista är taget bara några dagar före hans död. Han hade ju då uppnått en viss internationell ryktbarhet, och blev föremål för en biografi publicerad efter hans död: det var impressarion John Shideler som skrivit denna biografi, “A Monster’s Life”. Bilder finns det gott om. De ger alla uttryck för sorg och värdighet; som om de två huvudena alltid såg in i kameran medvetna om att de som såg bilderna aldrig skulle förstå.
On the hierarchical discourse level the clause containing of course is backgrounded. The topic (Pinon’s two heads are shown in a series of photographs) is resumed after the addition of backgrounded information (Pinon had achieved international fame by then) needed to guarantee that misunderstanding will not occur. Of course is inserted for the prophylactic purpose to show the relationship between the proposition introduced by of course and the preceding sentence but also to show that the information is backgrounded in relation to the main topic. (16) is another example of the necessity to analyse the function of of course in relation to the organization of the text. The speaker claims that the three Spanish cities Sevilla, Cádiz and Jerez are in fact the cradle of flamenco. This is qualified by the statement that Jerez is best known as the city of wine. Of course introduces the information which is needed in order to justify this claim (the sweet grape is grown, several famous wine cellars are located there). Lewis refers to this as a metatextual backgrounding function: ‘the thread of the narrative is broken… to inform the hearer of a circumstance that will make the narrative more coherent’ (Lewis 2003: 87). (16)
Sevilla, Cádiz and Jerez all claim to be the cradle of flamenco, and all three cities are, in fact, important names in the history of flamenco. Jerez, of course, is best known as the city of wine. Between the mouths of the Guadalquivir and Guadalete rivers the sweet grape is grown from which the sherry wine with all its variants comes. The great Bodegas (winecellars) with famous names like Domecq, González Byass, Sandeman and several others are located there. In the world of flamenco two types of flamenco song can be identified: cante flamenco andaluz and cante flamenco gitano, Andalusian flamenco song which is sung by a payo (non-gypsy) and gypsy-flamenco song which is sung by a calé (gypsy). (BTC1T) Sevilla såväl som Cádiz och Jerez gör anspråk på att vara den ort där flamencons vagga stod. Säkert är dock att alla de tre städerna är viktiga namn i flamencons historia. Jerez är ju framförallt känd som vinets stad. Mellan Guadalquivirs och Guadaletes mynningar odlas den ljuva druvan som ger sherryvinet i alla dess varianter. Där finns de stora bodegorna med sina kända namn som Domecq, González Byass, Sandeman och flera till. I flamencovärlden skiljer man på två typer av flamencosång: cante flamenco andaluz och cante flamenco gitano. Andalusisk flamencosång som sjungs av en payo — icke-zigenare — och zigenarflamencosång som sjungs av en calé — zigenare.
The importance of taking into account the larger rhetorical context for the interpretation of of course becomes especially clear when we look at examples where there are no linguistic cues to the interpretation. In the following example of course does not have a discourse marking or topic-shifting function but introduces additional information supporting the main topic as is shown by looking at the text: (17)
She was so intense it seemed my quiet mother, her hair groomed and elegant legs neatly crossed as if her husband were there to approve of the standard — the self-respect — she kept up, was the one to supply support and encouragement. Of course I know her. That broad pink expanse of face they have, where the features don’t appear surely drawn as ours are, our dark lips, our abundant, glossy dark lashes and eyebrows, the shadows that give depth to the contours of our nostrils. … And even if I hadn’t known her, I could have put her together like those composite drawings of wanted criminals you see in the papers, an identikit. The schoolboy’s wet dream. My father’s woman. But I had no voluptuous fantasy that night. I woke up in the dark. (NG1) Hon var så intensiv att det kom att verka som om min stillsamma mor — med sitt uppsatta hår och de eleganta benen prydligt korsade som om
hennes make fanns där och kunde uppskatta att hon fortfarande höll på stilen, självaktningen — var den som gav stöd och uppmuntran. Visst känner jag henne. Ett sådant där skärt ansikte som liksom breder ut sig och där dragen inte verkar klart avgränsade som våra är, våra mörka läppar, våra täta blanka mörka ögonfransar och ögonbryn, skuggorna som ger djup åt våra näsvingars konturer. … Även om jag inte hade känt henne kunde jag ha plockat ihop henne som de där “spökbilderna” av efterlysta brottslingar som man ser i tidningarna, en nyckel till en identifiering. Skolpojkens sexdröm. Min fars kvinna. Men jag hade inte några vällustiga drömmar den natten. Jag vaknade i mörkret.
Of course signals a break in the topic (The woman was unlike the speaker’s mother), also marked by a change of tense from the past to present tense before the topic is resumed (I woke up in the dark). In the clause introduced by of course the speaker stops to think about a certain type of woman: ‘the broad pink expanse of face they have’ unlike other women the speaker knows. She is the picture of a woman the speaker could have imagined - a schoolboy’s dream. The example is unusual because of course has initial position, a position which is more typical of the discourse marker function.2 Notice that of course is not usually found in initial position when it is backgrounded as in the example above. The Longman Dictionary (LDOCE) observes: ‘Instead of saying: We play a lot of tennis and polo. Of course we have our own swimming pool, you would say: We also have our own swimming pool, of course or …and of course we have our own swimming pool.’ As appears from the dictionary example, of course can also have end position as in the following corpus example. The function of the modal particle in this example is to refer to something both the speaker and hearer know, in order to establish rapport (Holmes 1988): (18)
But sometimes I can’t get my breath, I have difficulty in breathing. I’m not as young as I was, of course, and you’ve got to have some ailment. (SC1T) Men ibland har jag svårt för att få luft, svårt att andas. Jag är ju inte så ung längre och nån krämpa ska man ju ha.
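The positional observations above (initial position as more typical of the discourse marker, medial or final position as more typical of the backgrounding modal particle) could be used as a first rough sort of corpus hits before manual analysis. The sketch below is a hypothetical helper, not a procedure used in the study: it assumes the text has already been split into clauses and simply reports where the first of course occurs in a clause.

```python
import re
from typing import Optional

# Word tokens: runs of letters, optionally joined by an apostrophe (he’s, I’m).
TOKEN = re.compile(r"[^\W\d_]+(?:[’'][^\W\d_]+)*")


def position_of_of_course(clause: str) -> Optional[str]:
    """Return 'initial', 'medial' or 'final' for the first 'of course'
    in a clause, or None if the clause does not contain it.

    The caller is assumed to pass a single clause; clause segmentation
    is outside the scope of this sketch.
    """
    tokens = TOKEN.findall(clause.lower())
    for i in range(len(tokens) - 1):
        if tokens[i] == "of" and tokens[i + 1] == "course":
            if i == 0:
                return "initial"
            if i + 2 == len(tokens):
                return "final"
            return "medial"
    return None


if __name__ == "__main__":
    examples = [
        "Of course he’s not coming back.",
        "Jerez, of course, is best known as the city of wine.",
        "I’m not as young as I was, of course",
    ]
    for clause in examples:
        print(position_of_of_course(clause), "|", clause)
```

Position is at best a cue: as example (17) shows, an initial of course can still be backgrounding, so the output of such a sort would need to be checked against the wider rhetorical context and, where available, the translation.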
As shown by the following example, of course as a modal particle expresses weak connection with both the preceding context and what follows. However unlike the discourse marker, of course as a modal particle does not change the topic but refers to some circumstance (fact, information) which is needed to establish interpersonal coherence (speaker and hearer share assumptions, beliefs
and knowledge). A closer analysis of the contexts where of course is used shows that it is typically used to introduce a topic which is subordinate to another topic (e.g. the explanation for a claim, background information needed to facilitate the progression of a narrative). (19)
“Yes,” said Asplund, “that’s the whole idea.” I had a few meetings with Lewerentz when we were drawing the Bredenberg department store. Lewerentz of course took over his father’s factory in Eskilstuna and used a metal window that wasn’t so common in those days, with interlinked arches and double glazing. It was absolutely new, because we had invited tenders from German companies for that sort of design. Then Lewerentz came along and said that he could do it cheaper, but he couldn’t meet the delivery deadlines. (CE1T) “Jo,” sa Asplund, “det är ju det som är meningen.” Jag hade en del sammanträffanden med Lewerentz när vi ritade Bredenbergs varuhus. Lewerentz övertog ju sin fars fabrik i Eskilstuna och körde med ett metallfönster som inte var så vanligt på den tiden med kopplade bågar och dubbla glas. Det var alldeles nytt, för vi hade tagit in anbud från tyska firmor på en sån konstruktion. Då kom Lewerentz och menade att han kunde göra det där billigare, men han kunde inte klara leveranstiderna.
Of course relates a proposition to the preceding utterance which contains the new information (I had a few meetings with Lewerentz). By introducing a reference to shared evidence for the information in the first utterance the speaker makes sure that misunderstandings are avoided. In (20) the new information is that the shipping company was obliged to lay the vessels up. Of course signals that the sentence to which it is attached fits into the context as backgrounded information. The backgrounded utterance marked by of course is followed by a resumption of the topic or narrative. (20)
Export volumes to Belgium and France were small and the Gällivare company was periodically compelled to lay the vessels up. Narvik did not come into use before the beginning of 1903, of course. (TR1T)
Exportkvantiteterna till Belgien och Frankrike var små och Gällivarebolaget fick periodvis lägga upp fartygen. Narvik kom ju ej i bruk förrän 1903.
The precise interpretation of of course depends on the context. Example (21) is different from other examples discussed because of course modifies a question. The hearer’s wife is from America and the speaker refers to this circumstance as the justification for asking the question. (21)
And now the Boss stands there, several years closer to Modern Times, and wants to placate, shouts down the stairs. “There was one thing, Aron. I’ve purchased a gramophone and wonder if you know... your wife was from America, of course. (GT1T) Och nu står Patron där, några år närmare det Moderna och vill blidka; ropar neråt trappan. — Det var en sak till, Aron. Jag har inköpt en grammofon och undrar, känner du till... din hustru hon var ju från Amerika.
In a question, of course comes to mean ‘request for confirmation’ rather than reference to evidence or justification. There are some syntactic contexts where of course cannot be interpreted as a discourse marker and which are therefore indications of the conventionalisation of of course as a modal particle. For example, when of course is found in a nonrestrictive relative clause the information is already backgrounded or ‘parenthetical’. By sneaking in of course the speaker makes it even more difficult for the hearer to avoid the implication of shared knowledge associated with of course. The assumed knowledge (‘as you know’) may be specific to a social network of which both the speaker and the hearer are part as may be the case in the following example (cf. Holmes 1988 ‘the confidential of course’). (22)
“But one must not forget the long winters which, of course, for seventy to eighty percent consist of complete darkness”. (GT1T) Men man får då inte glömma de långa vintrarna som ju till sjutti, åttio procent består av rent mörker.
Similarly in (23) the information in the because-clause is presupposed and therefore backgrounded. Of course denies that the information is new but implies that it is uncontroversial because it is shared by the community: (23)
‘How do you do, Franklin,’ said Auntie, shaking the boy’s hand (she found herself wondering just whom it had originally belonged to, because of course it was, as you might say, second-hand). (ARP1T)
— Goddag Franklin, sa fastern och tog gossens hand (och hon kom på sig med att undra vem den hade tillhört i original, den var ju numera så att säga second hand).
In the following example of course introduces shared information as an afterthought after a break (and therefore backgrounded): (24)
“We became friends because we shared some artistic enthusiasms — music, and manuscripts, and calligraphy, and that sort of thing — and of course he made me one of his executors. (RDA1) “Vi blev vänner för att vi hade en del konstnärliga intressen gemensamt — musik och manuskript och kalligrafi och sådana saker, och han gjorde ju mig till en av sina testamentsexekutorer.
In this article I have only discussed the meaning of of course. Other modal particles have a more transparent meaning, for example I think, which can be explained as cancelling or flouting the preparatory condition that the speaker is sincere (Bill is fat I think) (Aijmer 1997). Like other modal particles it is used to check that the speaker and hearer are on the same wavelength by referring to the background context for the assertion.
6. Conclusion
English has modal particles ‘which look like adverbs’ but can be distinguished from those on the basis of function as well as on the patterns where they occur. Hoye (1997) referred to the adverbs as modal particles when they were subjuncts, i.e. subordinate in the clause when compared with disjuncts. However, the meaning of the modal particle is not simply a weakening of the modal meaning of the adverb as suggested by Hoye but reflects the fact that the adverb has been ‘pragmaticalized’ and has a number of new functions. Moreover, Hoye did not discuss the difference between modal particles and discourse markers which is important when we analyse the functions of of course. In this light, the translation data from Swedish is particularly interesting because it gives evidence for a functional split between different uses of of course. I have suggested that these differences could be characterised in terms of the difference between discourse markers and modal particles. Modal particles in English are above all a functional category, although they can have certain formal characteristics. Since they are backgrounding, they are for instance normally placed in medial or final position. It has been argued that we can understand their functions by referring to the conditions for the speech act. The speaker can, for instance, say either John of course took over his father’s factory or John took over his father’s factory. With the first alternative, of course has the procedural or signalling meaning to comment on how the
information fits with the background context (the preparatory conditions of the speech act). Thus, for example, of course is incompatible with ‘the non-obviousness’ condition of the assertion and is motivated by the need to avoid misunderstandings caused by divergent opinions. Of course can have a number of different functions which are explained by its presuppositional properties (something is given or true). It can be used dialogically to take up a stance to what the hearer knows (as you know, as you should know) or what is common knowledge in order to agree or disagree. Because it comments on common ground, it can have functions such as solidarity or rapport if used by members in a social group. Other functions are argumentative or manipulative. Of course as a modal particle can also appear in contexts where it has a backgrounding function. By this, I mean that it is used for ‘subordinate’ functions such as elaborating or explaining what is said or to remedy a break of the narrative thread. As a discourse marker on the other hand, of course guides the hearer through the discourse signalling a topic shift, new turns, or the introduction of new points in the argumentation. Of course was frequently found after but which marks a deviation from the main topic to a new thought or argument. Moreover, it is foregrounding, i.e. it introduces new information into the discourse. Table 2 summarizes the meanings of the modal particle and the discourse marker:

Table 2: The discourse marking and modal particle functions of of course

Discourse marker
Foregrounding (new information)
Dismissive
Concessive
Topic-shifting
Marking steps in the discourse or points in the argumentation

Modal particle
Backgrounding (old information)
Context-adjusting
Argumentative/manipulative
Solidarity (positive politeness)

The polysemy of of course in present-day English and the relationship between the different meanings is motivated by the diachronic changes.
The co-existence of variants representing different stages of the language is known as layering (Hopper 1991). Layering and the dynamic view of language it presupposes is also evidenced by bridging contexts where the functional distinctions between different uses of of course seem to be neutralized. In a bridging context of course can for instance both mean certainty and ‘as you know’. Modality is a broad notion as illustrated by this study of of course and should be redefined to take into account its interactional uses. Of course does not only refer to certainty but can be realized in many different ways. For example, of course has interpersonal modal meanings (to refer to what is clearly familiar or true) associated with pragmatic accommodation as well as with meanings
oriented towards ‘interpersonal or dialogical coherence’ (the speaker imposes him- or herself in the discourse in order to shift the topic or to reject it before continuing).
Notes
1 There may be an imbalance between the same item as translation and as the source of a translation due to the translation process itself. I have therefore referred to the frequencies of translations and sources together.
2 Visst (certainly) in the translation emphasises the dialogical aspect of of course although the speaker in this case responds to his own thoughts rather than to the hearer.
References Aijmer, K. (1997), ‘I think – an English modal particle’, in: T. Swan & O.J. Westvik (eds.) Modality in Germanic languages. Historical and comparative perspectives. Berlin: Mouton de Gruyter. 1-47. Aijmer, K. and A.-M. Simon-Vandenbergen (2007), The semantic field of modal certainty. A corpus-based study of English adverbs. Berlin and New York: Mouton de Gruyter. Aikhenvald, A.Y. (2004), Evidentiality. Oxford: OUP Altenberg, B. and K. Aijmer (2000), ‘The English-Swedish Parallel Corpus: A resource for contrastive research and translation studies’, in: C. Mair and M. Hundt (eds.) Corpus linguistics and linguistic theory. Papers from the 20th International Conference on English Language Research on Computerized Corpora (ICAME 20) Freiburg im Breisgau 1999. Amsterdam & Philadelphia: Rodopi. 15-33. Chafe, W. and J. Nichols (eds). (1986). Evidentiality: The linguistic coding of epistemology. Norwood, N.J.: Ablex. Curme, G.O. 1905 (1960), A grammar of the German Language. London: Macmillan: (1960 rev. edn) New York: Frederick Unger. Diewald, G. (2006), ‘Discourse particles and modal particles as grammatical elements’, in: K. Fischer (ed.) Approaches to discourse particles. Amsterdam: Elsevier. 403-425. Evans, N. and D. Wilkins (2000), ‘In the mind’s ear: The semantic extensions of perception verbs in Australian languages’. Language 76 (3): 546-592. Fraser, B. (1996), ‘Pragmatic markers’. Pragmatics (6)2: 167-190. Holmes, J. (1988), ‘Of course, A pragmatic particle in New Zealand women’s and men’s speech’. Australian Journal of Linguistics 2: 49-74. Hoye, L. (1997), Adverbs and modality in English. London and New York: Longman. Lambrecht, K. (1994), Information structure and sentence form. Topic, focus, and the mental representations of discourse referents. Cambridge: CUP. Lewis, D.M. (2003), ‘Rhetorical motivations for the emergence of discourse particles, with special reference to English of course’, in: T. van der
Wouden, A. Foolen and P. Van de Craen (eds.) Particles, Special issue of Belgian Journal of Linguistics 16: 79-91. The Longman dictionary of contemporary English (1995) [1978] [LDOCE]. Palmer, F.R. (1986), Mood and modality. Cambridge: CUP. Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive grammar of the English language. London: Longman. Rühlemann, C. (2007), Conversation in context. A corpus-driven approach. London and New York: Continuum. Schiffrin, D. (1987), Discourse markers. Cambridge: CUP. Searle, J.R. (1969), Speech acts. An essay in the philosophy of language. Cambridge: CUP. Simon-Vandenbergen, A.-M. and K. Aijmer (2002/2003), ‘The expectation marker of course’. Languages in Contrast 4 (1): 13-43. Thompson, G. and J. Zhou (2000), ‘Evaluation and organization in text: The structuring role of evaluative disjuncts’, in: S. Hunston and G. Thompson (eds.) Evaluation in text. Authorial stance and the construction of discourse. Oxford: OUP. 121-141. Traugott, E. Closs (2007), ‘Discussion article: Discourse markers, modal particles, and contrastive analysis, synchronic and diachronic’, in: M. Josep Cuenca (ed.) Catalan Journal of Linguistics (6). 139-157. Special issue: Contrastive perspectives on Discourse Markers. Vaskó, I. and T. Fretheim (1997), ‘Some central pragmatic functions of the Norwegian particles altså and nemlig’, in: T. Swan & O.J. Westvik (eds.) Modality in Germanic languages. Historical and comparative perspectives. Berlin: Mouton de Gruyter. 233-292. Waltereit, R. (2001), ‘Modal particles and their functional equivalents: A speech-act theoretic approach’. Journal of Pragmatics 33: 1391-1417. Waltereit, R. (2006), Abtönung. Zur Pragmatik und historischen Semantik von Modalpartikeln und ihren funktionalen Äquivalenten in romanischen Sprachen. Tübingen: Max Niemeyer Verlag. Waltereit, R. and U. Detges (2007), ‘Different functions, different histories. Modal particles and discourse markers from a diachronic point of view’, in: M. Josep Cuenca (ed.) Catalan Journal of Linguistics (6). 61-80. Special issue: Contrastive perspectives on Discourse Markers. Weydt, H. (1969), Abtönungspartikel: die deutschen Modalwörter und ihre französischen Entsprechungen. Bad Homburg: Gehlen. White, P. (2003), ‘Beyond modality and hedging: a dialogic view of the language of intersubjective stance’. Text 23(2): 259-84. Wichmann, A., A.-M. Simon-Vandenbergen and K. Aijmer (forthcoming), ‘How prosody reflects semantic change: a synchronic case study of of course’, in: K. Davidse and H. Cuyckens (eds.) Subjectification, intersubjectification and grammaticalization. Berlin and New York: Mouton de Gruyter.
A reassessment of the syntactic classification of pragmatic expressions: the positions of you know and I think with special attention to you know as a marker of metalinguistic awareness
Julie Van Bogaert
Ghent University, Belgium; Research Foundation – Flanders (FWO-Vlaanderen)
Abstract
This paper wishes to point out some limitations to the way in which the syntactic classification of pragmatic expressions has traditionally been handled. As an alternative, it proposes a syntactic classificatory system that pivots on the notion of scope. This alternative approach is applied to corpus data of you know and I think. An attempt is made to establish connections between the pragmatic expressions’ syntactic behaviour on the one hand and their functional properties on the other hand and to compare the findings for both expressions. In so doing, special attention is devoted to local you know, a specific syntactic use of this pragmatic expression, which is found to correlate with a particular function, viz. that of marking metalinguistic awareness. The findings of this study may have implications for the way you know is commonly viewed, especially by laypeople, but also in scholarly settings.1
1. Introduction
You know and I think are two of the most frequently used pragmatic expressions that take the form of a verb of cognition with a first or second person singular subject. They are at the top of the list of most frequently used expressions of this type in the London Lund Corpus (Stenström 1995: 293) and this is no different in the data that I have used for this article (cf. section 2). In Scheibman’s study of stance in American English conversation, I think was preceded, in a frequency count of verbs of cognition in the first person singular simple present, only by I (don’t) know, and you know towered over all other cognitive verbs in the corresponding second person singular category (Scheibman 2002: 64-67, 74-76). You know and I think have been studied from various perspectives and have been referred to by a plethora of terms such as discourse marker (Schiffrin 1987), modal particle (Aijmer 1997), discourse particle (Aijmer 2002) or comment clause (Jespersen 1937; Peltola 1983; Stenström 1995; Quirk et al. 1985), to name but a few. Aijmer (this volume) suggests that ‘discourse marker’ and ‘modal particle’ are not just two alternative labels for one and the same concept by pointing out a functional split between of course as a discourse marker and of course as a modal particle. Adopting a sociolinguistic point of view, Bernstein (1971: 98, 109-14) found that “egocentric” I think sequences are typical of middle class speakers while “sociocentric” sequences like you know are more frequently used among working class speakers. Huspek (1989) further developed
this insight and took the factor of group identity into consideration when looking at the use of you know and I think in the language of his working class American informants.
The first significant monograph dealing specifically with you know is Östman’s (1981). Subsequent studies that try to come to grips with the pragmatics of this expression, often looking for its core meaning, include Schourup (1985), Erman (1987), Stenström (1995), He and Lindsey (1998), Jucker and Smith (1998) and Fox Tree and Schrock (2002). Some authors approached you know from a variationist-sociolinguistic point of view (Holmes 1986; 1990; Stubbe and Holmes 1995; Erman 2001) and others still investigated speakers’ perceptions of it (Watts 1989; Fox Tree 2007). Virtually all of the aforementioned studies, by their very interest in the ‘meaning’ and functions of you know, gainsay laypeople’s opinions on this “exasperating expression” (Stubbe and Holmes 1995) or “verbal garbage” (Schourup 1985: 94) that has been a popular target for prescriptivists (Schourup 1985; Fox Tree 2007).
As regards I think, Urmson, as early as 1952, tackled the issue of first person singular verbs in the present tense combining with a that-clause that have the capacity to occur in positions other than the beginning of a sentence and dubbed them “parenthetical verbs”. This syntactic mobility aroused the interest of transformational grammarians, who explained the phenomenon as a case of S(entence)-lifting (e.g. Ross 1973). Hooper (1975) investigated different types of predicates combining with that-clauses and called I think a weak assertive predicate. In more recent years, a number of studies appeared that apply grammaticalization theory to I think and pragmatic expressions of the same type (e.g. Thompson and Mulac 1991a; 1991b; Palander-Collin 1999; Tagliamonte and Smith 2005; Van Bogaert 2006). The sociolinguistic and stylistic conditions for the use of I think have also been explored.
Simon-Vandenbergen looked into the expression’s relation with social class (2002) and its use in political discourse (1998; 2000) while Andersen concentrated on its use in teenager talk (2001). The equivalents of I think and other expressions with cognitive verbs have been studied in languages other than English. With reference to the Romance language family, Blanche-Benveniste and Willems (2007), Schneider (2007) and Dendale and Van Bogaert (2007) should be mentioned and Nuyts (1994) compared Dutch mental state predicates to other grammatical means of realizing epistemic modality. Some studies devote attention to the question of the syntactic positions that you know, I think and related expressions can occupy in a sentence (Erman 1987; Thompson and Mulac 1991a; 1991b; Stenström 1995; Aijmer 1997; Simon-Vandenbergen 1998; 2000; 2002; Van Bogaert 2006). This issue will be of central interest in the present article. The aim of this paper is to raise a few points of criticism of the approach that is traditionally adopted to the syntactic classification of you know and I think and to propose an alternative way of looking at the syntactic behaviour of these expressions that is based on the notion of scope. It will be demonstrated that this alternative system can be helpful in explaining certain correlations between syntactic position and function.
After a brief presentation of the data, the canonical approach to the syntactic classification of pragmatic expressions is critically evaluated and an alternative classificatory system is put forward. This alternative is then applied to the data of you know and I think and functional explanations are offered for the syntactic findings. In this discussion, special attention is devoted to a particularly interesting type of you know that will be referred to as local you know.
2. Data
This study makes use of corpus data from the ICE-GB (International Corpus of English – the British Component), which comprises 1,061,264 words. However, only the spoken part of the corpus was taken into consideration as you know and I think are typically used in spoken language (Schourup 1985; Biber et al. 1999: 668-69; Aijmer 2002). The spoken section of the ICE-GB contains 637,562 words and yields 1,081 occurrences of you know and 1,734 of I think. In the I think data, allowance has been made for negative transportation, which means that examples with I don’t think have also been incorporated. Unclear and unfinished utterances have been left out of consideration.
3. The canonical tripartite system for the syntactic classification of pragmatic expressions
Traditionally, pragmatic expressions like you know and I think have been syntactically classified following the well-known tripartite system distinguishing between initial, medial and final position, similar to the syntactic description of adverbs. The application of this by now canonical system to syntactically mobile first person singular cognitive verbs dates back to Urmson’s work on what he called “parenthetical verbs” (Urmson 1952):
I suppose that your house is very old.
Your house is, I suppose, very old.
Your house is very old I suppose. (Urmson 1952: 221)
Quirk et al. proposed a more elaborate system for the syntactic classification of adverbials. They refined the tripartition by splitting up the categories ‘medial’ and ‘final’ into ‘initial medial’, ‘medial medial’, ‘end medial’, ‘end’ and ‘initial end’ (Quirk et al. 1985: 491-500).
Alternatively, some scholars have described pragmatic expressions in terms of their position in the turn (Erman 1987) or in the intonation unit (He and Lindsey 1998; Kärkkäinen 2003). However valuable these approaches may be, they do not take syntax as a starting point. In this paper a conscious decision is made to regard the position of you know and I think in relation to syntactic rather than discursive or intonational units.
The main fault that can be found with the canonical tripartite system is that it does not do justice to the pragmatic expressions’ functional specificities. Labelling a token as initial, medial or final reveals rather little about the pragmatic or interpersonal functions that it may have. In other words, the canonical tripartition does not constitute a sufficiently refined tool for explaining why a particular pragmatic expression occurs where it does. In this respect, I feel that this classificatory system misses two important points. Firstly, it does not take account of the syntactic level at which a pragmatic expression occurs. As such, it would group together the following two occurrences of I think in the category ‘medial’.
(1) Cameing Coming out I think was definite
(2) Father McDade d’you you remember in I think lecture three uh Rabbi Sacks said at one point faith is not measured by acts of worship alone
The canonical approach pays no heed to the fact that in (1) I think is placed in between the constituents of a clause whereas in (2) it is used at a lower syntactic level; here I think has been inserted within a phrase functioning as a clause constituent. This syntactic difference is functionally relevant as will become clear in this article.
The second drawback to the canonical tripartite system is its failure to take into account the scope2 of the pragmatic expression. It wrongly assumes that a pragmatic expression invariably has scope over the entire clause (complex) in which it occurs, failing to notice that its scope is sometimes limited to one particular phrase or word functioning as a clause constituent or as a phrase constituent.
(3) You know you have a beech wood that might be all beeches and it might be on limestone and it might be on chalk or it could be on flint or gravel soil soil
(4) It’s very uhm you know solid
As regards the syntactic characterization of corpus example number (3), I agree with the canonical tripartition by classifying it as a case of initial you know with scope over the whole clause complex that it introduces. Number (4), on the other hand, is of a different nature in that in this case, you know is used with very narrow scope; it is restricted to the word solid, the head of the adjective phrase very solid. Therefore, rather than assigning this example to the category of medial position, I consider it as a case of initial you know used with local scope over an adjective functioning as the head of an adjective phrase. The pragmatic expression assumes a position that is initial relative to the element over which its scope applies. Thus, it will be observed that the term ‘initial’, in my framework,
is not restricted to clause-initial position, as is common in the literature, but is to be understood as ‘in front of the scoped element’.
Bearing the above criticism in mind, I have attempted to develop a refined syntactic classificatory approach to pragmatic expressions in which syntax is not considered in isolation but rather in relation to interpersonal, discursive and pragmatic properties. Assigning corpus data to syntactic categories must not be an end in and of itself; rather it must be a means to gain more insight into the pragmatic expressions’ functional properties. The syntax of you know and I think should be described in such a way that also discloses relevant information about what these expressions do in discourse. The syntactic classification of a pragmatic expression should not only tell us where the expression is used but also why it is used. In the next section, I will propose an approach to the syntactic description of pragmatic expressions like you know and I think that allows one to account for the interaction between formal, syntactic characteristics and functional ones. Its relevance to functional interpretations of pragmatic expressions will become apparent in section 5.
4. An alternative approach
The essence of the alternative approach to the syntactic description of pragmatic expressions is to specify the form and the function of the scoped element. The basic distinction to start from is that between clausal scope and local scope. Context plays a crucial role in identifying the extent of a pragmatic expression’s scope.
4.1 Clausal scope (A)
4.1.1 Clause functioning as speech act (A1)
Clausal scope implies that a pragmatic expression has scope over a clause or over a proform (so or not, or zero proform as with you know) substituting a clause. The function performed by this clause or proform is mostly that of a speech act. A pragmatic expression with clausal scope may be placed in initial, medial or final position in the clause (complex). Medial position means that the expression is placed in between clause constituents. (5) is an example of I think in medial position having scope over a clause functioning as a speech act. Figure 1 is a schematic representation of this use of I think.
(5) And he also I think wants time and space to himself to sort himself out
Figure 1: Clause functioning as speech act (A1)
It should be noted that it is possible for a pragmatic expression to be inserted within a phrase functioning as a clause constituent whilst holding the whole clause in its scope rather than a constituent of this phrase, which is, as we will see in 4.2.1, most commonly the case when a pragmatic expression is used at this syntactic level. This syntactic type will be referred to as a special kind of medial position, viz. intrusive medial use. It is exemplified in (6) and visually illustrated in figure 2. In this example, I think is used within a verb phrase but it would be untenable to claim that this is done because the pragmatic expression specifically qualifies have or been. The intrusive medial position is not restricted to clauses functioning as speech acts (A1’), but it can also be found in the other clausal scope categories, which will be presented below.
(6) Well the Arabs have I think been a little bit slow with the sole exception of Syria of President Assad of Syria
Figure 2: Clause functioning as speech act: intrusive medial position (A1’)
4.1.2 Clause functioning as clause constituent (A2)
It may be that a clause that is scoped by a pragmatic expression does not perform a speech act but rather that it functions as a constituent, either nominal (A2a) or adverbial (A2b), of another clause. The nominal clauses may be that-clauses, wh-clauses or nominal relative clauses and in theory also exclamative clauses, but none of those were found in the corpus. The adverbial category includes adverbial clauses and sentential relative clauses.3 (7) and (8) exemplify the use of a pragmatic expression in a nominal clause and in an adverbial clause respectively.
(7) It’s much more to do with sociocultural factors and next time I will explain why I think that these sociocultural factors are important and how they’re actually operating
(8) And so Roger we’re doing a we’re doing a we’re putting a a D on the front of each of these notes because I think it needs it really cos it’s a sort of
Figure 3: Clause functioning as clause constituent (A2a/b)
4.1.3 Clause functioning as phrase constituent (A3)
A third and final category within the clausal scope class is that in which the clause in question shifts to an even lower rank, viz. that of a phrase constituent. Most of the time the clause in question is a relative clause, as in (9):
(9) That is the fundamental premise of a new police force which I think we need in this country and should move towards
Figure 4: Clause functioning as phrase constituent (A3)
The use of I think in relative clauses, as in (9), receives attention in Simon-Vandenbergen (2000: 49), who regards occurrences of I think immediately following the relative pronoun as being used in medial position:
...the fact that in this position [immediately after a relative pronoun] I think cannot be followed by that (in contrast with initial I think) provides an argument for classifying such instances as medial rather than initial... (Simon-Vandenbergen 2000: 49)
Nevertheless, I would like to argue that this particular use of I think needs to be viewed as initial because firstly, the pragmatic expression cannot occur any earlier in the relative clause and secondly, the subordinator that can, though very marginally, sometimes be expressed. No such examples were found in the ICE-GB, but the BNC yielded a few as did the World Wide Web.4 A selection of them is listed as examples (10) to (14). The most exceptional examples are (13) and (14), in that most grammars disallow the realization of that when the function of the relative pronoun is that of subject of the relative clause (Quirk et al. 1985: 1050; Huddleston and Pullum 2002: 953).
(10) This is a point that I have made often in the House and on which I think that I have the support of the Adam Smith Institute which I hope will also be supported by many Conservative Members. (BNC HHW)
(11) On this matter, there are two central issues on which I think that those responsible must be held to account (www.publications.parliament.uk/pa/cm200304/cmhansrd/vo040720/debtext/40720-37.htm)
(12) So I wanted to draw on that kind of thing which I think that’s a very important part of Scottish culture which still exists. (www.nationaltheatrescotland.com/content/default.asp?page=s3_1_1&id=1801)
(13) Another track which I think that might be the next Wu-Banger is “R.E.C. Room”, True Master produced this track and I think that this track has a whole lot of potential. (mysite.wanadoo-members.co.uk/rnnr/uncontrolled.html)
(14) Yeah, I think there isn’t a lot to do, but I’m not stopped, I’ve e-mailed someone who I think that can help us getting the SV16 or a way to back up the firmware from the fone. (www.3g.co.uk/3GForum/archive/index.php/t-18493.html)
These marginal constructions in which the relative pronoun is a push-down element (Quirk et al. 1985: 1298) raise a number of questions. As (10) and (11) are not the only attestations in political discourse one may wonder whether the realization of that is a case of hypercorrection or hyperformality. A related question would be whether this atypical use of the subordinator is facilitated by the deliberative meaning of I think. Aijmer (1997: 21ff) distinguishes between on the one hand the tentative use of I think, which softens illocutionary force and expresses uncertainty, and on the other hand its deliberative use, heightening illocutionary force and expressing certainty and commitment. Aijmer’s criteria for differentiating between the two uses are prosodic and syntactic. As regards syntax, deliberative I think is used in initial position and followed by the subordinator that. I think used with zero that is considered tentative as are occurrences in medial or final position. The deliberative function of I think is frequently attested in political language (Simon-Vandenbergen 1998; 2000).5 Examples (10) and (11) are both deliberative and used in political settings. It may be that speakers using this uncommon construction feel that they require the subordinator in order to sound more authoritative and deliberate. Another way of looking at the constructions is to treat them as syntactic amalgams or blends (Bolinger 1961; Lakoff 1974). This seems especially plausible for examples (12) to (14), as these attestations have one foot in the spoken domain, the medium in which syntactic amalgams are most likely to occur. (12) is a transcription of an
interview and (13) and (14), coming from a personal website and an internet forum respectively, can also be thought of as standing rather close to spontaneous spoken language. The BNC data contained a few clearer cases of syntactic amalgams, one of which is given as example (15):
(15) There was an assumption that inflation would be higher than it was and that was cut back to one point five percent, which I think that I would actually support was a sensible way forward.
4.2 Local scope (B)
When a pragmatic expression is used with local scope, it mostly applies, from a formal point of view, over a phrase or a word but it may also have scope over a subclause. The function of the scoped element is either that of a clause constituent (B1) or of a phrase constituent (B2).
4.2.1 Phrase, word or subclause functioning as clause constituent (B1)
Figure 5 clarifies the notion of a pragmatic expression with scope over a phrase functioning as a constituent of the clause. We can see that, similarly to figure 1, representing a pragmatic expression in medial position with scope over the whole clause, the expression is positioned in between clause constituents. This time, however, the arrow points to one of the clause constituents rather than the entire clause. The pragmatic expression may, as in example (16), scope the clause constituent following it, which corresponds to initial position, or it may assume final position relative to the clause constituent preceding it. In the corpus, two cases of I think were found that would seem to occupy medial position in a phrase, i.e. their scope is not restricted to a phrase constituent, which would make them category (B2), but rather, it extends over the phrase that the pragmatic expression breaks up. Under (17), an illustration of this marginal phenomenon is given and it is visualized in figure 6.
(16) I mean all that and he’s he’s popped his clogs at fifty-three and he’s you know not not a particularly nice man
Figure 5: Phrase/word/subclause functioning as clause constituent (B1)
(17) Uh he later told me in the course of his cross-examination that he didn’t think there had been any change i in the ground uhm uh between the date of the accident and the photograph uh with the exception I think of the uhm the slope
Figure 6: Phrase functioning as clause constituent: medial position (B1’)
As was mentioned, the clause constituent over which a pragmatic expression with local scope applies may take the form of a clause. The importance of distinguishing between this type and categories (A2) and (A3), which also involve a pragmatic expression with scope over a subclause, will be illustrated by means of a comparison between examples (18) and (19).
(18) There was One other name I wanted to to throw in was Gerald Finzi because I think the Clarinet Concerto is the most amazing piece
(19) And you’re made uhm Archbishop of Canterbury I think because you’re thought to have done a tolerably good job as a diocesan bishop
While (18) falls under category (A2) and (19) under (B1), in both cases I think has scope over a clause which in turn functions as a constituent within a clause. Nevertheless, there is a difference in that in (18), I think functions at the level of the subclause; it is an interpersonal modification of the propositional content of the because-clause. I think here indicates that the clarinet concerto being the most amazing piece is the speaker’s personal evaluation. In (19), on the other hand, I think has an interpersonal function in the main clause; it realizes an epistemic modification that singles out the validity status of one particular constituent: an adverbial adjunct that happens to be clausal. This scopal difference also comes to expression at the formal level: in (18) I think is part of the subordinate clause and hence it is placed ‘within’ this clause, after the subordinating conjunction. I think in (19), by contrast, does not have a function inside the subclause as it is placed ‘outside’ of it, preceding the adverbial clause. The difference in scopal relationship can be illustrated by means of a reactance6 involving a cleft construction:
(20) It is because I think the Clarinet Concerto is the most amazing piece that there was another name I wanted to throw in.
(21) It is because you’re thought to have done a tolerably good job as a diocesan bishop that I think you’re made Archbishop of Canterbury.
To put I think in (19) after the conjunction would lead to an entirely different interpretation. The result would be (22) and the difference in meaning is made explicit by means of a cleft reactance in (23).
(22) And you’re made uhm Archbishop of Canterbury because I think you’re thought to have done a tolerably good job as a diocesan bishop
(23) It is because I think you’re thought to have done a tolerably good job as a diocesan bishop that you’re made Archbishop of Canterbury
The difference between the two sentences can be visualized by comparing figure 3, corresponding to (18), to figure 7, a schematic representation of a sentence like (19).
Figure 7: Local scope: clause functioning as a clause constituent (B1)
4.2.2 Phrase, word or subclause functioning as phrase constituent (B2)
In this syntactic category, the scope of the pragmatic expression is as narrow as possible; it is limited to the constituent of a phrase. The visual representation of this type bears some resemblance to both the intrusive medial position (A1’) and to the exceptional use of a pragmatic expression in medial position relative to the phrase that it scopes (B1’). In all three cases, the pragmatic expression is inserted within a phrase but the difference resides in its scope. In the case of intrusive medial position, the scope is the widest as it encompasses the whole clause. (B1’) involves a somewhat narrower scope, viz. phrasal scope, and in (B2) the scope is at its narrowest. Here only one particular word or phrase that is part of the disrupted phrase falls within the scope of the pragmatic expression. This is the most common scopal behaviour when a pragmatic expression is placed inside a phrase, both for you know and for I think.
Figure 8: Phrase/word/subclause functioning as phrase constituent (B2)
(24) Uh in the uhm I think October issue of Computational uh Linguistics there’s an attempt to do something of this type
Similarly to (B1’), also in this category it may happen that a pragmatic expression is realized within a phrase without scoping any single constituent of this phrase but rather the phrase as a whole, only in this case the phrase in turn functions as the constituent of another phrase, as exemplified in (25). This means that within (B2), the category of local scope over a phrase constituent, we need to allow for an admittedly very marginal category of medial position.
(25) A day a meal for as long as you like in the imagination of a generation like ours obsessed I think with the attempt to put themselves back in past time as well as to live intensely as they do in the present
It can be seen that the phrase within which I think in (25) occupies medial position is a non-finite subordinate clause functioning as a postmodifier. So in this subcategory of local scope too the scoped element can be a clause.
5. Data discussion and findings
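Before turning to the findings, the categories of section 4 can be recapitulated in a compact form. The following Python sketch is purely illustrative and not part of the original analysis: the class and attribute names (Scope, Function, Position, Token) are assumptions introduced here, and the primed subtypes (A1’, B1’) are left out for brevity. It only shows how a category label follows from the scope type and the function of the scoped element, with position recorded relative to that element rather than to the clause.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    CLAUSAL = "A"  # scope over a clause (or a proform standing in for one)
    LOCAL = "B"    # scope over a phrase, word or subclause


class Function(Enum):
    SPEECH_ACT = "speech act"
    CLAUSE_CONSTITUENT = "clause constituent"
    PHRASE_CONSTITUENT = "phrase constituent"


class Position(Enum):
    # Position is relative to the scoped element, not to the clause.
    INITIAL = "initial"
    MEDIAL = "medial"
    FINAL = "final"


# Category numbers per scope type, following the labels of section 4.
_NUMBERING = {
    Scope.CLAUSAL: {
        Function.SPEECH_ACT: 1,
        Function.CLAUSE_CONSTITUENT: 2,
        Function.PHRASE_CONSTITUENT: 3,
    },
    Scope.LOCAL: {
        Function.CLAUSE_CONSTITUENT: 1,
        Function.PHRASE_CONSTITUENT: 2,
    },
}


@dataclass(frozen=True)
class Token:
    """One classified corpus occurrence (hypothetical record layout)."""
    expression: str
    scope: Scope
    function: Function
    position: Position

    def category(self) -> str:
        """Return the category label, e.g. 'A1' or 'B2'."""
        try:
            number = _NUMBERING[self.scope][self.function]
        except KeyError:
            raise ValueError(
                f"{self.scope.name} scope does not combine with "
                f"function '{self.function.value}'"
            ) from None
        return f"{self.scope.value}{number}"


# Classifications as stated in the text for examples (4), (5), (9) and (19).
EXAMPLES = {
    4: Token("you know", Scope.LOCAL, Function.PHRASE_CONSTITUENT, Position.INITIAL),
    5: Token("I think", Scope.CLAUSAL, Function.SPEECH_ACT, Position.MEDIAL),
    9: Token("I think", Scope.CLAUSAL, Function.PHRASE_CONSTITUENT, Position.INITIAL),
    19: Token("I think", Scope.LOCAL, Function.CLAUSE_CONSTITUENT, Position.INITIAL),
}

if __name__ == "__main__":
    for number, token in EXAMPLES.items():
        print(f"({number}) {token.expression}: {token.category()}, {token.position.value}")
```

Keeping scope, function and position as separate fields mirrors the point made in section 3 that a bare label such as ‘medial’ says nothing about what the expression has in its scope.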
I will now discuss the most important findings that came out of the classification of the ICE-GB data of you know and I think following the alternative approach presented in section 4.
5.1 More local you know than local I think
Upon comparing the scores for clausal as opposed to local scope of you know and I think, one immediately notices that local you know is used with much higher frequency than local I think. As table 1 shows, local you know accounts for nearly 37% of all uses as opposed to a mere 5.42% of local I think.
Table 1: Clausal and local uses of you know and I think
            clausal          local
you know    63.09% (682)     36.91% (399)
I think     94.58% (1640)    5.42% (94)
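
The percentages in table 1 follow directly from the raw token counts given in brackets. As a minimal sketch, assuming nothing beyond those counts and the size of the spoken component reported in section 2, the Python fragment below recomputes the shares and adds a normalised frequency per 10,000 words. The function names are illustrative and the normalised figures are not reported in the study itself.

```python
# Raw token counts as reported in table 1 (ICE-GB, spoken component).
COUNTS = {
    "you know": {"clausal": 682, "local": 399},
    "I think": {"clausal": 1640, "local": 94},
}
SPOKEN_WORDS = 637_562  # size of the spoken section as given in section 2


def scope_shares(counts):
    """Return the percentage of clausal and local uses per expression."""
    shares = {}
    for expression, by_scope in counts.items():
        total = sum(by_scope.values())
        shares[expression] = {
            scope: round(100 * n / total, 2) for scope, n in by_scope.items()
        }
    return shares


def per_10k(count, corpus_size=SPOKEN_WORDS):
    """Normalised frequency: occurrences per 10,000 words."""
    return 10_000 * count / corpus_size


if __name__ == "__main__":
    for expression, shares in scope_shares(COUNTS).items():
        total = sum(COUNTS[expression].values())
        print(expression, shares, f"{per_10k(total):.1f} per 10,000 words")
```

Working from the raw counts rather than the rounded percentages keeps the comparison between the two expressions independent of their different overall frequencies.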
This observation extends the following claim about you know made by Erman (1987: 98):
[you know] has a narrower scope than the other two PEs [pragmatic expressions] [I mean and you see]. … you know tends to be used more locally (e.g. between and within constituents) than the other two PEs.
On the basis of the ICE-GB data, we can add to Erman’s claim that you know tends to be used with narrower scope not only than I mean and you see, but also than
I think. In the following two sections, explanations for this divergent behaviour of you know and I think will be provided by relating the pragmatic expressions’ syntactic properties to their interpersonal and discursive functions.
5.1.1 I think as an epistemic expression
The functional explanation for the tendency of I think to be used with clausal scope resides in its function as an expression of epistemic modality. It was characterized as such by Thompson and Mulac (1991a; 1991b), Aijmer (1997) and Thompson (2002), to name but a few. In Systemic Functional Linguistics I think is known under the name of ‘interpersonal grammatical metaphor’, i.e. a metaphorical realization of epistemic modality, which is congruently expressed by modal auxiliaries and adverbs (Halliday 1985; Halliday and Matthiessen 2004). Being one of the TAM properties, modality is inherent in the finite clause and consequently it is not surprising that I think tends to have clausal scope. On closer examination of the 94 I think tokens that were analysed as having local scope, it turns out that 33 of these, or 35.11%, are ellipses. This means that the scoped element is used on its own, but it stands for a finite clause that can be recovered from the context. The number of ellipses in the you know data is much lower; it amounts to 16.54% of local you know. It would not be entirely indefensible to treat these elliptic uses as clausal ones as one is in fact expected to infer a clause from them. With reference to ik denk / denk ik, the Dutch equivalent of I think, Nuyts (1994: 81) observes that some cases that he was inclined to classify as parenthetical in fact can also be analysed as elliptical complement-taking uses. An example of elliptic I think is given under (26). On the basis of the utterance preceding the ellipsis, a clause like My mum is coming back on Friday I think can be inferred.
(26) B: When’s your Mum coming back
     A: Uh Friday I think
In spite of its tendency to have clausal scope, it is nevertheless possible for I think to problematise the validity status of one particular piece of information rather than the clause as a whole. In example (27), the speaker expresses uncertainty specifically about the name of the town that was being besieged by the Scots:

(27) The the s the Scots were besieging I think uh uh Berwick and Edward whoever it was at the time came out to relieve it
5.1.2 You know as a local marker of metalinguistic awareness

The proclivity of you know for local scope can also be explained by having a closer look at its functional properties. In the literature, one of the recurrent core meanings that are attributed to you know is that of negotiating common ground (Östman 1981; Schiffrin 1987). The data suggest that you know is usually not used to create a general sense of common ground, or what Östman called a “you-know mood” (1981). Rather, you know has the potential to locally create common ground. Local you know is used when the common ground status of particular, rather small items in the information structure needs to be established by the speaker and acknowledged by the hearer. It is used to introduce specific pieces of discourse into the realm of common ground.

To be even more precise about the nature of this common ground, most local uses of you know negotiate common ground at the metalinguistic level. A lot of the common ground that speaker and hearer make use of in conversation is of a metalinguistic nature; both parties draw on the knowledge that they share about the language that they are using, especially, in the case of you know, about its lexical possibilities and constraints. The metalinguistic component of common ground will be referred to as metalinguistic awareness and local you know will be subsumed under the umbrella term of ‘marker of metalinguistic awareness’7, which is similar to what Verschueren (2000: 445) called “metapragmatic markers […], which draw attention to the lexical choice-making itself, as a kind of warning against unreflective interpretation”. Local you know works at the level of metalinguistic awareness in that it draws attention to the speaker’s process of lexical selection and the hearer’s acceptance of this choice. The pragmatic expression indicates that the speaker is drawing on their metalinguistic awareness to produce a particular phrase or word and at the same time you know constitutes a request to the hearer to also make use of their metalinguistic awareness in order to understand the speaker’s communicative intentions. It is important to note that local you know is not just about the speaker’s selection of the right words. In fact, you know is highly hearer-oriented.
Selecting the right expression and arriving at the right meaning is a joint enterprise. By using you know the speaker makes a request for cooperation and benevolence on the part of the hearer. The speaker wants the hearer to accept their choice of wording and appeals to the hearer and to the hearer’s metalinguistic awareness to accept their lexical choice. Hence, the hearer is actively involved in the process of creating meaning. Such a cooperative relationship between speaker and hearer presupposes solidarity. No instances were found in the corpus that could be interpreted as authoritarian. The example provided below illustrates the active involvement of the hearer in the process of lexical selection and creating meaning. It can be seen how the hearer supplies the words that the speaker is having difficulty producing.

(28) A: I’m look I’m quite looking forward to seeing them again They’re quite you know Quite n...
     B: Very nice guys
     A: Yeah they are definitely
According to some scholars, what you know essentially does is to invite addressee inferences (Jucker and Smith 1998; Fox Tree and Schrock 2002). With respect to local you know, we can say that this pragmatic expression suggests that the hearer, on the basis of the common ground s/he shares with the speaker, could have inferred the word, phrase, or clause marked by you know by themselves. In order to infer something, people need to use the resources of their background knowledge, i.e. the common ground they share with their interlocutors, which, in this case, is background metalinguistic knowledge.

The overall category of marker of metalinguistic awareness can be differentiated into four functional subcategories, which will be dealt with one by one below.

i) Online planning activities

You know as a marker of metalinguistic awareness can be used as a device that helps speakers plan their utterance as they go along. Its use as a repair marker was described by Schourup (1985), Holmes (1986) and Erman (1987). Schourup (1985) pointed out that repairs performed by means of you know suggest that the hearer could have inferred the repair himself/herself. That is why a repair like the one in (29) does not seem to work:

(29) ? I got a dog you know cat for my birthday. (Schourup 1985: 122)
According to Schourup, the change from dog to cat is too radical and would have been more felicitously realized by I mean. It will be seen that performing a repair involves both interlocutors’ metalinguistic awareness in the sense that the speaker, who decides that they did not use the right word, asks the hearer to accept the dismissal of it and its substitution with another. The examples listed under (30) and (31) illustrate the use of local you know in repair sequences.

(30) Not not as bad you know not as stiff as the other ones

(31) I’m just wearing leggings and my big baggy you know my big green V-neck jumper...
Local you know can also be found in repetition sequences, as illustrated in (32) and (33). Not infrequently, a function word, such as a determiner (e.g. (32)), or a preposition or a combination of the two (e.g. (33)), is realized before you know and repeated after it, postponing the realization of a content word, most typically a noun or NP. Local you know mostly has scope over a noun or an NP, viz. in 42.11% of all local uses.

(32) reading that uh you know that book you gave me on Stephen that Stephen King book as well

(33) And I said well maybe you know I’ll look in my you know in my diary
Local you know is often surrounded by hesitation markers such as pauses and ‘fumbles’ like uhm and uh. This also points towards online planning activities.

(34) They look as if they’ve all had a quick turn under the steam roller uh and yours have that same quality in that you’ve made up your decisions which are you know stylized

(35) He used he used to be quite portly you know
The above types of online planning phenomena all involve lexical searching. In prescriptivist and lay discourse they are commonly referred to, rather irreverently, as ‘fillers’ (cf. Fox Tree 2007).

ii) Creative language

Speakers sometimes use local you know to draw their interlocutor’s attention to the fact that they are using expressive or “creative language” (Aijmer 2002). They signal that they are using figurative language or an unconventional turn of phrase and they request the hearer to accept their metaphor, comparison or imaginative use of language that may require an extra processing effort. The speaker indicates that the hearer will also need to be creative with their background metalinguistic knowledge in order to appreciate the meaning of what the speaker is saying. Examples (36) and (37) illustrate this usage of you know. In (36), language is used metaphorically seeing that the interlocutors are talking about painting.

(36) He’s developed a sort of a you know a language

(37) It’s like uhm I mean it’s a sort of minor version of uhm you know Paul Vining’s moustache cum beard
We can see in examples (36) and (37) that local you know as a marker of creative language tends to collocate with sort of. According to Willemse et al. (2007), sort of or kind of may also mark creative language. This does not mean, however, that you know and sort of are mutually interchangeable or that the realization of one rather than the other is entirely random. Evidence for motivated use of pragmatic expressions can be found in Fox Tree (2007). In this study involving informant testing, Fox Tree demonstrates that speakers have distinct notions about the meanings of the pragmatic expressions um, uh, you know and like.
iii) Metalinguistic distancing

When a speaker uses you know as a metalinguistic distancing device, they apologize for not having chosen the most appropriate term. They want to make it clear to the hearer that they are not altogether happy with the way they have put things. Quite commonly, speakers distance themselves metalinguistically from a certain word or phrase because they have selected an expression from a different register from the one they consider appropriate. The metalinguistic distancing function of you know shows similarities to that of like as described by Andersen (2001: 243):

    (…) like can be construed as a signal that the expression the speaker chooses may not be the most appropriate one, and that an alternative expression might communicate her ideas more efficiently (…) Analogously, like can be construed as a signal that the chosen expression does not fit readily into the linguistic repertoire of the speaker, i.e. that the speaker feels a minor discomfort with its use. (…) The potential alternative might be for instance a stylistically different expression (…).

In the examples provided below, the reasons for metalinguistic distancing are stylistically motivated. The respective speakers of (38) and (39) do not feel entirely comfortable using the rather slangy or crude expressions pigged off and chucked in.

(38) If I’m sort of you know pigged off with things at school I will pick up Pride and Prejudice (…)

(39) So I thought God damn it if I ever get close to walking up the aisle and then I get you know chucked in I’ll be I’ll have a nervous breakdown
At this point in the discussion of the functionality of local you know, I consider it appropriate to go into a short digression to point out some commonalities between the functions of this pragmatic expression. The metalinguistic distancing function of you know borders on what I would like to call quotative you know. This usage of you know has not been discussed yet in this article as it tends to be used with scope over a whole speech act and an in-depth discussion of you know with clausal scope falls outside the scope of this study. Nevertheless, I consider it legitimate to attribute a quotative function to you know. It can aid in demarcating the speaker’s own words from quoted discourse, as in (40), which can be interpreted as an act of distancing oneself at a metalinguistic level from somebody else’s words or thoughts or from his/her own words or thoughts at some previous time.8

(40) And I felt like turning around and sort of saying you know well whose fault is that
Interestingly, the marker sort of / kind of has also been found to have a quotative function (Willemse et al. 2007). This would be the second characteristic that you know and sort of / kind of have in common, besides marking creative language. In fact, the quotative function of you know and sort of / kind of bleeds into that of marking expressive or creative language. The quoted sequences are often cases of creative language. In this respect, the expressions resemble the well-known quotative like, which does not so much render somebody’s words verbatim as the overall ‘feel’, i.e. the emotions, attitudes and dramatic effect of what was said. The like quotative may, for example, also frame non-speech sounds and facial expressions (Fairon and Singler 2006: 326). In example (41), quotative you know is used expressively in that the speaker more than likely wishes to convey his feeling of desperation rather than the words that he actually uttered.

(41) and I used to kind of say you know please t please God get me out of this
iv) Expansion

The fourth and final usage of local you know as a marker of metalinguistic awareness is the expansive function, to be understood in the Hallidayan sense of the term (Halliday 1985, Halliday and Matthiessen 2004). It entails that you know is used to add extra information that elaborates a preceding concept, mostly as an apposition (42) or a clarification, as in (43).

(42) Uh but I thought that one you know the brie de Meaux ‘s quite good isn’t it

(43) He was doing it with Golden Grahams You know the breakfast cereal
Admittedly, the expansive category is the least metalinguistic one; it does not so much appeal to the interactants’ knowledge of the language as to their knowledge of the world. Nevertheless, the door to the metalinguistic realm remains open seeing that appositional you know shades off into reformulation and repair, activities that involve metalinguistic awareness. The thin line between apposition and reformulation or repair is illustrated in examples (44) and (45).

(44) He said he wanted to serve the Government uh you know support the Government

(45) But she means involved with other people you know prepared to take an interest in
5.2 Local you know: mostly in initial position
We have just seen that you know is used with local scope more often than I think. In this section, a second striking quantitative observation will be related to the functional properties of the two pragmatic expressions. Both you know and I think are mostly used in initial position. The percentages for initial position, regardless of scope, are 71.97% and 85.70% respectively. However, with clausal scope, I think is used in initial position more often than you know but when used with local scope, you know assumes initial position most often, as becomes clear in Table 2.

Table 2: Initial uses of you know and I think

              clausal          local
you know      65.69% (448)     82.71% (330)
I think       87.38% (1433)    56.38% (53)
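The figures in Table 2 can be cross-checked against the overall percentages quoted in the text. The following Python sketch is offered only as an illustration and is not part of the original analysis. Apart from the 94 local I think tokens mentioned in section 5.1.1, the category totals are not stated in the text, so they are back-computed here from the reported counts and percentages; that step, and all names in the code, are assumptions.

```python
# Initial-position counts and percentages as reported in Table 2.
# Key: (expression, scope) -> (initial tokens, reported percentage)
table2 = {
    ("you know", "clausal"): (448, 65.69),
    ("you know", "local"): (330, 82.71),
    ("I think", "clausal"): (1433, 87.38),
    ("I think", "local"): (53, 56.38),
}


def implied_total(initial, percent):
    """Back-compute the category total implied by a count and its percentage.

    Assumption: the reported percentage is initial / total, rounded to
    two decimal places, so the nearest integer recovers the total.
    """
    return round(initial * 100 / percent)


def overall_initial_share(expression):
    """Percentage of initial uses for one expression, regardless of scope."""
    rows = [value for (expr, _scope), value in table2.items() if expr == expression]
    initial = sum(count for count, _pct in rows)
    total = sum(implied_total(count, pct) for count, pct in rows)
    return round(100 * initial / total, 2)


for expression in ("you know", "I think"):
    print(expression, overall_initial_share(expression))
```

Under these assumptions the sketch reproduces the overall shares of 71.97% and 85.70% given above, and the back-computed total for local I think agrees with the 94 tokens reported earlier.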
Since local you know as a marker of metalinguistic awareness is used “as a kind of warning against unreflective interpretation” (Verschueren 2000: 445, my italics), it should not come as a surprise that it has a proclivity for preceding the element for which it provides a warning. It gives the speaker more time to think and it signals to the hearer that his/her active involvement in the decoding process is called upon. A speaker’s uncertainty as to the epistemic status of a piece of information, on the other hand, may be added after the informational chunk about which one is not sure, as a kind of ‘afterthought’. I think, when used locally, can be tagged on to a word or phrase to whose truth value the speaker decides that, on second thoughts, s/he does not want to commit. (46) is an example of I think occupying final position in relation to a noun.

(46) The house knows that this matter may be debated on the Queen’s speech specifically tomorrow and again on uh Monday I think
6. Conclusion
Since the canonical tripartite system for syntactically classifying pragmatic expressions like you know and I think was found inadequate in its neglect of the notion of scope and syntactic levels, an alternative approach to the issue of the syntactic positions of this type of expression was attempted. The results of a classification of spoken ICE-GB data following this system were found to throw more light on the functional properties of you know and I think and, conversely, insights into the pragmatic expressions’ functions helped explain their syntactic patterns.

The discussion of local you know, in particular, suggests that some views commonly held about this pragmatic expression, in the first place among laypeople but also, to some extent, among linguists, need to be revised. The insights provided in this article constitute an argument against the notion of ‘random sprinkling’, according to which, as explained in Fox Tree and Schrock (2002), you know can be scattered through the discourse at random. It would mean that it does not matter exactly where the expression occurs, as long as it creates a casual atmosphere (cf. Östman’s notion of a “you-know mood” (1981)). It would seem, instead, that you know is used at strategic, critical points in the discourse which require heightened metalinguistic awareness and where the common ground requires local remedying. To this should be added that no cases of you know in intrusive medial position were found in the data and that even regular medial you know was quite rare (4 occurrences). These observations further strengthen the claim that when you know is inserted somewhere, it is there for a reason.

You know is sometimes referred to as an imprecision marker or what James calls a “compromiser” (1983). This view requires some subtle qualification. On the basis of the above discussion, we can say that you know is not used by a speaker who is deliberately being imprecise. Rather, it is used by a speaker who has great concern with being understood by the hearer; the speaker tries to facilitate efficient communication by substituting one word with another word that they consider more effective, by warning the hearer that they are not using the most conventional of expressions or by adding extra information to be absolutely sure that the hearer grasps his/her communicative intentions.
You know can be considered an imprecision marker only to the extent that it is used by a speaker who aims to communicate adequately and clearly but who apologizes for perhaps not always reaching their goal.

Given the interesting relationship between syntax and pragmatic, interpersonal or discursive functions that came to expression in this study, the possibilities of the alternative syntactic classificatory system will be further explored in future work on related pragmatic expressions composed of cognitive verbs, e.g. I suppose, I guess and I believe.

Notes

1 I would like to thank my supervisor, Anne-Marie Simon-Vandenbergen, and co-supervisor, Miriam Taverniers, for their constructive comments on this study. Needless to say, the responsibility for any errors is entirely mine.

2 ‘Scope’ needs to be understood as a largely semantic notion rather than a strictly syntactic one. It is to be understood as “the stretch of language affected by the meaning of a particular form” (Crystal 1985: 308) or the way McGregor defined it (1997: 209ff). He distinguished between three types of syntagmatic relationship: constituency, dependency and conjugation. The third category characterizes, defines and is defined by the interpersonal semiotic. In a conjugational relationship, one unit ‘shapes’ the other, indicating how it is to be taken by the addressee. Within each of the three syntagmatic relationships, two dimensions are possible: scoping and framing. Scoping means that a unit applies over a certain domain, leaving its mark on the entirety of this domain.

3 Admittedly, the status of sentential relative clauses as adverbial clauses is debatable. It shares characteristics with both content disjuncts and with non-restrictive relative clauses (Quirk et al. 1985: 1120).

4 A search restricted to pages from the UK was conducted on www.google.co.uk.

5 In Van Bogaert (2006) it is demonstrated that I believe can also be used with either a tentative or a deliberative meaning and like I think, deliberative I believe is typical of political language.

6 The term ‘reactance’ is used as in Whorf’s (1945) terminology to mean that certain syntactic differences are not noticeable in the surface forms of utterances but only come to expression in different ‘reactions’ to particular syntactic operations.

7 Both ‘common ground’ and ‘metalinguistic awareness’ need to be understood not so much as pre-existing constructs but as dynamic concepts that come into being as the discourse unfolds. That is why the term ‘marker’ to refer to you know in these contexts is somewhat infelicitous as marking something presupposes that what is being marked is already there.

8 When used as a quotative, you know is usually accompanied by additional expressions with a quotative function, such as she said in the example given.
References

Aijmer, K. (this volume), ‘Does English have modal particles?’.
Aijmer, K. (2002), English discourse particles. Amsterdam: Benjamins.
Aijmer, K. (1997), ‘I Think - an English modal particle’, in: T. Swan and O.J. Westwik (eds.) Modality in Germanic languages: Historical and comparative perspectives. Berlin & New York: Mouton de Gruyter. 1-47.
Andersen, G. (2001), Pragmatic markers and sociolinguistic variation. Amsterdam: Benjamins.
Bernstein, B. (1971), Class, codes and control 1: Theoretical studies towards a sociology of language. London: Routledge and Kegan Paul.
Biber, D. et al. (1999), Longman grammar of spoken and written English. London: Longman.
Blanche-Benveniste, C. and D. Willems (2007), ‘Un nouveau regard sur les verbes à rection faible’, Bulletin de la Société de Linguistique de Paris, 102.1: 217-254.
Bolinger, D. (1961), ‘Syntactic blends and other matters’, Language, 37: 366-381.
Crystal, D. (1985), A dictionary of linguistics and phonetics. 2nd ed. Oxford: Blackwell.
Dendale, P. and J. Van Bogaert (2007), ‘A semantic description of French lexical evidential markers and the classification of evidentials’, in: M. Squartini (ed.) Evidentiality between Lexicon and Grammar. Thematic issue of Italian Journal of Linguistics / Rivista Di Linguistica, 19.1.
Erman, B. (1987), Pragmatic expressions in English: A study of you know, you see and I mean in face-to-face conversation. Acta Universitatis Stockholmiensis/Stockholm Studies in English. Stockholm: Almqvist & Wiksell International.
Erman, B. (2001), ‘Pragmatic markers revisited with a focus on you know in adult and adolescent talk’, Journal of Pragmatics, 33: 1337-59.
Fairon, C. and J.V. Singler (2006), ‘I’m like “Hey, it works!”: Using Glossanet to find attestations of the quotative (be) like in English-language newspapers’, in: A. Renouf and A. Kehoe (eds.) The Changing Face of Corpus Linguistics. Amsterdam: Rodopi.
Fox Tree, J.E. (2007), ‘Folk notions of um and uh, you know and like’, Text and Talk, 27.3: 297-314.
Fox Tree, J.E. and J.C. Schrock (2002), ‘Basic meanings of you know and I mean’, Journal of Pragmatics, 34: 727-47.
Halliday, M.A.K. (1985), An introduction to functional grammar. London: Arnold.
Halliday, M.A.K. and C.M.I.M. Matthiessen (2004), An Introduction to Functional Grammar. London: Arnold.
He, A.W. and B. Lindsey (1998), ‘“You know” as an information status enhancing device: Arguments from grammar and interaction’, Functions of Language, 5.2: 133-55.
Holmes, J. (1986), ‘Functions of you know in women’s and men’s speech’, Language in Society, 15: 1-22.
Holmes, J. (1990), ‘Hedges and boosters in women’s and men’s speech’, Language and Communication, 10.3: 185-205.
Hooper, J.B. (1975), ‘On Assertive Predicates’, in: J.P. Kimball (ed.) Syntax and semantics. Vol. 4. New York: Academic Press. 91-124.
Huddleston, R. and G.K. Pullum (2002), The Cambridge Grammar of the English language. Cambridge: Cambridge University Press.
Huspek, M. (1989), ‘Linguistic variability and power: An analysis of you know/I Think variation in working-class speech’, Journal of Pragmatics, 13: 661-83.
James, A.R. (1983), ‘Compromisers in English: A cross-disciplinary approach to their interpersonal significance’, Journal of Pragmatics, 7: 191-206.
Jespersen, O. (1937), Analytic syntax. Copenhagen: Levin & Munksgaard.
Jucker, A.H. and S.W. Smith (1998), ‘And people just you know like ‘wow’: Discourse markers as negotiating strategies’, in: A.H. Jucker and Y. Ziv (eds.) Discourse Markers: Description and Theory. Pragmatics and Beyond New Series. Amsterdam: Benjamins. 171-201.
Kärkkäinen, E. (2003), Epistemic stance in English conversation: A description of its interactional functions, with a focus on I think. Amsterdam: Benjamins.
Lakoff, G. (1974), ‘Syntactic amalgams’, in: M. La Galy, R.A. Fox and A. Bruck (eds.) Papers from the tenth regional meeting of the Chicago linguistic society. Chicago IL: CLS. 321-44.
McGregor, W. (1997), Semiotic Grammar. Oxford: Clarendon Press.
Nuyts, J. (1994), Epistemic Modal Qualifications: On Their Linguistic and Conceptual Structure. Antwerp Papers in Linguistics 81. Antwerp: UIA.
Östman, J.-O. (1981), You know: A discourse functional approach. Pragmatics and Beyond. Amsterdam: Benjamins.
Palander-Collin, M. (1999), Grammaticalization and social embedding: I think and methinks in Middle and Early Modern English. Helsinki: Tome LV.
Peltola, N. (1983), ‘Comment clauses in present-day English’, in: I. Kajanto (ed.) Studies in classical and modern philology. Helsinki: Suomalainen Tiedeakatemia. 101-13.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive grammar of the English language. London: Longman.
Ross, J.R. (1973), ‘Slifting’, in: M. Gross, M. Halle and M. Schützenberger (eds.) The formal analysis of natural language. The Hague: Mouton.
Scheibman, J. (2002), Point of view and grammar: Structural patterns of subjectivity in American English conversation. Amsterdam: Benjamins.
Schiffrin, D. (1987), Discourse markers. Cambridge: Cambridge University Press.
Schneider, S. (2007), Reduced parenthetical clauses as mitigators: A corpus study of spoken French, Italian and Spanish. Amsterdam: Benjamins.
Schourup, L.C. (1985), Common discourse particles in English conversation. New York: Garland.
Simon-Vandenbergen, A.-M. (1998), ‘I think and its Dutch equivalents in parliamentary debates’, in: S. Johansson and S. Oksefjell (eds.) Corpora and crosslinguistic research: Theory, method and case studies. Amsterdam: Rodopi. 297-317.
Simon-Vandenbergen, A.-M. (2000), ‘The functions of I think in political discourse’, Journal of Applied Linguistics, 10.1: 41-63.
Simon-Vandenbergen, A.-M. (2002), ‘I think - a Marker of Middle Class Discourse?’, in: E. Kärkkäinen and T. Lauttamus (eds.) Studia Linguistica et Litteraria Septentrionalia. Studies presented to Heikki Nyyssönen. Oulu: Oulu University Press. 93-106.
Stenström, A.-B. (1995), ‘Some remarks on comment clauses’, in: B. Aarts and C.F. Meyer (eds.) The verb in contemporary English. Cambridge: Cambridge University Press. 290-301.
Stubbe, M. and J. Holmes (1995), ‘You know, eh and other ‘exasperating expressions’: An analysis of social and stylistic variation in the use of pragmatic devices in a sample of New Zealand English’, Language and Communication, 15.1: 63-88.
Tagliamonte, S. and J. Smith (2005), ‘No Momentary Fancy! The zero ‘complementizer’ in English dialects’, English Language and Linguistics, 9.2: 289-309.
Thompson, S.A. (2002), ‘“Object complements” and conversation: Towards a realistic account’, Studies in Language, 26.1: 125-64.
Thompson, S.A. and A. Mulac (1991a), ‘A quantitative perspective on the grammaticalization of epistemic parentheticals in English’, in: E.C. Traugott and B. Heine (eds.) Approaches to Grammaticalization. Vol. 2. Amsterdam: Benjamins. 313-39.
Thompson, S.A. and A. Mulac (1991b), ‘The discourse conditions for the use of the complementizer that in conversational English’, Journal of Pragmatics, 15: 237-51.
Urmson, J.O. (1952), ‘Parenthetical verbs’, Mind, 61: 480-96.
Van Bogaert, J. (2006), ‘I guess, I suppose and I believe as pragmatic markers: Grammaticalization and functions’, BELL New Series, 4: 129-49.
Verschueren, J. (2000), ‘Notes on the role of metapragmatic awareness in language’, Pragmatics, 10.4: 439-56.
Watts, R.J. (1989), ‘Taking the pitcher to the well: Native speakers’ perceptions of their use of discourse markers in conversation’, Journal of Pragmatics, 13: 203-37.
Whorf, B. (1945), ‘Grammatical Categories’, Language, 21: 1-11.
Willemse, P., K. Davidse and L. Brems (2007), ‘Synchronic layering of type nouns in English and French’, Paper presented at the 28th ICAME conference. Stratford-upon-Avon, 23rd-27th May 2007.
The functions of expletive interjections in spoken English

Magnus Ljung
University of Stockholm

Abstract

This paper is a study of the functions of ten common expletive interjections in a 1 million-word sub-corpus from the spoken component of the BNC. The findings indicate that about a hundred interjections function as release mechanisms for mostly negative feelings triggered by real-world experiences. The rest are shown to be pragmatic markers as these have been defined in the recent literature and are analysed mainly in terms of the discourse-based analytic model used in Stenström (1994).
1. Introduction
The aim of this study is to explore the use of expletive interjections in modern spoken British English as represented in the spoken component of the BNC. The study focuses on ten common expletive interjections in a sub-corpus made up of 26 conversation texts from the spoken component of the BNC, viz. KB0 – KB9, KBA-KBN, KBP, KBR, KNT. The sub-corpus contains 1,000,015 words and has for obvious reasons been named Conv1M. The ten expletive interjections that are the focus of my study are bugger, Christ, cor, damn, fuck, god, gosh, hell, Jesus, and shit. The study considers only interjectional uses of these words and consequently ignores their use as nouns, adjectives, adverbs and as “filler material” in expressions like What the fuck, Who the hell, etc. On the other hand I have included in my study both single interjections like Bugger!, Cor!, Hell! etc. and - with one exception - collocations containing these words which are functionally interchangeable with the single interjections. The exception is God, which occurs in such a plethora of collocations as to make their inclusion impractical. Here I include only the most common collocation with God, viz. Oh God! Table 1 provides a full account of my data.

Table 1: Expletive interjections included in the study

BUGGER    14    Bugger! 1, Oh bugger! 1, Bugger it! 6, Bugger me! 2, Bugger + NP/pron 4
CHRIST    35    Christ! 11, Ah Christ! 1, By Christ 2, By bloody Christ! 1, Cor Jesus Christ! 1, For Christ’s sake(s)! 7, Jesus Christ! 3, Jesus bloody Christ! Oh Christ! 8, Oh Jesus Christ! 1
COR       82    Cor! 74, Cor blimey! 2, Cor bloody hell! 1, Cor Jesus Christ 1, Cor strewth! 4
DAMN       4    Damn! 2, Damn it! 2
FUCK      14    Fuck! 2, Fuck 4, Fuck it! 3, Fuck me! 3, Oh fuck! 1, Oh fuck me! 1
OH GOD   179    Oh God! 179
GOSH      30    Gosh! 19, Ah gosh! 1, Oh gosh! 9, Oh my gosh! 1
HELL     121    Hell! 1, Bleeding hell! 1, Bloody hell! 70, By hell! 2, Fucking hell! 12, Oh hell! 3, Sodding hell! 1, God flipping hell! 1, Flipping hell! 1, Oh fucking hell! 2, Oh bloody hell! 26, Oh, oh bloody hell! 1
JESUS     16    Jesus! 7, Ah Jesus! 1, Ah Jesus Christ! 1, Jesus wept! 1, Jesus bloody Christ! 1, Jesus Christ! 1, Oh Jesus 2, Oh Jesus Christ! 1, Cor Jesus Christ! 1
SHIT      18    Shit! 11, Oh shit! 7
Total    513
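As a quick consistency check on Table 1, the ten category totals can be summed and related to the size of the Conv1M sub-corpus. The Python sketch below is illustrative only and not part of the original study: the dictionary and function names are hypothetical, and the figures are simply those reported in the table and in the description of the sub-corpus.

```python
# Category totals as reported in Table 1.
category_totals = {
    "BUGGER": 14, "CHRIST": 35, "COR": 82, "DAMN": 4, "FUCK": 14,
    "OH GOD": 179, "GOSH": 30, "HELL": 121, "JESUS": 16, "SHIT": 18,
}

CORPUS_WORDS = 1_000_015  # reported size of the Conv1M sub-corpus


def words_per_interjection(n_interjections, n_words=CORPUS_WORDS):
    """Average number of running words per interjection."""
    return n_words / n_interjections


total = sum(category_totals.values())
print(total)                                 # sum of the ten categories
print(round(words_per_interjection(total)))  # running words per selected interjection
```

With the reported figures the categories sum to 513, and the rate works out at about one selected interjection per 1,950 running words.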
As Table 1 shows, the total number of expletive interjections in my study is 513, which means that the speakers in the study produce roughly one of the selected expletive interjections per 2,000 words. However, the total number of expletive interjections in the Conv1M corpus is much higher than that. Merely including all existing interjectional combinations with the word God would have added another 322 instances. If we also add the motley crew of euphemistic interjections alluding to God, the sum total would rise to about 900 and the production rate for these particular interjections would be very close to one per 1,000 words.

2. Subjectivity, interactivity, textuality
The fact that the present paper is about expletive interjections in a way makes it a study of swearing, a large and somewhat ill-defined area of language that has lately attracted the attention of a number of linguists. The last few years, for instance, have seen a number of studies on swearing in British English, for example McEnery (2005), McEnery and Xiao (2003), (2004). These studies offer valuable information about the typology and sociolinguistics of English swearing and provide fascinating historical accounts of British attitudes towards swearing over the years. As I have already mentioned, the aim of the present study is different. The question I want to address here is why people swear, more specifically what functions expletive interjections serve in spoken English. It may seem that there is an obvious answer to my question: a generally held view of interjections and in particular of expletive interjections is that they are used in outbursts of mostly negative speaker feelings like anger and irritation. In his influential 1997 Cambridge Encyclopedia of Language David Crystal sums up this view in the following manner: “The functions of swearing are complex. Most obviously, it is an outlet for frustration and pent-up emotion and a means of releasing nervous energy after a sudden shock” (Crystal 1997: 61).
The functions of expletive interjections in spoken English
When used in this way the expletive interjections reflect, or are usually thought to reflect, the speaker’s inner states and feelings. For this reason they have been referred to as pure interjections, a category which may be thought of as closely corresponding to that of response cries first suggested by Goffman (1978). The pure interjections - PI’s for short - express the speaker’s reaction to a range of stimuli that is in principle impossible to delimit. The stimuli are often thought of as being of a physical and easily observable nature, thus making it possible for those in the speaker’s presence to make deductions about his/her reasons for uttering them. What I wish to argue in the present paper is that while certain of the expletive interjections in my corpus may be interpreted as clear instances of PI’s, the majority of the expletive interjections are used to express speaker attitudes, to signal the orientation of a text, and to deliver different interactional signals. In short, my claim is that in many of their uses the expletive interjections should be regarded as belonging to a linguistic category that has variously been called pragmatic markers, pragmatic particles and discourse markers. The notion that expletive interjections – and, indeed, interjections in general – may be used for pragmatic purposes and should be included among the pragmatic markers is not uncontroversial. Many of the scholars involved in the study of pragmatic markers do not mention interjections at all. Others expressly deny that interjections should be admitted to that category, for instance Andersen (2001: 42). Yet a third group take a more kindly view of interjections, for example Aijmer, who claims that discourse particles include elements as varied as conjunctions (however), main clauses (I think), sentence adverbials (frankly), imperatives (look) and interjections (oh) (Aijmer 2002: 18). What then are the criteria for membership in the pragmatic marker category?
According to Brinton (1996: 33), pragmatic markers have at least the following characteristics: Pragmatic particles (1) constitute a heterogeneous set of forms which are difficult to place within a traditional word class (including items like ah, actually … I mean, I think, you know), (2) are predominantly a feature of spoken rather than written language, (3) are high-frequency items, (4) are stylistically stigmatized and negatively evaluated, (5) have little or no propositional meaning or are at least difficult to specify lexically, (6) occur either outside the syntactic structure or are attached to it and have no clear grammatical function, (7) are optional rather than obligatory features, (8) may be multifunctional operating on different levels (including textual and interpersonal levels). In my opinion, the expletive interjections satisfy all of these principles. They are definitely a heterogeneous group whose word class membership is often impossible to establish. It is true, however, that they have one factor in common, viz. the fact that, by a process of grammaticalization, they have developed from words denoting matters that are, or once were, taboo.
Magnus Ljung
As for the other criteria, the expletive interjections are certainly a feature of the spoken language and have high frequencies of occurrence; they are definitely stigmatized and negatively evaluated, they have little or no propositional meaning, they are optional rather than obligatory, and they tend to be multifunctional on different levels. Later scholars, like Erman (1998, 2001), Andersen (2001) and Aijmer (2002), who prefers the term “discourse particle” to “pragmatic particle”, have elaborated on Brinton’s principles for pragmatic particles and distinguish three broad types of pragmatic function, viz. subjectivity, interactivity and textuality. Individual pragmatic particles are typically associated with one of these functions but are usually also connected with the others: pragmatic particles are polyfunctional. Like Brinton’s earlier criteria, those of subjectivity, interactivity and textuality cause no problems for the expletive interjections. Let us take a look at the first one, subjectivity. What this term usually refers to in pragmatic texts is a number of speaker-related functions, in particular those conveying the speaker’s attitude to (the proposition underlying) the following utterance and those expressing the speaker’s epistemic stance towards that proposition. Example (1) is a straightforward case of a speaker using an expletive interjection to express his attitude to what he is saying: (1)
bloody hell look at that old codger behind the wheel (KB7 11226)
Here it would seem that although bloody hell in (1) is probably polyfunctional like most other pragmatic markers, its main function is to express the speaker’s surprise at the age of the driver. The wider context also confirms that this is the intended effect. But it is not always as easy as this to determine just what it is the expletive interjection is meant to express. In (2) for example a case can be made both for an attitudinal and an epistemic stance interpretation, something that reminds us of what we just said about the polyfunctionality of the pragmatic markers: (2)
Cor that was a proper macho man! (KBL 2438)
The mild interjection cor is often used to convey an attitude of surprise, both on its own and with regard to a following proposition. That may well be what it is doing in (2). However – like other clause initial interjections - cor also places a certain amount of emphasis on the following utterance. Emphasis may be interpreted in many different ways, but a likely interpretation in (2) is that the added emphasis is a way to insist on the veracity of the utterance: what the speaker is saying is that a proper macho man is a true description of the man in question. This leads to the conclusion that (2) therefore expresses both attitude and epistemic stance. Emphasis may also be used to strengthen the speech act force of certain utterances, in particular promises and predictions as in (3) and (4):
(3)
Bugger it I’m gonna pay this off! (KB2 1980)
(4)
I’m not playing this, bugger it! (KB7 6582)
Epistemic stance expressed by means of expletive interjections may also be negative and be expressed by means of a post-posed expletive interjection as in (5). (5)
A: Have you done it? B: Well, I’ve done some of it C: Have you fuck! (KBM 995)
The clearest examples of the second main pragmatic marker function – interactivity – are the feedback signals known as backchannels given by listeners to speakers to show that they are listening. At their most colourless, backchannels are mere acknowledgements like Mm, Mhm. However, as e.g. Stenström (1995: 82) demonstrates, backchannels do not have to be colourless but vary along a “feedback gradient” reflecting the listener’s degree of involvement in what the speaker is saying. As an example of such a gradient Stenström offers the series Mm – I see – Oh – Gosh – Really – My goodness – Hell, a series ranging from listener indifference to strong listener involvement. Examples (6) and (7) are examples from my corpus of backchannels expressing, respectively, mild and strong degrees of involvement on the listener’s part: (6)
A: They’ve got 14 lawns here.. B: Gosh! (KBK 6153)
(7)
A: She must be 37. B: Bloody hell! (KB1 3983)
The third main function usually attributed to the pragmatic markers is textuality. According to Andersen, textuality or the textual function “describes what the speaker perceives as the relation between sequentially arranged units of discourse” (Andersen 2001: 66), for example the use of Now as an indicator of the transition from one topic to another (cf. also Aijmer 2002: 6). A not unusual type of textual pragmatic meaning expressed by means of expletive interjections in my data is the use of preposed expletive interjections to indicate that what follows somehow exemplifies a previous claim (cf. Aijmer’s point that certain “pragmatic markers are used to mark an elaboration or clarification of the topic” (Aijmer 2002: 86)). This seems to be what is going on in example (8). (8)
A: Ange was saying she’s .. .she gets a bit funny, don’t she? B: Cor bloody hell she give I [sic] <pause> three questions the other day (KB6 2186)
160
Magnus Ljung
Apparently B regards the asking of the three questions as evidence that A is right in what she is saying about someone being “funny” and uses the expletive interjection Cor bloody hell to point this out.
3. The pure interjections
In the preceding section I have tried to show that in many of their uses, expletive interjections meet the same functional demands as the bona fide pragmatic markers and should therefore be admitted to the same category. I have also argued that the popular view of expletive interjection usage – that they are mostly psychological outlets for pent-up feelings of irritation and the like – accounts only for a certain type of interjection that I have called pure interjections. When pure interjections are used in real life, they may be more or less difficult to interpret. If they are triggered by some observable mishap like the accidental cutting of a finger or the breaking of a window, bystanders usually find it easy to construct an explanation for the uttering of the pure interjection by linking it to the mishap. But when pure interjections are caused by non-observable factors like the speaker’s own thoughts or feelings, or are triggered by physical mishaps not observable to others, they are much more difficult to interpret. The same kind of difficulty often arises when we try to interpret pure interjections in a corpus of spoken English, be it on tape or in transcription; since we cannot observe the factors that trigger them, we are reduced to more or less ingenious guesses about what is going on. However, it is only fair to note that we sometimes do get information about speech situations in the transcripts of the spoken component of the BNC. On such occasions, the text actually identifies the event that triggered the expletive interjection. Examples (9) and (10) are cases in point. (9)
Oh my god (KB6 347)
(10)
A: Again, this is it’s the same as this. Shit! (KBD 7410)
It is obvious that Oh my god! in (9) is a reaction to the child falling over. In (10) we can, I think, make a reasonably good case for interpreting the situation as one in which A is attempting to find out what the wall is made of, perhaps hoping for something solid. On finding that s/he is mistaken, s/he reacts by using the pure interjection Shit!. (Obviously, the very opposite may be the case – it may be the sameness that causes the speaker’s irritation). However, in most cases such direct explanations are missing, and we have to form an idea of the situation in which the utterance is made by studying its immediate context. (11) is a fairly typical example of such a guessing-game:
(11)
A: I’ll have the yellow ones. B: The yellow ones? A: Just <pause> oh bloody hell B: The yellow ones were thrown A: What do I do? (KB7 4705-4709)
Here A had clearly planned to use something s/he had counted on finding in a cupboard, and when s/he discovers that what s/he is looking for is no longer there, s/he gives vent to her/his disappointment by exclaiming Oh bloody hell! We never do find out what s/he was looking for, either from the preceding or the following text, but we can, I think, confidently characterize oh bloody hell in this context as a pure interjection expressing A’s disappointment on not finding whatever it is s/he is looking for. By engaging in detective work of this nature I eventually managed to identify a number of what I regard as convincing instances of pure interjections: of the 513 expletive interjections in my data, I reckoned that about one fifth belong to the pure interjections category and eventually put the total number of PI’s at 92. That leaves us with 421 expletive interjections which are not PI’s and should accordingly be amenable to pragmatic analysis.
4. A discourse-based analysis of the expletive interjections
In section 2, I discussed the well-established pragmatic notions of subjectivity, interactivity and textuality and gave a few examples of how these three notions might be used to provide pragmatic analyses of those expletive interjections that are not pure interjections. My approach was to study individual instances of interjection usage and try to assign plausible meanings to them. Necessary as it is, this approach needs to be combined with one that attempts to provide an account of the interplay between the meanings of the pragmatic particles and the different surroundings - syntactic and discourse-related - in which they occur. An obvious candidate for the job would be an analytic model operating in terms of turn-taking like the discourse-related approach developed in the mid-1990’s by Anna-Brita Stenström (cf. Stenström 1991 and 1994), which goes back to earlier ground-breaking work on discourse by John Sinclair and Malcolm Coulthard as presented in Sinclair and Coulthard (1975). In the remainder of this paper I will show what an analysis in terms of Stenström’s model – in broad outline – would look like and how it can be used to account for the functions of at least certain of the expletive interjections in the corpus. In Stenström’s model, communication operates in terms of turns, moves and acts. A turn is everything a speaker says before the next speaker takes over. Turns are realised by moves. A simple turn contains a single move, while a complex turn contains several moves. Moves are realised by acts, of which there is a bewildering array.
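The turn, move and act hierarchy just described can be pictured as a small nested data structure. The following is a minimal sketch only, not part of Stenström's model or of the original analysis: the class names Turn, Move and Act and the act labels "inform" and "react" are assumptions chosen for illustration, and the encoded exchange is example (6) from section 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Act:
    """Smallest unit of the hierarchy. The label is a free-text placeholder."""
    label: str
    text: str


@dataclass
class Move:
    """A move is realised by one or more acts."""
    acts: List[Act] = field(default_factory=list)


@dataclass
class Turn:
    """Everything one speaker says before the next speaker takes over."""
    speaker: str
    moves: List[Move] = field(default_factory=list)

    def is_simple(self) -> bool:
        # A simple turn contains a single move; a complex turn contains several.
        return len(self.moves) == 1


# Example (6) from section 2, encoded as two simple turns.
# The act labels "inform" and "react" are illustrative assumptions.
exchange = [
    Turn("A", [Move([Act("inform", "They've got 14 lawns here")])]),
    Turn("B", [Move([Act("react", "Gosh!")])]),
]

for turn in exchange:
    kind = "simple" if turn.is_simple() else "complex"
    print(turn.speaker, kind, [act.text for move in turn.moves for act in move.acts])
```

Keeping the three levels as separate classes means that a complex turn is simply a Turn holding more than one Move, which mirrors the definition given in the text.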
One of the key features in Stenström’s model is the distinction between gap fillers and slot fillers (Stenström 1994: 61-62). Gap fillers are turns of their own. Slot fillers, on the other hand, are merely part of a turn. Stenström (1994: 61) illustrates this distinction by contrasting examples (12) and (13). In (12), the exclamation Right! is a gap filler, making up the entire second turn in the exchange, while in (13) it is a slot filler placed before another slot filler, viz. the clause let’s look at the applications. (12)
A: It’s under H for Harry B: Right.
(13)
A: Well I went about a quarter to B: Right, let’s look at the applications
Gap fillers typically function as responses to a previous utterance and characteristically serve as second turns in two-turn exchanges as in example (12) or as third turns in interrogative exchanges like (14) (14)
A: Whose father died then? B: Celestian’s. A: Oh Christ! (KBH 1166-68)
Note how the responses in (12) and (14) differ in their degree of speaker involvement: Right in (12) is merely a backchannel informing A that B is listening, while the function of Oh Christ! in (14) is to express A’s reaction to the information s/he has received and possibly also to offer A’s sympathy to those affected by the father’s death. My examples of gap fillers so far have all been responses, and it is true that in the case of expletive interjections, there is a strong link between the two. However, “gap filler” is the overall term for any utterance making up a simple turn on its own. Thus the first turns in (12), (13) and (14) are all gap fillers serving as conversation initiators. Slot fillers display a variety of functions. A very common one among the expletive interjections is to express subjectivity - attitude, epistemic stance - with regard to a following (less often a preceding) utterance in the same turn: in fact my early examples (1) – (5) were all demonstrations of such slot filler functions. Like Stenström, I distinguish between several types of slot-fillers depending on where in the turn they occur, but unlike hers, my classification operates in syntactic rather than turn-based terms. I make a distinction between five types of slot filler positions: (1) immediately before a clause, (2) in the middle of a clause, (3) immediately after a clause, (4) immediately before a word or phrase and (5) immediately after a word or phrase. The above description of the five different types of slot fillers concludes my account of the different uses of expletive interjections that I have encountered in my corpus. Together with the gap fillers and the pure interjections that I have
already discussed, they make up a total of seven different functions for expletive interjections. Table 2 shows how the 513 expletive interjections in my corpus are distributed across these seven functions.
Table 2: Distribution of expletive interjections in Conv1M
Slot filler before a clause          226
Gap fillers                          116
Pure interjections                    92
Slot filler after a clause            40
Slot fillers before a word/phrase     30
Slot fillers after a word/phrase       7
Slot filler inside a clause            2
Total                                513
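The counts in Table 2 can be turned into shares of the 513 tokens with a few lines of code. This is a sketch for illustration only: the dictionary name TABLE_2 and the helper shares are assumptions introduced here, and the category labels are lower-cased versions of those in the table.

```python
# Counts copied from Table 2; the keys are lower-cased versions of its labels.
TABLE_2 = {
    "slot filler before a clause": 226,
    "gap fillers": 116,
    "pure interjections": 92,
    "slot filler after a clause": 40,
    "slot fillers before a word/phrase": 30,
    "slot fillers after a word/phrase": 7,
    "slot filler inside a clause": 2,
}


def shares(counts):
    """Return each category's percentage of the overall total."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


total = sum(TABLE_2.values())
percentages = shares(TABLE_2)

for label, n in TABLE_2.items():
    print(f"{label:<36}{n:>4}{percentages[label]:>7.1f}%")
print(f"{'total':<36}{total:>4}")

# Size of the largest category relative to the second largest.
ratio = TABLE_2["slot filler before a clause"] / TABLE_2["gap fillers"]
print(f"before-clause to gap-filler ratio: {ratio:.2f}")
```

Working from the raw counts rather than from rounded percentages keeps the shares consistent with the total of 513.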
The statistics in Table 2 show that there are great differences among the different uses of the expletive interjections in spoken English. There are three major uses: as slot fillers immediately before a clause, as gap fillers and as pure interjections. The first of these is by far the most important type, with almost twice as many members as its closest competitor, the gap fillers. With their 92 members, the pure interjections are obviously also a major category. Much further down the list, with 40 and 30 members respectively, we find another two expletive slot-filler positions: immediately after a clause and immediately before a word or a phrase. At the bottom there are the two very small categories: slot fillers following a word or phrase, with only seven members, and finally the use of expletives as slot fillers in the middle of a clause, of which there are only two instances. Below I will comment on all of them in turn, beginning with the slot fillers. We have already seen examples of slot fillers immediately before a clause, viz. (1) and (2), repeated below for convenience: (1)
bloody hell look at that old codger behind the wheel (KB7 11226)
(2)
Cor that was a proper macho man! (KBL 2438)
Other examples of the same type are (15) – (17): (15)
Christ that’s gonna be a thousand pounds. (KB1 1181)
(16)
Oh hell well he won’t have to bother, bother about a suit will he? (KB2 1870)
(17)
Just had a shower, cor feel a bit cold now (KB7 2870)
In my previous interpretation of (1) and (2) I claimed that it seemed reasonable to regard both as expressions of speaker subjectivity with regard to the content of the following clause or rather with regard to the content of the proposition underlying that clause. These are indeed plausible interpretations which can also
be given to (15), (16) and (17). On such a reading we would then claim that in (15) Christ expresses the speaker’s surprise and perhaps even irritation over the cost of something, that in (16) Oh hell adds extra emphasis to the claim he won’t have to bother, and that in (17) cor is used to express the speaker’s mild concern at feeling cold. What makes such interpretations somewhat problematic is the fact that we know nothing about the phonology of (1) and (2) and (15) – (17) for the simple reason that – unlike the London-Lund corpus - the spoken component of the BNC is not phonologically annotated. As a result we don’t know where the tone unit boundaries go in these examples and nor do we know anything about the intonation. (Apparently the punctuation and spelling in the transcript cannot be relied upon to reflect phonological detail). What are the consequences of the lack of phonological annotation? Well, if a phonological analysis were to reveal that there is no tone unit boundary separating bloody hell, cor, Christ and oh hell from the following clause in (1), (2) and (15) – (17), these interjections are not independent units representing moves of their own, but are part of the same move as the following clause or NP. If on the other hand they were followed by a tone unit boundary, they would constitute independent moves of their own expressing the speaker’s strong surprise and/or irritation. What difference does it make? If a collocation like bloody hell in (1) is not a tone unit of its own, does that mean that it no longer qualifies as a pragmatic marker? At least according to one pragmatics scholar, the answer seems to be that it does not. In her discussion of Oh! in Aijmer (2002: 108), the author argues that when followed by a tone unit marker, oh is a pragmatic marker carrying a strong meaning of surprise. 
When oh is not followed immediately by a tone unit boundary it loses much of its “surprise” meaning and is reduced to the role of intensifier of the following item(s) but is still regarded as a pragmatic marker. Let us turn now to the gap fillers. With its 116 instances this is the second largest discourse function in my data. Let us consider three new examples of this function, viz. (18), (19) and (20). (18)
A: I’ve got thirty in tens. B: Ah Jesus!
(19)
A: I’m driving, there’s this big bang, and the whole bonnet lit up. B: Oh God!
(20)
A: Double tennis court? B: Mhm. A: Gosh!
In all three examples above the expletive interjections have clearly interactive functions. As could be expected, they serve as acknowledgements of the information given in a previous turn, but at the same time they also express a reaction to that information. Take for instance the exchange between A and B in (18). A study of the conversation leading up to the exchange in (18) reveals that B
has asked A to give (lend?) her/him forty pounds and when it turns out that A has only thirty pounds, B utters Ah Jesus! as an expression of disappointment. (19) and (20), on the other hand, do not express much interest from the speakers. It is interesting to compare my findings concerning gap fillers with Stenström’s results in her 1991 study of the expletives in the LLC, viz. the London-Lund Corpus of Spoken English. Investigating what she called at the time “expletives as separate turns” i.e. gap fillers, she found that 58% were used as responses in second or third turns, and that the remaining 42% were used as “go/on signals”, viz. as feedback signals interrupting the speech of another speaker. While there are instances of such go/on signals in my corpus like for example (21), (21)
A: If you take B: Cor! A: the top off (KBP 1660)
such interruptive constructions are rare in my data. There is no obvious reason for this difference, but it may have to do both with the time at which the two corpora were created and with the kind of speakers involved. The LLC contains speech dating back to the 1960s and the 1970s, while the 26 texts in my data were all recorded in 1991 and 1992. It is possible, though perhaps not very likely, that the use of go-on signals has diminished in the time interval separating the two corpora. A more plausible explanation may be the difference between the speakers in the LLC and the BNC. The aim of the former was to represent educated adult British English and in fact most of the speakers are academics. The aim of the BNC was not to record only educated adult British English but to represent the entire gamut from “educated English” to uneducated and from teenagers to 70-year-olds. As a result, the speakers recorded in the BNC are not nearly as homogeneous as those in the LLC but differ from them both with regard to age and to social class. It seems to me that both the age difference and the social difference between the speakers enrolled in the two corpora may have had an effect on the use of go-on signals. The third largest group of expletives in Table 2 is the pure interjections. I have already pointed to the difficulties involved in finding plausible triggering factors for interjections in corpus data. However, occasionally another, theoretically more interesting interjection-linked difficulty turns up. What I am referring to are cases in which what seems to have started out as a genuine pure interjection is overheard by others and intrigues them to such an extent that they ask the speaker what is the matter. By doing that they change the nature of the original pure interjection, which has now become the first turn in an exchange. Example (22) shows how this may happen: (22)
A: Oh damn it. [Turn 1] (KBA 46) B: What? [Turn 2] A: This one doesn’t seem to want to come out [Turn 3]
In an analytic model based on turn taking, this is a non-problem: an initial utterance that is linked to another utterance in the way Oh damn it! and What? are linked in (22) is by definition the first turn in an exchange involving at least two - in the case of (22) three - turns. But if we forget for a moment the exigencies of a strict turn taking system, we realize that the key question here is what A’s intentions were. Did s/he intend her/his utterance Oh damn it! to be taken as the first move in an exchange, or did s/he just let it slip out without any communicative plans? Cases like these raise other - larger - questions, like the nature of self-talk and whether it makes sense to talk about pure interjections as communicative, both of them issues raised in an interesting paper by Erving Goffman from 1978. Next in the list in Table 2 we find two smaller categories. The first is made up of slot fillers appearing immediately after a clause, as in our old example (4), repeated below, and in the new example (23). (4)
I’m not playing this, bugger it! (KB7 6582)
(23)
Stop dribbling, for Christ’s sake! (KBL 33702)
Like many other examples in the BNC, (23) was uttered while the speaker was watching football on TV. The phrase for Christ’s sake alternates in this position with for God’s sake, for Pete’s sake and occasionally for fuck’s sake. All the pragmatic sake constructions seem to have developed a highly specialised function: they are used by speakers to emphasize the situational relevance of their own utterances (or of elements of these utterances). Stenström (1994) uses the term booster for this function, defining it as “the speaker’s assessment of what s/he says”. The slot filler position in the middle of a clause is used extremely seldom. One of the few examples I have found is (24), which admittedly could also be interpreted as a pure interjection: (24)
Right, Ann <pause> what wine <end of voice quality> <pause> oh God! <pause> is made in <pause> oh, Department of the Marne <end of voice quality> (KBD 7826)
Slot fillers before a word or phrase, on the other hand, are fairly common. There appear to be at least two ways of using the expletive interjection in such cases. In (25) and - in particular - in (26) the interjections are in all probability tone units of their own expressing the speaker’s feelings concerning the following NP. Thus in (25), Oh God is used to convey the speaker’s irritation with the neighbour’s cats. In (26) the wider context of the quote makes it clear that the speaker expresses surprise at the proposal to locate a night club in a certain street. Example (27), on the other hand, strikes me as another example of the merely intensifying use of an interjection that we noted in the discussion of the expletive interjections used as slot fillers before a following clause in (1), (2),
(15) – (17). (It is hard not to feel that the difference between (25) and (26) on the one hand, and (27) on the other, is in fact reflected in the punctuation here as in many other places, and that the role of punctuation in the BNC might be worth looking into.) (25)
Oh God, next door’s cats (KB8 8468)
(26)
Shit! Down Quinnan Street! (KBD 967)
(27)
Oh god yes. (KBP 4686)
There is one interesting case of a slot filler occurring just before an NP in what must be an example of an interjection used with a textual function, more precisely as an act of repair when the speaker realises that s/he has made a mistake and wants to put it right: it was not curtains that the speaker should have ordered, but curtain rails. (28)
No I haven’t ordered any curtains, cor … curtain rails (KBH 3898)
The position immediately after a word or phrase is not as common as that before a word or a phrase, but we do find examples like (29): (29)
Damn paint and stuff, cor strewth. (KBR 531)
The situation here is that the speaker and her/his interlocutor are visiting a building that is being redecorated. The speaker coughs and then exclaims Damn paint and stuff! adding cor strewth as a booster emphasizing the relevance of her/his exclamation.
5. The distribution of the individual interjections
In the preceding section I explored the different mostly discourse-based functions with which the expletive interjections in my data have been used. I will bring this paper to its conclusion with a brief presentation of the distribution of the individual interjections across these functional categories with a view to establishing whether the individual expletive interjections show any marked tendencies to differ in their choice of function. I present my findings in Table 3. However, before we discuss the results in the table, let me remind the reader that the labels represent all uses of the words involved, whether as single words or as part of a collocation.
Table 3: Distribution of the expletive interjections. PI: pure interjection, GF: gap filler, BC: before clause, MC: mid-clause, AC: after clause, BWP: before word/phrase, AWP: after word/phrase; % in brackets
LABEL   PI         GF          BC          MC       AC         BWP        AWP       TOT
Bugger  1 (7.1)    2 (14.2)    3 (21.4)    -        3 (21.4)   5 (36)     -         14
Christ  11 (31.4)  2 (5.7)     15 (42.9)   -        6 (17.1)   -          1 (2.8)   35
Cor     8 (9.75)   11 (13.4)   58 (70.7)   -        3 (3.7)    -          2 (2.4)   82
Damn    -          1 (25)      2 (50)      -        -          1 (25)     -         4
Fuck    -          2 (14.2)    6 (42.9)    -        1 (7.1)    4 (28.6)   1 (7.1)   14
Oh God  34 (19)    49 (27.3)   67 (37.4)   -        13 (7.3)   15 (8.4)   1 (0.55)  179
Gosh    1 (3.3)    8 (26.7)    19 (63.3)   -        1 (3.3)    1 (3.3)    -         30
Hell    25 (20.7)  31 (25.6)   48 (39.7)   2 (1.7)  10 (8.3)   3 (2.5)    2 (1.7)   121
Jesus   5 (31.2)   7 (43.7)    2 (12.5)    -        2 (12.5)   -          -         16
Shit    7 (38.9)   3 (16.7)    6 (33.4)    -        1 (5.6)    1 (5.6)    -         18
Total   92 (17.9)  116 (22.6)  226 (43.9)  2 (0.4)  40 (7.8)   30 (5.7)   7 (1.4)   513
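The row percentages in Table 3 can be recomputed from the raw counts, and each label's preference for a function can then be expressed as a difference in percentage points from the totals row. The sketch below is an illustration only and not part of the original study: it assumes the counts as printed in the table, and the helper names row_percentages, totals_row and deviation, as well as the cut-off MIN_TOTAL, are hypothetical. Recomputed values may differ slightly from the printed ones because of rounding.

```python
FUNCTIONS = ["PI", "GF", "BC", "MC", "AC", "BWP", "AWP"]

# Raw counts per label, in the column order of FUNCTIONS (copied from Table 3).
COUNTS = {
    "Bugger": [1, 2, 3, 0, 3, 5, 0],
    "Christ": [11, 2, 15, 0, 6, 0, 1],
    "Cor":    [8, 11, 58, 0, 3, 0, 2],
    "Damn":   [0, 1, 2, 0, 0, 1, 0],
    "Fuck":   [0, 2, 6, 0, 1, 4, 1],
    "Oh God": [34, 49, 67, 0, 13, 15, 1],
    "Gosh":   [1, 8, 19, 0, 1, 1, 0],
    "Hell":   [25, 31, 48, 2, 10, 3, 2],
    "Jesus":  [5, 7, 2, 0, 2, 0, 0],
    "Shit":   [7, 3, 6, 0, 1, 1, 0],
}


def row_percentages(row):
    """Percentage of a label's occurrences found in each function."""
    total = sum(row)
    return [100 * n / total for n in row]


def totals_row(counts):
    """Column sums across all labels, i.e. the 'Total' row of the table."""
    return [sum(column) for column in zip(*counts.values())]


def deviation(label, function, counts=COUNTS):
    """Percentage-point difference between a label and the totals row."""
    i = FUNCTIONS.index(function)
    own = row_percentages(counts[label])[i]
    overall = row_percentages(totals_row(counts))[i]
    return own - overall


MIN_TOTAL = 30  # hypothetical cut-off: skip labels with very few occurrences

for label, row in COUNTS.items():
    if sum(row) >= MIN_TOTAL:
        print(f"{label:<8}{deviation(label, 'BC'):+6.1f} percentage points (BC)")
```

Filtering on a minimum total reflects the caveat that percentages for very infrequent labels are of limited interest; the particular threshold is arbitrary.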
The statistics in Table 3 have been organized from left to right in order to make it possible to observe what percentage of the total number of occurrences each interjection devotes to the different functions. When the total number of occurrences is very low, this becomes a rather uninteresting exercise. However, with interjections with high total frequencies of occurrence, this method sometimes yields interesting information about the functional preferences of individual interjections. Given the information in Table 2, we should not be surprised to find that almost 44% of the totals fall in the slot-filling “before clause” category. By the same logic the fact that the gap fillers and the pure interjections end up in second and third position will hardly cause any raised eyebrows. What is more interesting is the way the “before clause” percentages for cor and gosh surpass the 43.9% in the totals row by a thumping 26.8 and 19.4 percentage points respectively. Cor has 70.7% of its 82 occurrences in that position while gosh has 63.3% of its 30 occurrences in the same slot, a distribution strongly suggesting that these two items have specialized as expletive clause-initial pragmatic markers. Another surprise may be found among the gap fillers, where Jesus has 43.7% in comparison with the 22.6% value in the totals row. But Jesus is a low-frequency item with a mere 16 occurrences and we may find it more rewarding to study the gap filler figures for genuinely high-frequency interjections like Oh God! and hell. Both of these have gap filler percentages clearly above the totals values for the category. A third set of items with deviant percentages is to be found in the column for pure interjections, but as membership in this category is more difficult to determine than for the other functions these findings should be taken with a certain amount of scepticism. For what it is worth, however, a study of the percentages in that column reveals that Shit, Christ and Jesus all have substantially higher percentage values than the expected 17.9% found in the totals row. In the case of shit almost 39% of its occurrences are pure interjections; the corresponding percentages for Christ and Jesus are 31.4% and 31.2% respectively.
6. Conclusion
The aim of the present paper has been to explore the functions of expletive interjections in spoken British English as they are used in a one-million-word sub-corpus from the spoken component of the BNC. The study focuses on ten common expletive interjections representing the semantic areas particularly associated with English expletives, viz. bodily waste, religion and sex. As Table 1 shows, the majority of the expletives are religious, in terms of both types and tokens.

The data was examined with a view to establishing in what ways the expletive interjections were actually used in conversation. It was found that they may be used in two distinct ways. In about 20% of the 513 utterances making up my data, the interjections are used merely to signal often involuntary speaker reactions to stimuli of various kinds, as for example in exclamations of pain, irritation, surprise etc. I refer to the interjections used in this manner as pure interjections. In the utterances making up the remaining 80% of the data, the expletive interjections were used to carry out the communicative functions of subjectivity, interactivity and textuality (see the discussion of examples (1) – (8)), functions strongly associated with the category of pragmatic markers. In addition, it turned out that all the expletive interjections in this category also satisfied the criteria for membership in the pragmatic marker category listed in Brinton (1996). These findings indicate that unless they are used as pure interjections, there is every reason to regard expletive interjections as pragmatic markers.

The expletive interjections were also subjected to a discourse-based analysis in terms of the distinction between gap fillers and slot fillers found in Stenström (1991, 1994). The analysis revealed that the majority of the interjections were used as slot fillers, in particular before clauses as in (1), where bloody hell expresses the speaker’s attitude to (the proposition underlying) the following clause:
(1) … bloody hell look at that old codger behind the wheel. (KB7 11226)
The second largest category was the interactive gap fillers, used as responses to the immediately preceding utterance, as for instance in example (7), where B uses the same expletive interjection as that found in (1) in response to A’s claim that somebody is 37 years old:

(7) A: She must be 37.
    B: Bloody hell!! (KB1 3983)
The final part of the study explored the distribution of the individual expletive interjections. It was found that certain of them have become highly specialized, for instance cor and gosh, both of which favour the slot filler position “before clause” (cf. Table 3).

References

Aijmer, K. (2002), English Discourse Particles: Evidence from a Corpus. Amsterdam & Philadelphia: John Benjamins.
Aijmer, K. (2004), ‘Interjections in a Contrastive Perspective’, in: E. Weigand (ed.) Emotion in Dialogic Interaction: Advances in the Complex. Current Issues in Linguistic Theory Vol. 240. Amsterdam & Philadelphia: John Benjamins. 99-120.
Andersen, G. (2001), Pragmatic Markers and Sociolinguistic Variation. A Relevance Theoretic Approach to the Language of Adolescents. Amsterdam & Philadelphia: John Benjamins.
Brinton, L.J. (1996), Pragmatic Markers in English: Grammaticalization and Discourse Functions. Berlin: Mouton de Gruyter.
Crystal, D. (1997), The Cambridge Encyclopedia of Language. Second edition. Cambridge: Cambridge University Press.
Erman, B. (1987), Pragmatic Expressions in English: a Study of you know, you see, and I mean in Face-to-face Conversation. Sweden: Almqvist & Wiksell International.
Erman, B. (2001), ‘Pragmatic markers revisited with a focus on you know in adult and adolescent talk’. Journal of Pragmatics 33: 1337-1359.
Goffman, E. (1978), ‘Response Cries’. Language 54: 787-815.
Hughes, G. (1998), Swearing: a Social History of Foul Language, Oaths and Profanity in English. Oxford: Blackwell.
Hughes, G. (2006), An Encyclopedia of Swearing. Armonk, N.Y.: Sharpe.
McEnery, T. and R.Z. Xiao (2003), ‘Fuck Revisited’. Corpus Linguistics 2003 28-31.
McEnery, T. and R.Z. Xiao (2004), ‘Swearing in Modern English: the case of Fuck in the BNC’. Language and Literature 13: 235-268.
McEnery, T. (2005), Swearing in English. Bad Language, Purity and Power from 1586 to the Present. Abingdon: Routledge.
Sinclair, J. and M. Coulthard (1975), Towards an Analysis of Discourse. The English Used by Teachers and Pupils. Oxford: Oxford University Press.
Stenström, A. (1990), ‘Lexical Items Peculiar to Spoken Discourse’, in: J. Svartvik (ed.) The London-Lund Corpus of Spoken English. Lund: Lund University Press.
Stenström, A. (1991), ‘Expletives in the London-Lund Corpus’, in: K. Aijmer and B. Altenberg (eds.) English Corpus Linguistics. Studies in Honour of Jan Svartvik. London & New York: Longman.
Stenström, A. (1994), An Introduction to Spoken Interaction. London & New York: Longman.
Svartvik, J. and R. Quirk (eds.) (1980), A Corpus of English Conversation. Lund Studies in English 56. Lund: Gleerup.
Svartvik, J. (1991), The London-Lund Corpus of Spoken English. Lund: Lund University Press.
Van Lancker, D. and J.L. Cummings (1999), ‘Expletives: Neurolinguistic and Neurobehavioural Perspectives on Swearing’. Brain Research Reviews 31: 83-104.
Change and constancy in linguistic change: How grammatical usage in written English evolved in the period 1931-1991

Geoffrey Leech and Nicholas Smith

Lancaster University and University of Salford, UK

Abstract

The creation of the Lanc-31 corpus (familiarly known as B-LOB - ‘Before LOB’) completes a trio of matching corpora of standard written British English (1931¹ – 1961 – 1991) on the model of the Brown corpus. The short-term history of English in the twentieth century can therefore now be examined using three equidistant broadly-sampled and comparable corpora of the written language, and it is possible to trace how far trends of change already observed in the comparison of LOB (1961) and F-LOB (1991) have themselves been undergoing change over the period in question. We will present in outline the recent history of a considerable range of grammatical features insofar as it can be learned from frequency counts from these three equivalently-sampled corpora. In many cases examined, the trend of increasing or decreasing frequency observed in the later period (1961-91) is found to be a continuation of a similar trend in the earlier period (1931-61).² In other cases there is change in the rate or direction of change. In other words, there is both constancy and change in the rate of change. We provide tentative explanations of these changes, where appropriate, in terms of grammaticalization, colloquialization, Americanization and densification. Comparable developments in American English, based on analysis of the equivalent Brown and Frown corpora, are traced for the 1961-92 period, and provide insight into the relation between the two regional varieties, mostly showing AmE trends to be in advance of those for BrE.
1. Introduction
The first part of our title may seem tautological, and needs explanation. What is meant, in a linguistic context, by ‘change in linguistic change’? To answer this, we first refer to a methodology that has, over the past decade or so, become quite an established way of studying short-term diachronic change. This is the use of comparable corpora – corpora equivalently sampled from the language, though different in temporal as well as geographical provenance – as a means of identifying rather precisely how the use of the language developed over a period. The Brown quartet of matching corpora (the four corpora of written standard English known as Brown, LOB, Frown and F-LOB) have been used in this way (see a range of publications, including Hundt 1997, Hundt and Mair 1999, Leech 2003, Leech and Smith 2006, Smith 2003, Leech et al. in press) as a means of tracking changes in frequency of use during the period between 1961 and 1991/2 in AmE and BrE. Such studies have focused largely on grammar, as the size of the corpora (c. one million words each), while too limited for lexical studies, is
particularly suitable for grammatical studies. For a given grammatical phenomenon, this methodology can establish not only significant changes in frequency (increase or decrease), but rates of change.³

We have now been able to use, in a provisional form, a further recently completed corpus (see Leech and Smith 2005), from 1931 (± 3 years), the Lanc-31 corpus familiarly known as B-LOB (‘before LOB’). This corpus, matching in every achievable respect the LOB and F-LOB corpora of British English, extends the comparable corpus methodology thirty years further into the past. ‘The Brown quartet’ of corpora now becomes ‘the Brown family’, encompassing three generations of language use. With this additional third sampling period, we have a trio of temporally equidistant corpora for BrE, doubling the length of the period for studying change in this way.

But, more than this, the new corpus enables us to identify alterations in the direction and rate of change across time. We may observe, for example, an increase (acceleration) or a decrease (deceleration) in the rate of change of frequency in the post-1961 period. Indeed we might find a change in the direction of change (from increase to decrease of frequency or vice versa), using three-point line charts of a kind that will be abundantly illustrated in this chapter. Some of the kinds of pattern we can now observe are shown in Figure 1.⁴
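The distinction drawn here between steady change, acceleration, deceleration and a change of direction can be illustrated with a small sketch. It is only one possible operationalization, not the authors' procedure: the function name `classify_trend`, the use of absolute differences between per-million-word frequencies, and the 10% tolerance for calling a change "steady" are all assumptions made for illustration.

```python
def classify_trend(f1, f2, f3, tolerance=0.10):
    """Label a three-point frequency series (e.g. 1931, 1961, 1991).

    f1, f2, f3 are frequencies at three equidistant sampling points.
    The tolerance for 'steady' is a hypothetical choice, not a value
    taken from the chapter.
    """
    first = f2 - f1    # change in the earlier period
    second = f3 - f2   # change in the later period
    if first == 0 or second == 0:
        return "no change in one period"
    if (first > 0) != (second > 0):
        return "change of direction"
    direction = "increase" if second > 0 else "decrease"
    # Compare the size of the later change with the earlier one.
    ratio = abs(second) / abs(first)
    if abs(ratio - 1) <= tolerance:
        return f"steady {direction}"
    pace = "acceleration" if ratio > 1 else "deceleration"
    return f"{direction}: {pace}"


# Invented frequencies, purely for illustration.
print(classify_trend(100, 200, 300))
print(classify_trend(900, 500, 400))
```

A relative measure (percentage change per period) would be an equally defensible basis for the comparison and could label some series differently.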
Figure 1: Some patterns of linguistic frequency change: B-LOB – LOB – F-LOB [four three-point line charts: (a) Steady increase; (b) Steady decrease; (c) Decrease: deceleration; (d) Increase: acceleration]

2. An example of 1930s English
To begin with, we present a sample from the 1931 corpus – as a reminder of how the written standard language has changed since then:
(1) Poached eggs require skill in the handling, and must be carefully denuded of water when removed from the pan, for nothing is more distasteful than water-sodden toast. (B-LOB F35, category F: Popular lore)
We can scarcely imagine a cookery book using such a seemingly ‘stilted’ style today. In addition to its general formality of lexical choice (e.g. require, denuded, distasteful), the passage contains three grammatical choices, highlighted with italics, which would be significantly less common today:

(a) Like nearly all modal auxiliaries, the modal auxiliary must illustrated here has declined; indeed, it has declined much more than most other modals (see Section 6).

(b) The passive voice, as in must be… denuded, has also declined markedly (see Section 4).

(c) For, as a conjunction of reason, has declined catastrophically since the 1930s (see Section 4).
3. Possible determinants of grammatical change
Before moving on to a more detailed account of changes of grammatical frequency, it is as well to consider the question: how is it that such changes have been taking place? Are there any general trends that can be observed? Although it may seem like putting the cart before the horse, we believe it will be helpful if we list a number of possible determinants of change right away, elaborating on them later.

(i) Colloquialization has been proposed as an explanation of many changes in our corpora (Mair 1998, Leech et al. in press, especially Chapter 11). New developments in a language, it seems, tend to arise in colloquial speech, and to make their way gradually into the written medium. The trend of written language acquiring habits of spoken language, although by no means general, is one that has been observed through corpus studies going back to the seventeenth century (Biber and Finegan 1989, 1997). Colloquialization can also have a negative side: decline in frequency of a structure strongly associated with formal or literary writing may be attributed to an avoidance of such structures due to colloquial influence. This can be part of the explanation of (b) above, the decline of the passive. Turning to (c), the conjunction for, these days, is rarely found in speech, and so part of its decline may again be a negative effect of colloquialization: forms strongly associated with the literary language may become ‘upstaged’ by the increasing prevalence of more colloquial forms.

(ii) Grammaticalization has been another such explanation, widely accepted in accounting for diachronic change in English (see, for example, Traugott and Hopper 2003). If must has declined so drastically, part of the reason could be that it is competing with verbal idioms such as have to, (have) got to – so-called ‘semi-modals’ whose emergence is a textbook example of grammaticalization (Krug 2000; Tagliamonte 2004).

(iii) Americanization? The evidence provided by the Brown family of corpora – especially the comparison between the British corpora (1961, 1991) and the American corpora (1961, 1992) – often shows AmE to be in the lead or to show a more extreme tendency, and BrE to be following in its wake. Thus must, in our data, has declined more in AmE than in BrE, and has become much rarer than have to and (have) got to in AmE conversational speech.⁵ Users of British English are familiar with lexical changes due to American influence, such as increasing use of movie(s) and guy(s), but grammatical changes from the same source are less noticeable. The question mark after ‘Americanization’ above, however, is a warning that a finding that AmE is ahead of BrE in a given frequency change does not necessarily imply direct transatlantic influence – it could simply be an ongoing change in both varieties where AmE is more advanced. If the term ‘Americanization’ is taken to imply direct influence of AmE on BrE, it should be treated with caution.

(iv) Densification is the tendency for the semantic content of written language to become more compactly expressed (see Biber 2003) – as shown, for example, in the frequency increase of noun + noun sequences and of s-genitives (see Section 7) that has been in progress for at least sixty years, and probably for much longer (see Leonard 1968, Rosenbach 2002, 2006). Strangely, this tendency seems to run counter to colloquialization, since a high frequency of nouns and low frequency of pronouns are strongly characteristic of written, not spoken language (Biber 1989).
Ultimately we can only speculate about the causes of such changes, but these ‘-izations’ have a preliminary explanatory value in showing how individual changes belong to a more general class of changes with apparently similar characteristics and motivations. We now consider these trends in more detail, and later briefly discuss the interconnections between these trends.

4. Colloquialization
The easiest and most canonical illustration of colloquialization is the increasing use of contractions (both verb and negative contractions) in written language over the sixty-year period. Restricting attention to negative contractions in -n’t, we begin by showing this trend in two separate line charts illustrating two different ways of measuring change of frequency. The first, or proportionate, method expresses the frequency of a feature’s occurrence as a percentage of its grammatically and semantically allowable occurrences. This is rather easy to do
in the case of negative contractions: we merely have to count the instances of -n’t and divide this figure by the instances of -n’t and of uncontracted not following a finite auxiliary or form of be; for example:

\[ \frac{f(\textit{don't})}{f(\textit{don't} + \textit{do not})} \]
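As a minimal sketch of the proportionate method just described, the calculation might look as follows. The function name and the counts in the usage lines are hypothetical and are not taken from the corpora; retrieving the counts themselves (contracted and uncontracted forms after a finite auxiliary or form of be) is assumed to have been done separately.

```python
def proportionate_frequency(contracted, uncontracted):
    """Percentage of allowable slots filled by the contracted form.

    contracted   -- count of the contracted variant (e.g. don't)
    uncontracted -- count of the uncontracted variant (e.g. do not)
    """
    total = contracted + uncontracted
    if total == 0:
        raise ValueError("no allowable occurrences to compare")
    return 100.0 * contracted / total


# Invented counts, for illustration only.
share = proportionate_frequency(contracted=300, uncontracted=700)
print(f"{share:.1f}% of not-negations are contracted")
```

The denominator is what makes this method demanding: it needs every allowable occurrence of the variable, not just the occurrences of the form of interest.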
Figure 2: Contracted form n’t in BrE as a proportion of all not-contractions [line chart, 1931 – 1961 – 1991; vertical axis 0%-70%; series: Press, Gen Prose, Learned, Fiction, Overall]

We see from Figure 2 that not-contractions have increased markedly since 1931, and that this increase has been steady and consistent in the two periods prior to and after 1961. To compare broad groupings of text genres, we have divided each corpus into four subcorpora we will refer to as

‘Press’          (text categories A-C)
‘General Prose’  (text categories D-H)
‘Learned’        (text category J)
‘Fiction’        (text categories K-R)
In comparing the subcorpora, it is no surprise to find that Fiction (the most speech-related subcorpus, because of its extensive inclusion of dialogue) shows by far the highest frequency of contractions, and that Learned (which is remote from speech, consisting chiefly of academic writing) shows the lowest frequency, close to 0%. The intermediate categories General Prose and Press come between Learned and the overall percentage for the whole corpus (shown by a thick black line), but of these two subcorpora, Press shows a sharper increase than General
Prose. The whole picture, however, is remarkably consistent: each subcorpus, even to a small degree the Learned category, shows a constantly increasing use of contractions.

The second way of measuring increase of frequency is the normalization method, which for us is simply to count occurrences per million words (pmw). This is close to the raw frequency count in each corpus, and might be considered a ‘rough and ready’ measure. However, our next chart (Figure 3) indicates a result closely similar to the more sophisticated proportionate frequency measure. In both Figure 2 and Figure 3, all four subcorpora show

(a) a steady and consistent increase,

(b) the same initial order of frequency (Fiction, Press, Gen. Prose, Learned),

(c) a divergence between Press and Gen. Prose, the former climbing more steeply than the latter, and moving from a position where contractions in Press are slightly less frequent than in Gen. Prose, to one where they are substantially more frequent.
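The normalization method mentioned above amounts to scaling a raw count by the size of the (sub)corpus it came from. The sketch below assumes a hypothetical function name and invented counts and subcorpus sizes; the actual word counts of the Brown-family subcorpora are not given in this section.

```python
def per_million_words(count, corpus_size):
    """Normalize a raw count to occurrences per million words (pmw)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Invented figures: 150 hits in a 160,000-word subcorpus.
print(per_million_words(150, 160_000))
```

Normalizing in this way is what allows subcorpora of different sizes, such as Learned and Fiction, to be plotted on the same scale.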
Figure 3: Contracted form n’t in BrE expressed as frequency per million words [line chart, B-LOB – LOB – F-LOB; series: Press, Gen Prose, Learned, Fiction, Overall]

In the rest of the chapter, therefore, we will rely on the normalized (pmw) method as the most convenient and often the only viable one. The main reason is that the proportionate method requires a clearly definable set of alternatives, the occurrences of which have to be counted to determine the non-occurrences of the feature being examined, and such a set of alternatives often cannot be easily defined, let alone easily retrieved from the corpora. Consider, for example, the modal must: alternatives to must should include not only semi-modals expressing
obligation/necessity, such as have to and (have) got to, but other means of expressing a similar meaning, such as main verbs (need, require), adverbs (necessarily, inevitably), adjectival constructions (it is necessary to; we are obliged to), and possibly the uses of need and necessity as nouns. And these are only a sample of the alternatives to must. There is no easy way of drawing a boundary line to identify ‘non-occurrences of must’.

From a negative viewpoint, colloquialization also seems to be the main explanation for a dramatic change in the frequency of relativization devices, especially between 1961 and 1991. The following charts show (a) a dramatic increase in the frequency of relative clauses introduced by that, and (b) a corresponding (though less dramatic) decrease in the frequency of relative clauses introduced by wh- pronouns (which and who/whom/whose).

Figure 4: Increasing use of that-relative clauses 1961-1991/2 in AmE (Brown → Frown) and BrE (LOB → F-LOB): frequencies pmw [line chart; series: AmE (est.), BrE]

Figure 4 compares the increases in that-relatives in BrE and AmE since 1961, showing that the American increase is even steeper than the British (the amount of hand-editing required to count that-relatives, as compared with wh-relatives, dissuaded us from looking at that-relatives in the B-LOB corpus). Figure 5, on the other hand, shows a steady decline in wh-relatives in BrE since 1931, while the figures for AmE (available only for 1961-91) again show a somewhat more extreme trend.
Figure 5: Wh-relatives in AmE and BrE (frequencies pmw) [line chart, 1931 – 1961 – 1991; series: AmE, BrE]

Analysis of the subcorpora indicates that, while relative clauses as a whole are strongly biased towards expository writing, wh-relatives are particularly associated with Learned texts, whereas that-relatives are more evenly spread in the corpora. In the Brown family of corpora, the wh-relatives (which of course have a virtual monopoly of non-restrictive clauses) are predominant throughout. Overall, in Brown the frequencies of wh-, that and zero relatives are in the proportion 68% : 21% : 11% (changing to 54% : 35% : 12% in Frown). In LOB, the proportions are 74% : 14% : 12% (changing to 70% : 17% : 13% in F-LOB). The predominance is huge in the LOB Learned texts, where the ratios are 84% : 11% : 5%. In LOB Fiction, however, these types are more evenly spread, with the proportions 53% : 22% : 25%, although the wh-relatives still remain in the majority. Comparing Learned, as the most formal-informative variety, with Fiction, the variety closest to speech, we note in the above comparisons that wh-relatives have a distribution contrasting with that-relatives, which have their strongest representation (in percentage terms) in Fiction writing.
Figure 6: Wh-relatives in BrE (frequencies pmw) [line chart, 1931 – 1961 – 1991; series: Press, Gen Prose, Learned, Fiction, Overall]

Figure 6 breaks down the decline of wh-relatives in Figure 5 into subcorpora, showing that wh-relatives have by far the lowest frequency in the Fiction subcorpus, which is in general closest to the spoken language, and hence likely to reflect colloquial influences.⁶ This lends plausibility to the view that the increasing trend to favour that-relatives and disfavour wh-relatives is an aspect of colloquialization. On the other hand, in AmE, it is a massive decline (-34.9%) of which alone in relative clauses that accounts for the overall decline of wh-pronouns, and this has much to do with prescriptive influences.⁷

Our remaining examples of the influence of colloquialization are again negative ones, showing the decline of formal features, some already noted in Section 2. Figures 7 and 8 display the already-noted declining frequency of the be-passive. Figure 7 shows AmE considerably ahead of BrE in the overall passive decline. Figure 8 breaks down the passive decline into subcorpora. Here the four subcorpora are in the opposite order to the order observed with contractions: Learned shows by far the highest frequency of passives, and Fiction shows the lowest. This is consistent with the view that the passive may have been declining because of disassociation with colloquial usage.⁸
Figure 7: The be-passive in AmE and BrE (frequencies pmw) [line chart, 1931 – 1961 – 1991; series: BrE, AmE]

Figure 8: The be-passive in BrE (frequencies pmw) [line chart, 1931 – 1961 – 1991; series: Press, Gen Prose, Learned, Fiction, Overall]

However, the picture is not quite so straightforward as it was with contractions. It seems that passives show no decline in General Prose, and indeed show an increasing trend in Learned, before the 1960s. This needs further investigation, but conceivably the ethos of dispassionate impersonality, encouraged in serious
writing and particularly in science, was still in the ascendant up to the middle decades of the century, but has since lost influence. An additional reason for a passive decline, probably increasing through the century, has been the hostility (especially in the US) of prescriptive forces – including usage gurus, house style manuals, crusaders in favour of ‘plain English’, and latterly, grammar checking software – all either overtly or covertly disparaging the use of the passive.⁹

Figure 9: Conjunction for in BrE (frequencies pmw) [line chart, 1931 – 1961 – 1991; series: Press, Gen Prose, Learned, Fiction, Overall]

Another previously-mentioned case for negative colloquialization is the conjunction for, illustrated by the following:

(2) A proprietary remedy should be used, for this is better than any homemade one. (B-LOB, E36)

(3) In the first place, the statement that a real crime is one about which the good citizen would feel guilty is surely circular. For how is the good citizen to be defined in this context unless as one who feels guilty about committing the crimes that Lord Devlin would classify as ‘real’. (F-LOB, G52)
The corpora attest some ambivalence about the status of for. It is not a fully-fledged subordinator: unlike its competitors because, as and since, it cannot precede the matrix clause. Example (3) illustrates its use as a sentence-initial form, more akin to a sentence adverb: a proportionately increasing tendency. The remarkable decrease of this conjunction from nearly 600 per million words (pmw) to little more than 200 shows an accelerating decline over the sixty years.
However, here the pattern of subcorpora suggests a somewhat different explanation from that of the passive. Fiction and General Prose are somewhat more retentive of for than the other subcorpora: a sign, perhaps, that for has become increasingly restricted to use in ‘literary’ contexts. It is noticeable that Press, usually in the vanguard of change (see Hundt and Mair 1999), shows the least use of for overall.

Yet a further plausible example of negative colloquialization is the similar fate of the preposition upon, which in many contexts can be used as a more formal or literary variant of on:

(4) But my ideal society would be based upon a certain fundamental personal and social equality. (B-LOB, G65)

(5) The reverse design is based upon the Combined Operations badge of WWII, but with modifications. (F-LOB, E25)

Figure 10: Preposition upon in BrE (frequencies pmw) [line chart, 1931 – 1961 – 1991; series: Press, Gen Prose, Learned, Fiction, Overall]

Like for (as a conjunction), upon undergoes a precipitous decline between the 1930s and the 1990s, but in this case most of the loss (in raw frequency terms) takes place in the first thirty years. The pattern of subcorpora is also different, indicating that, at least in the second thirty years, it is the Learned subcorpus that is more retentive of upon. But, as in the cases of the passive and for, Press shows itself the most ‘advanced’ subcorpus in its growing avoidance of this preposition.
5. Americanization
We have already observed more than one trend in which AmE leads while BrE follows some way behind. This has been seen in Figure 4 (the growing use of that-relatives), Figure 5 (the declining use of wh-relatives) and Figure 7 (the declining use of the passive). Other changing frequency trends dealt with in this chapter could also be cited: the increasing use of contractions, the declining use of modal auxiliaries (Figure 13), particularly of must (Figure 14), the increasing use of noun + noun sequences (Figure 22) and of genitives (Figure 20). All these show the characteristic pattern of AmE leading BrE in frequency change and (often) in the steepness of the frequency change. Only one example in this chapter illustrates the opposite phenomenon, whereby BrE leads AmE, and this is the case of semi-modals, on which we will comment in the next section.

To give a particularly compelling illustration of AmE ‘leadership’ in change, we turn to the case of infinitive complementation of the verb help. The contrast we are considering is that between help + to-infinitive and help + bare infinitive, as exemplified by (6) and (7):

(6) …he acts as a detective, helping to unravel the mystery. (F-LOB G09)

(7) Such fame helped unlock the coffers of the Treasury…. (F-LOB G38)

Figure 11 shows from 1931 an increasing use in BrE of the bare infinitive construction, which accelerates after 1961 following a trail already blazed by AmE. In 1931 this construction is apparently rare in BrE, the to-infinitive being much more common. After 1961, the to-infinitive in BrE declines, again following the AmE trend, so that by 1991 the bare infinitive has overtaken the to-infinitive as the more common construction. From 1961, BrE seems to follow almost slavishly the respective increase and decline of the two constructions in AmE.

It is notable that Figure 11 illustrates a rather rare pattern, where an increase in 1931-61 is followed by a decrease. This applies to the use of the help to construction in BrE. Admittedly there is no statistical significance with these relatively small numbers, but a change in the direction of frequency change may well be a signal of changing circumstances – perhaps in this case a sign of increasing American influence towards the end of the century. The credibility of this explanation is backed up by another example of a change of direction, that of the mandative subjunctive illustrated by (8):

(8) ‘Conditions have dictated that operations be scaled down…’ (F-LOB A38)
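The caveat above about statistical significance with small numbers can be made concrete. This section does not say which test the authors applied; the sketch below uses the log-likelihood (G²) statistic, which is widely used for comparing a feature's frequency across two corpora, purely as an assumed stand-in. The function name and all counts are hypothetical.

```python
import math


def log_likelihood(count1, size1, count2, size2):
    """G-squared for one feature observed in two corpora.

    count1, count2 -- raw frequencies of the feature
    size1, size2   -- corpus sizes in words
    """
    total = count1 + count2
    # Expected counts if the feature were equally frequent in both corpora.
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    g2 = 0.0
    for observed, expected in ((count1, expected1), (count2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Invented counts in two one-million-word corpora.
# With 1 degree of freedom, values above 3.84 correspond to p < 0.05.
print(round(log_likelihood(20, 1_000_000, 25, 1_000_000), 2))
print(round(log_likelihood(20, 1_000_000, 40, 1_000_000), 2))
```

With counts this small, a difference of a handful of occurrences stays well below the threshold, which is the situation the text describes for the help constructions and the mandative subjunctive.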
Figure 11: To- vs bare infinitives with help (all construction types) in AmE and BrE (frequencies pmw) [line chart, 1931 – 1961 – 1991; series: help + to-inf. (AmE), help + to-inf. (BrE), help + bare inf. (AmE), help + bare inf. (BrE)]

Figure 12: Mandative use of subjunctive and modal should in BrE (freqs. pmw) [line chart, 1931 – 1961 – 1991; series: subjunctive, should]
Again, the numbers are small and non-significant, but the change of direction in Figure 12 (this time from decrease to increase) is indicative of a revival of the mandative subjunctive in BrE which has been confirmed in other studies (notably that of Övergaard 1995). The upper line in Figure 12 represents the alternative construction of should-periphrasis (as in operations should be scaled down…), which has been declining in BrE, whereas the subjunctive, which is the dominant construction in AmE, has changed from a virtually terminal decline prior to 1961 to a small but appreciable increase. As the subjunctive in BrE is associated with a more formal register range than should-periphrasis, this change of direction goes against colloquialization, and the only credible explanation seems to be American influence.

6. Grammaticalization, modals and semi-modals
With modals and semi-modals, the three factors of grammaticalization, colloquialization, and Americanization appear to come into play together. Commonly attributed to grammaticalization, as already noted, is the progressively increasing frequency of semi-modals such as have to, be going to and want to (see Krug 2000). Overall, the core modals remain more or less level (although slightly decreasing) between B-LOB and LOB (1931-1961), but then a decline of around 10% sets in – see Figure 13.
Figure 13: Core modals in AmE and BrE (frequencies pmw) [line chart, 1931 – 1961 – 1991; series: BrE, AmE]
(The modals, for this purpose, comprise will, would, can, could, may, might, shall, should, ought to, need (+ bare infinitive), and the contracted forms ’ll, ’d, won’t, shouldn’t, etc.) As illustrated in Figure 13, AmE is further ahead than BrE in this trend, something that is more dramatic and noteworthy if we look at must alone:
Figure 14: Must in AmE and BrE (frequencies pmw)

In contrast, the semi-modals increase by about the same amount in 1961-1991. However, in the Brown family of corpora, core modals as a class are about 6.5 times more frequent than semi-modals in BrE in 1961, moderating to about 5.5 times as frequent in 1991 (see Leech et al. in press, Chapters 4 and 5). (The semi-modals included here are be able to, be going to, (be) supposed to, be to, (had) better, (have) got to, have to, need to, want to, and their contracted or reduced forms gonna, ’s to, (’d) better, ’ve got to, wanna, etc. The reasons for including be able to and need to as marginal members of this list are discussed in Leech et al. in press, Chapter 5.)

Here we have an apparent exception to the ‘rule’ that AmE leads and BrE follows. In Figure 15, AmE starts from a lower point in 1961 and ends at a lower point in 1991. But too much store should not be laid on this finding. The semi-modals are a diffuse group of verbs, which remains relatively rare (apart from have to) in written English even up to the 1990s. On the other hand, other corpora¹⁰ provide evidence of far greater frequency of semi-modals in spoken English, and especially in spoken American English. A study of equivalent spoken subcorpora extracted from the Diachronic Corpus of Present-day Spoken English (DCPSE)¹¹ suggests a strong growth (of nearly 37%) in semi-modal usage in British speech from the 1960s to the 1990s.
Figure 15: Selected semi-modals in AmE and BrE (frequencies pmw)

On pure frequency grounds, the Brown family provides little evidence that semi-modals are encroaching on the territory of the core modals. On the other hand, in spoken AmE there is much more potential for rivalry between modals and semi-modals (see fn. 5), and it is conceivable that the decline of core modals in written English since 1961 is an indirect reflection of this rivalry – an argument strengthened by the greater decrease in frequency of core modals in spoken English (according to our study of DCPSE) than in written English.

It is worth observing a possible synergy here between the trends of grammaticalization and colloquialization. As far as the semi-modals are concerned, grammaticalization is a phenomenon of the spoken language.¹² Semi-modals have yet to make huge inroads into the written language, but the increase of semi-modals (9.0% noted between LOB and F-LOB) can reasonably be attributed to colloquialization. Some semi-modals, including (have) got to and be going to, do not increase at all between LOB and F-LOB, and one explanation that suggests itself for this is that some kind of ‘prestige barrier’ discourages the use of these forms in published writing. (In particular, the avoidance of forms of get in the written language, a well-known taboo, might account for the low and even declining usage of (have) got to.) The lower frequency of semi-modals in the American written corpora might again be due to such a ‘prestige barrier’, which could well be more powerful on the west side of the Atlantic.
In the case of must and its semi-modal rivals, a more specific explanation suggests itself. We can contrast the decreasing use of must in Figures 14 and 16 with a big increase in the use of have to and need to (Figures 17 and 18).

Figure 16: Must in BrE (frequencies pmw)

Figure 17: Have to in AmE and BrE (frequencies pmw)
Figure 18: Need to in AmE and BrE (frequencies pmw)

In the case of must, Learned is the most retentive subcorpus and Press the least retentive – not a surprising result in view of the reputations of these varieties as respectively conservative and innovative (see Hundt and Mair 1999). The subcorpora here do not follow a consistent pattern, but overall the decline of must shows an accelerating trend in the more recent period. Have to and especially need to, in contrast, show sharply increasing frequency profiles. A late and largely unacknowledged newcomer to the class of semi-modals, need to has only recently started to ‘take off’ (see Smith 2003, Taeymans 2004), and is much less frequent than have to and must. It is the steepness of its increase, particularly in 1961-1991, that commands attention. The more frequent semi-modal have to, on the other hand, shows a greater increase in 1931-1961. The evidence provided by the three-point line charts suggests that have to began to approach the peak of its frequency climb earlier, and its increase was decelerating at the time when that of need to was accelerating.

We have argued elsewhere (Smith 2003, Leech 2003 – cf. also Myhill 1995) that the exceptionally steep decline in frequency of must is likely to have been influenced by attitudinal factors. The root use of must is associated with the speaker as the deontic centre, and hence with an authoritarian tone, particularly when combined with you and we as subjects:

(9) ‘Well, you can’t [go home]. You’re ill. The doctor says you must stay….’ (F-LOB K27)

(10) But to compete with the world we must adapt to the 21st century. Not the 19th. (F-LOB B14)

In contrast, have to and need to typically avoid the face-threatening tone of must. This is especially true of need to, which typically attributes the necessity or obligation to factors internal to the human protagonist, and hence stresses beneficial aspects of the constraint:

(11) Nevertheless the picture in the mind of western man seriously needs to be corrected. The Hungarian people are no longer poor or oppressed according to their standards. (LOB E22)

(12) We may all need to become more aware of how we use water, to learn the ways of managing and conserving supplies. (F-LOB F09)

Notice that if must were used in examples like (11) and (12), the implied obligation on the addressee would seem far more face-threatening.

7. Densification
To illustrate densification, we move from verbal to nominal constructions. Two of the most salient trends observed over the sixty-year period are the increase of noun+noun¹³ sequences and the increase of s-genitives. These can both be considered ways of achieving greater compactness of meaning in the noun phrase, a trend which has been particularly associated with Press writing since the early decades of the twentieth century, and which is found in its quintessential form in newspaper passages such as:

(13) the aviation and casino kingpin Kirk Kerkorian finally sold MGM’s film entertainment division to Pathe boss Giancarlo Parretti in November. (Frown A43)

(All underlined words in (13) are nouns.) Measuring compactness in terms of the number of words expressing a given concept, we notice that in the following, compared with a prepositional construction, the use of a noun + noun sequence or s-genitive + noun reduces the number of words by a third:

(14) (a) N1 of N2: the cigarette lighter of a car
     (b) N2’s N1: a car’s cigarette lighter (Brown, E04)
     (c) N2 N1: a car cigarette lighter
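The "reduction by a third" in (14) can be checked with a simple word count. The sketch below assumes that a word is any whitespace-separated token, so that a genitive form such as car's counts as one word; the function and variable names are illustrative only.

```python
def word_count(phrase: str) -> int:
    """Count whitespace-separated words; a genitive such as "car's" is one word."""
    return len(phrase.split())


variants = {
    "N1 of N2": "the cigarette lighter of a car",
    "N2's N1": "a car's cigarette lighter",
    "N2 N1": "a car cigarette lighter",
}

baseline = word_count(variants["N1 of N2"])
for label, phrase in variants.items():
    saving = (baseline - word_count(phrase)) / baseline
    print(f"{label}: {word_count(phrase)} words, saving {saving:.0%}")
```

Both compact variants use four words against six for the prepositional construction, which is the one-third saving stated above.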
Densification manifests itself in: (a) a decrease of the use of of (Figure 19), (b) a steep rise in s-genitives (Figure 20) and (c) a steep rise also in noun + noun sequences (Figure 21).
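How noun + noun sequences are counted affects the frequencies reported for them. Note 13 states the rule used here: the second noun must be a common noun, and a run of three or more nouns contributes one count per adjacent pair. The following sketch applies that rule to a hand-tagged fragment of example (13); the two-tag scheme (NN for common noun, NP for proper noun) and the tagging itself are simplifying assumptions for illustration, not the C8 tagset or the authors' retrieval procedure.

```python
# Simplified, hypothetical tags (not the C8 tagset): NN = common noun, NP = proper noun.
COMMON_NOUN = "NN"
NOUN_TAGS = {"NN", "NP"}


def count_noun_noun(tagged):
    """Count adjacent pairs whose first tag is any noun and whose second is a common noun."""
    return sum(
        1
        for (_, first), (_, second) in zip(tagged, tagged[1:])
        if first in NOUN_TAGS and second == COMMON_NOUN
    )


# Hand-tagged fragment of example (13); the tagging is an illustration only.
fragment = [
    ("film", "NN"), ("entertainment", "NN"), ("division", "NN"),
    ("to", "IN"), ("Pathe", "NP"), ("boss", "NN"),
    ("Giancarlo", "NP"), ("Parretti", "NP"),
]
print(count_noun_noun(fragment))  # 3
```

The three pairs counted are film entertainment, entertainment division and Pathe boss; boss Giancarlo and Giancarlo Parretti are skipped because their second element is a proper noun.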
Figure 19: Preposition of in BrE (frequencies pmw)

Figure 20: S-genitive in BrE (frequencies pmw)
Figure 21: Noun+noun sequences in BrE (frequencies pmw)

The decreasing use of of as shown in Figure 19 is, of course, only a rough and ready indication of what is happening in the noun phrase. Most ofs occur in the noun phrase, but only a minority of them fall into the category of of-genitives, where the s-genitive construction could be substituted. We examined a randomized 2% sample of the ofs in LOB and F-LOB, and discovered that the frequency loss of the of-genitive, based on this limited sample, was 24%, and therefore commensurate with the increase of the s-genitive over the same 1961-91 period. That period also showed a loss of 5% of prepositions as a whole, but the loss of of represented in Figure 19 was greater than this overall prepositional loss. It is reasonable to speculate that the decreasing use of of shown in Figure 19 is in part due to a tendency to switch from of-genitives to s-genitives over the period in question.

Although the increase in noun + noun sequences over the period has been highly significant in all subcorpora, it is the increase in s-genitives that invites most scrutiny. The remarkable rise in frequency over sixty years in Press emphasises that it is journalistic writing that above all spearheads this change. The subcorpus that might be thought to be the most natural home for the s-genitive is Fiction, because of the supposed restriction of this construction to human nouns. But Fiction conspicuously lacks the sharp increase of s-genitive frequency in Figure 20, and so the chart could lend itself to another suggestion: that the increase is not due to an increase of human nouns (which would predominate in Fiction texts), but to the extension of the genitive to other ‘quasi-human’ or inanimate categories of noun. However, studies focusing on the
genitive (Rosenbach 2002, Hinrichs and Szmrecsanyi 2007) have failed to detect any categorical association between increasing use of the s-genitive and extension of its use to a broader range of nominal categories. There is no doubt that increasing use of the genitive with inanimate possessors is one of the major factors of change, particularly in AmE, but as Rosenbach puts it (2002: 271) ‘the choice between the two genitive variants … is not a matter of categoriality but of preference’.

Part of the explanation for the meteoric rise in the use of the s-genitive, as well as for the similar growth in the use of noun + noun sequences, appears to be that genres of expository writing, above all Press, are moving towards a more densely information-packed style in the use of noun phrases (cf. Biber and Clark 2002). Bearing in mind that English-speaking society has been increasingly dominated by mass media and the information ‘overload’ of recent decades, this is not an implausible explanation, and would explain why the increasing densification trend is strongest in the Press (see Figures 20 and 21). From Figure 22 it is apparent that AmE has led BrE in this increasing use of noun + noun sequences.

Figure 22: Noun+noun sequences in AmE and BrE (frequencies pmw)

8. Conclusion: the Constancy or Inconstancy of Change
Of the four processes put forward as explanatory hypotheses, three (grammaticalization, Americanization and colloquialization) can often be seen as cooperative trends. For example, the increasing use of semi-modals suggests a synergy of grammaticalization – accounting for increasing use of semi-modals in speech – colloquialization – leading to a similar, though less pronounced, development in writing – and Americanization – proposing that the more extreme trend in AmE, especially AmE speech, is being increasingly followed in BrE. The fourth hypothesis, however – densification – refers to a development primarily associated with the written language, and if anything ‘anti-colloquial’ in its effects. Further research will be needed to give more substance to these claims.

The three-point line charts have the potential to show how change changes. But if anything, it is constancy of change that is most evident in the three-point line charts we have presented. That is, of the patterns illustrated in Figure 1, it is the patterns (a) and (b), which show little or no alteration in the increase or decrease of frequency, that are most noticeable. This is also the picture often shown by subcorpora – a further confirmation of the impression that in many cases the increase or decrease of frequency between 1961 and 1991 is simply a continuation of a trend already in progress over the previous thirty years. On the other hand, where there has been acceleration or deceleration – patterns (c) and (d) of Figure 1 – it is tempting to speculate on the alteration of rate of change. All four possible patterns are observed, and are exemplified below:

(a) Decelerating decrease: wh-relative clauses (Figure 5)
(b) Accelerating decrease: the passive (Figure 7)
(c) Decelerating increase: have to (Figure 17)
(d) Accelerating increase: need to (Figure 18)

In addition, we see both kinds of change of direction:

(e) Increase followed by decrease: help to + infinitive (Figure 11)
(f) Decrease followed by increase: mandative subjunctive (Figure 12)
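The distinctions drawn in the list above can be made explicit by comparing the change in the first thirty-year interval with the change in the second. The sketch below is one possible operationalization, not the procedure used for the charts: the function name, the 10% tolerance that separates "constant" from accelerating or decelerating change, and the frequencies in the usage lines are all hypothetical.

```python
def classify_trend(f1: float, f2: float, f3: float, tolerance: float = 0.1) -> str:
    """Label a three-point frequency series measured at equally spaced dates."""
    first = f2 - f1   # change over the earlier interval
    second = f3 - f2  # change over the later interval
    if first == 0 or second == 0:
        return "no change in at least one interval"
    if first > 0 > second:
        return "increase followed by decrease"
    if first < 0 < second:
        return "decrease followed by increase"
    direction = "increase" if first > 0 else "decrease"
    ratio = abs(second) / abs(first)
    if ratio > 1 + tolerance:
        return f"accelerating {direction}"
    if ratio < 1 - tolerance:
        return f"decelerating {direction}"
    return f"constant {direction}"


# Hypothetical frequencies pmw for three sampling points (not data from the charts).
print(classify_trend(100, 150, 250))    # accelerating increase
print(classify_trend(1000, 900, 850))   # decelerating decrease
```

Comparing absolute differences is only meaningful here because the two intervals are of equal length; a comparison of percentage changes would be an equally defensible choice and can classify borderline cases differently.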
Of these, (a) lacks any obvious explanation, (c) and (d) might possibly be explained as different phases in the development of semi-modals as a consequence of grammaticalization, and the remaining cases seem to fit a pattern of increasing American influence in the 1961-91 period. Thus (b) reflects a steeper decline of the passive in AmE possibly due to prescriptivism; and in (e) and (f) the change of direction shows a convergence towards American preferences. Although the provisional nature of these findings counsels caution, the addition of the 1931 corpus to the ‘Brown family’ brings new precision, depth and insight to our understanding of the recent grammatical history of English.

Notes

1 For practical reasons, the 1931 corpus is actually sampled from the period 1928-34, centring on the year 1931. The corpus is in a provisional state, but is likely to change little. It is to be released within the next two years. A fourth matching corpus (centring on the year 1901) is now nearing completion at Lancaster. For convenience, we refer to the periods covered by the corpora as 1931-1961-1991, although 1931 (as just explained) is something of an idealization, and 1991 (the date of the F-LOB corpus) is slightly inaccurate for the corresponding AmE corpus, Frown, whose texts date from 1992.

2 We are indebted to Marianne Hundt and Christian Mair for their collaboration in research on this trio of corpora; also to Paul Rayson for collaboration and support in the development of the B-LOB corpus. Financial support for the compilation of this corpus was generously provided by the Leverhulme Trust, in the form of an Emeritus Fellowship.
3 For example, according to the F-LOB and LOB corpora, the frequency of modal auxiliaries in written British English has declined by 9.6% in thirty years: a finding which is significant at a very high level (p < .001).

4 Statistical significance levels (p < .05, p < .01 and p < .001) are not cited in this chapter, although most increases and decreases of frequency shown in the charts are significant at least at the p < .05 level. Where they are not significant, this will be stated in the text.

5 In Leech et al. (in press), Chapter 5, Tables 5.2 and 5.3, we report relative frequencies of modals and semi-modals in two large and comparably sampled corpora of conversation from the 1990s: the Longman Corpus of Spoken American English and the demographic subcorpus of the British National Corpus. We found that in AmE conversation, must was less than half as frequent as need to and (have) got to; and less than one-eighth as frequent as have to. In BrE conversation, the differences were less extreme, but still must was less than half as frequent as have to, and also appreciably less frequent than (have) got to. (Instances transcribed gotta were counted with (have) got to.)

6 This is confirmed, in proportionate terms, by a frequency comparison of relative clause types across genres, including fiction, speech, and academic writing, in Biber et al. (1999: 610-611).

7 In the US it has been a widely-held prescriptive rule that that should be used instead of which in introducing restrictive relative clauses. See Arnold Zwicky’s ‘Language Log’ posting, July 4, 2005, on ‘the sacred That rule’: http://itre.cis.upenn.edu/~myl/languagelog/archives/002291.html#more

8 See Seoane and Loureiro-Porto (2005) and Seoane and Williams (2006) on colloquialization and the declining use of passives in scientific English.

9 The disparagement of the passive in favour of the active can be found in many influential style guides and the like – see, for example, Strunk and White (2000: 18).

10 Approximate figures from the AmE and BrE conversation corpora discussed in note 5 yield the following ratios: in AmE, 1.79 modals per semi-modal; in BrE, 2.79 modals per semi-modal. These contrast with the ratios for written language in the 1990s corpora Frown and F-LOB: 5.3 and 5.5 modals per semi-modal respectively. The frequency gap between modals and semi-modals is obviously far narrower in speech (especially American speech) than in writing.

11 We are grateful to Bas Aarts for making a copy of this corpus available to us before its official release, to allow us to make comparisons of the kind discussed here. As the earlier part and the later part of the DCPSE were collected under different conditions, and samples from the 1960s and 1990s were not strictly comparable, we extracted ‘mini-subcorpora’ of approximately 140,000 words from the periods 1958-69 and 1990-1992 with the aim of achieving as nearly as possible the comparability of corpora from the 1960s and 1990s, as found in the Brown family. The mini-corpora were clearly too small to provide more than tentative results, although the differences in frequency between the earlier and later corpora, with respect to modals and semi-modals, were highly significant: a loss of 11.8% in the case of modals, and a gain of 36.8% in the case of semi-modals (see Leech et al., in press, Chapters 4 and 5).

12 This is not only because the semi-modals as a class are much more frequent in the spoken language, but also because the phonetic reduction of the semi-modals, which is a salient indicator of the ongoing grammaticalization process, appears only in the spoken language. The reduced written forms gonna, wanna, gotta, better, supposed to (the last two lacking the finite auxiliary) are frequent in transcriptions of speech – for example in the BNC and the LCSAE – but are rare in the Brown family, even in representations of spoken dialogue. The occurrence of these transcribed reduced forms can be taken in general as an indicator of phonetic reduction, although in any given case the transcriber’s subjective impression is involved in the choice of a standard or non-standard spelling.
13 Strictly, ‘noun + noun’ means ‘noun + common noun’. We excluded strings where the second noun was a proper noun, as this would have included personal names such as Hillary Clinton and Gordon Brown, as well as place names spelt as two words (such as Los Angeles and New York), which, according to the tagging conventions of the C8 tagset, are tagged as a sequence of two proper nouns. The frequency count for ‘noun + common noun’ sequences included multiple counts where a sequence of three or more nouns occurred. For example, the sequence film entertainment division in example (13) would count as two noun + noun sequences: film entertainment and entertainment division.

References

Biber, D. (1989), Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, D. (2003), ‘Compressed noun-phrase structures in newspaper discourse: the competing demands of popularization vs. economy’, in: J. Aitchison and D.M. Lewis (eds.), New media language. London: Routledge, pp. 169-181.
Biber, D. and E. Finegan (1989), ‘Drift and evolution of English style: A history of three genres’. Language 65(3), 487-517.
Biber, D. and E. Finegan (1997), ‘Diachronic relations among speech-based and written registers in English’, in: T. Nevalainen and L. Kahlas-Tarkka (eds.), To Explain the Present: Studies in the Changing English Language in Honour of Matti Rissanen. Helsinki: Mémoires de la Société Néophilologique de Helsinki, pp. 253-75.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman Grammar of Spoken and Written English. London: Longman.
Biber, D. and V. Clark (2002), ‘Historical shifts in modification patterns with complex noun phrase structures’, in: T. Fanego, M. López-Couso and J. Pérez-Guerra (eds.), English Historical Morphology: Selected Papers from 11 ICEHL, Santiago de Compostela, 7-11 September, 2000. Amsterdam: Benjamins, pp. 43-66.
Facchinetti, R., M. Krug and F. Palmer (eds.) (2003), Modality in Contemporary English. Berlin: Mouton de Gruyter.
Hinrichs, L. and B. Szmrecsanyi (2007), ‘Recent changes in the function and frequency of Standard English genitive constructions: A multivariate analysis of tagged corpora’. English Language and Linguistics 11(3), 335-378.
Hopper, P.J. and E. Closs Traugott (2003), Grammaticalization. Cambridge: Cambridge University Press.
Hundt, M. (1997), ‘Has BrE been catching up with AmE over the past thirty years?’, in: M. Ljung (ed.), Corpus-based Studies in English: Papers from the 17th International Conference on English Language Research on Computerised Corpora (ICAME 17), Stockholm, May 15-19, 1996. Amsterdam: Rodopi, pp. 135-151.
Hundt, M. and C. Mair (1999), ‘“Agile” and “Uptight” Genres: The Corpus-Based Approach to Language Change in Progress’. International Journal of Corpus Linguistics 4, 221-242.
Krug, M. (2000), Emerging English modals: a corpus-based study of grammaticalization. Berlin and New York: Mouton de Gruyter.
Leech, G. (2003), ‘Modality on the move: The English modal auxiliaries 1961-1992’, in: Facchinetti et al., pp. 223-40.
Leech, G. and N. Smith (2006), ‘Recent grammatical change in written English 1961-1992’, in: A. Renouf and A. Kehoe (eds.), The Changing Face of Corpus Linguistics. Amsterdam: Rodopi, pp. 185-204.
Leech, G., M. Hundt, C. Mair and N. Smith (in press), Contemporary Change in English: A Grammatical Study. Cambridge: Cambridge University Press.
Leonard, R. (1968), The Types and Currency of Noun + Noun Sequences in Prose Usage 1750-1950. Unpublished MPhil thesis, University of London.
Mair, C. (1998), ‘Corpora and the study of major varieties of English: Issues and results’, in: H. Lindquist, S. Klintborg, M. Levin and M. Estling (eds.), The major varieties of English: Papers from MAVEN 97. Växjö: Acta Wexionensia, pp. 139-157.
Mair, C. (2006), Twentieth Century English: History, Variation and Standardization. Cambridge: Cambridge University Press.
Myhill, J. (1995), ‘Change and continuity in the functions of the American English modals’. Linguistics: An Interdisciplinary Journal of the Language Sciences 33, 157-211.
Övergaard, G. (1995), The Mandative Subjunctive in American and British English in the 20th Century (Studia Anglistica Upsaliensia 94). Uppsala: Acta Universitatis Upsaliensis.
Rosenbach, A. (2002), Genitive variation in English: conceptual factors in synchronic and diachronic studies. Berlin: Mouton de Gruyter.
Rosenbach, A. (2006), ‘On the track of noun+noun constructions in Modern English’, in: C. Houswitschka, G. Knappe and A. Müller (eds.), Anglistentag 2005 Bamberg. Proceedings. Trier: Wissenschaftlicher Verlag, pp. 543-557.
Seoane, E. and C. Williams (2006), ‘Changing the rules: A comparison of recent trends in English in academic scientific discourse and prescriptive legal discourse’, in: M. Dossena and I. Taavitsainen (eds.), Diachronic Perspectives on Domain-Specific English. Bern: Peter Lang, pp. 255-276.
Seoane, E. and L. Loureiro-Porto (2005), ‘On the colloquialization of scientific British and American English’. ESP Across Cultures 2, 106-118.
Smith, N. (2003), ‘Changes in the modals and semi-modals of strong obligation and epistemic necessity in recent British English’, in: Facchinetti et al., pp. 241-266.
Strunk, W., Jr and E.B. White (2000), The Elements of Style. London and New York: Longman.
Tagliamonte, S.A. (2004), ‘Have to, gotta, must: Grammaticalization, variation and specialization in English deontic modality’, in: H. Lindqvist and C. Mair (eds.), Corpus Research on Grammaticalization in English. Amsterdam: John Benjamins, pp. 33-55.
Joseph Wright’s ‘English Dialect Dictionary’ in electronic form: A critical discussion of selected lexicographic parameters and query options

Alexander Onysko, Manfred Markus and Reinhard Heuberger
University of Innsbruck¹

Abstract

The digitised version of Joseph Wright’s English Dialect Dictionary (EDD, 1896–1905) promises to be a lexicographic milestone for English dialect terms and phrases of the 18th and 19th centuries. In a research project in the English Department at the University of Innsbruck, the c.5000 pages of the dictionary have been transferred into machine-readable text and parsed. Our aim is to produce an online version of the dictionary for research on the history of spoken and dialectal Late Modern English. The paper demonstrates the complexity of the entries in the EDD and focuses on the questions of dialect attribution and of the definition of words and phrases as two cases in point. Beyond that, we will provide a survey of the search interface and specifically discuss the implementation of the two issues of dialect area and definition.
1. Introduction
More than 100 years after its first publication, Joseph Wright’s English Dialect Dictionary (EDD) remains the most comprehensive and reliable lexicographic work on English dialects, also going beyond the Oxford English Dictionary (OED) in its treatment of dialectal forms.² The dictionary was published in 6 volumes between 1898 and 1905 and, according to the editor, it “includes, so far as possible the complete vocabulary of all dialect words which are still in use or known to have been in use at any time during the last two hundred years in England, Scotland and Wales” (Wright Vol.1 1898: v.). In order to draw a comprehensive picture of dialect terms in the period from about 1700 to the time of publication, Wright used a plethora of sources. He gathered information from 274 correspondents, 90 printed dialect glossaries (many compiled by individual members of the English Dialect Society) and 342 unprinted collections of dialect words. Furthermore, Wright incorporated dialect terms from various reference works, literary sources, magazines and press publications, as well as folkloristic sources such as songs and games, which, altogether, amount to more than 2000 bibliographic entries. Overall, the six volumes of the EDD feature close to 65,000 entries of dialect words including further thousands of variant forms, derivations, compounds and phrases, the majority of which are meticulously labelled according to their use in various counties, regions and nations.

In spite of Wright’s detail-oriented manner of compilation, the analysis of the dictionary entries has shown that he employed multiple, sometimes hardly accessible ways of indicating the areal distribution of dialect terms. The semantic information in the EDD is similarly complex, with features of meaning being included outside the definition proper. Apart from discussing these selected problems, the paper will also provide an overview of the complexity of the planned online search interface, specifically focussing on the retrieval of dialect areas and semantic information.

2. The parameters of the entries
First of all, a few introductory remarks have to be made about the lexicographic structure of the EDD. There are eight basic recurrent units in the dictionary entries: (1) headwords, (2) parts of speech, (3) labels, (4) counties, regions and nations (dialect areas), (5) phonetic transcription and (6) definitions or meaning(s), which are often exemplified by a large number of (7) citations; optionally, Wright provides (8) comments, particularly on etymology, at the end of the entries. Figure 1 - for the sake of brevity only a simple example - illustrates the role of the eight fields (cf. Markus and Heuberger 2007).

(1) AFTER-DAMP, (2) sb. (3) Tech. (4) Nhb. Dur, w.Yks. (5) [aftə-damp.] (6) The noxious gas resulting from a colliery explosion (Wedgwood).
(7) Nhb. & Dur. After-damp, carbonic acid, stythe. The products of the combustion of firedamp, NICHOLSON Coal Tr. Gl. (1888). Nhb.¹ After-damp, the noxious gas resulting from a colliery explosion. This after-damp is called choak-damp and surfeit by the colliers, and is the carbonic acid gas of chymists, HODGSON A Description of Felling Colliery. w.Yks. The after-damp completed their death, N. & Q. (1876) 5th S. v. 325. Miners’ tech. Carbonic acid gas, or choke damp, which the miners call after-damp, CORE (1886) 228.
(8) [After + damp, q.v.; cp. choak-damp.]

Figure 1: Standard entry structure in Wright’s EDD

While the first five parameters are relatively homogeneous in layout, definition has turned out to be rather variable: sometimes it is appended to the paragraph of the lemma, as in Figure 1, but in polysemous entries it is split into separate paragraphs. This quality of the definition as a “trouble maker” has prompted us to insert a caesura after the phonetic transcription and to subdivide the structure of the entries into entry heads and entry bodies. The head comprises the first five
units, i.e. from headwords to phonetic transcription, while the body consists of the other three units: definitions, citations and comments. In our work so far, we have focussed on correct parsing. It has soon become obvious that the main entry parts contain various sub-parameters which may be of interest to the user of the electronic version of the EDD, such as spelling variants, compounds, derivations, phrases and sources. However, the two basic pieces of information that almost every entry offers are dialect areas and definition. This paper will, therefore, use these two fields to demonstrate the complexity and wealth of information provided by Wright’s dialectal treasure house.

3. Dialect attribution³
One has to distinguish between general dialect attribution, which is given in the form of dialect labels, and specific information which can be inferred from the so-called source codes or correspondent codes. Both types occur very frequently in the EDD and need to be considered equally for a thorough philological investigation of the dictionary.

3.1 General dialect attribution: dialect labels
The general dialect labels usually appear as abbreviations in the entry head (cf. Figure 1). They signify the area where a headword is used. The prominent location of the labels in the entry head marks this information as relevant for the lemma as a whole. Dialect labels occur on three different levels, i.e. counties, regions and nations. Whenever Wright had precise information about the spatial distribution of a certain dialect term, he listed all counties in which that term was used. In cases where his sources were less precise or when there were too many counties to mention, he preferred to indicate the regions or the nations in which the usage occurred. The three levels of dialect labels are briefly explained in the following. 3.1.1 County labels Wright used 126 county labels in the EDD, 42 of which pertain to England. Scotland (39), Ireland4 (32) and Wales (13) are classified accordingly. Counties are always indicated by means of three- or (occasionally) four-letter codes, e.g. Bnff. (Banffshire / Scotland). As mentioned, they usually occur in the entry head between the usage labels and the phonetic transcriptions, but one of the peculiarities of the dictionary is that they can also be found in the entry body, typically in front of the citations (cf. chapter 3.3.). The various counties in Great Britain and Ireland have not been dealt with in equal measure, presumably as a result of Wright’s sources and informants being unevenly spread. An electronic analysis of the dictionary shows, for example, that Yorkshire has been mentioned most often by far (76.661), whereas
204
Alexander Onysko, Manfred Markus and Reinhard Heuberger
the counties Denbighshire (England / 21), Sutherland (Scotland / 54) and Limerick (Ireland / 22) have received comparatively little attention. Some Irish counties, e.g. Kilkenny and Monaghan, merely have a single reference in the entire dictionary. In addition, Joseph Wright also uses cardinal directions in combination with counties to further specify dialect areas (e.g. n.Cmb. for ‘north Cambridgeshire’, sw.Cor. for ‘south-west Cornwall’). This issue will be discussed in more detail in chapter 5.2.1., as an example of how dialect reference is incorporated in the search interface. 3.1.2 Regional labels For England, Scotland, Wales and Ireland, Wright has also made regional distinctions. The names of these regions appear in abbreviated form, e.g. e.Cy., (East Country – England), and they generally comprise several counties. The areal definition of the regions mainly follows Wright’s considerations in his English Dialect Grammar (1905: 1-3) which is, in turn, partly based on Ellis’ classification of the dialects of England (cf. 1968 [1889]). Defining n.Cy., for example, Wright states the following: When I use the expression n.Cy., the northern counties or the northern dialects, I mean thereby Nhb. [Northumberland], Dur. [Durham], Cum. [Cumberland], Wm. [Westmoreland], Yks. [Yorkshire] (except sw. & s.Yks.) and the northern portion of Lan. [Lancashire]. (1905: 2) Since it is impossible to determine the exact signification of expressions such as “the northern portion of Lan.” and, in the EDD, Wright most consistently refers to counties as a whole, we have decided to interpret such definitions inclusively, i.e. a search for n.Cy. will include the lemmas of the entire counties concerned. On a general level of regional specification, Scotland and Ireland show a similarly detailed classification while Wales is only regionally subdivided into north and south. The regions that figure most prominently are the English n.Cy. (14.966) and e.An. (8.514). 
The Scottish regions have a few thousand references on average, the Irish regions a few hundred.
3.1.3 National labels
The list of nations in the EDD is restricted to countries and continents where English was spoken natively in the 19th century, that is, the USA, Canada, Australia, New Zealand, England, Scotland, Ireland and Wales. The USA is mentioned only 34 times, whereas the hypernym America occurs 2,553 times (see note 5). Interestingly, Scotland is cited far more often than England (11,359 vs 4,440 occurrences). This striking difference is presumably due to the fact that Wright had less precise information about dialectal usage in the Scottish counties and thus preferred the more general nation tag.
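The inclusive reading of regional labels described in 3.1.2 can be illustrated with a minimal Python sketch. It encodes only Wright's definition of n.Cy. quoted above; the names REGION_TO_COUNTIES, expand_area and inclusive_search, as well as the toy entries, are assumptions made for illustration and are not taken from the eEDD software.

```python
# Hypothetical sketch: inclusive expansion of a regional label to whole counties.
# Only n.Cy. is filled in, following Wright (1905: 2). The exceptions he names
# (sw. & s.Yks., southern Lan.) are deliberately ignored, as in the inclusive reading.
REGION_TO_COUNTIES = {
    "n.Cy.": ["Nhb.", "Dur.", "Cum.", "Wm.", "Yks.", "Lan."],
}


def expand_area(label):
    """Return the label itself plus every county it covers."""
    return [label] + REGION_TO_COUNTIES.get(label, [])


def inclusive_search(label, entries):
    """Return the lemmas whose area labels overlap with the expanded label."""
    wanted = set(expand_area(label))
    return [lemma for lemma, areas in entries if wanted & set(areas)]


# Toy entries (invented): a lemma plus the area labels found in its entry.
entries = [
    ("lemma-a", ["n.Cy."]),
    ("lemma-b", ["Lan."]),
    ("lemma-c", ["Dev."]),
]
print(inclusive_search("n.Cy.", entries))
```

With this design, a query for n.Cy. also returns entries that are marked only for one of the counties concerned, which is the trade-off between recall and geographical precision discussed in the text.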
Joseph Wright’s ‘English Dialect Dictionary’ in electronic form
3.2 Specific dialect attribution: source codes
The citations of the EDD are drawn from a great number of printed glossaries, which are referred to in the entries by means of source codes. These codes usually consist of the abbreviation of a dialect area in combination with a superscript figure. The majority of the source codes are mnemonic, since the numeral is the only difference between a source code and the corresponding county abbreviation. Thus, citations from the printed glossaries for Cheshire are immediately recognizable as Chs.1, Chs.2 and Chs.3. Only the abbreviation N.I.1, which stands for the counties Antrim & Down in Northern Ireland, lacks immediate transparency. The following entry shows how source codes are integrated in the dictionary.
BACK-ORDER, sb. Chs. Der. [ba.k.oda(r).] A countermand, a reversal of a previous command. s.Chs.1 Ahy woz tu utoo’kn dhem bee uss tu)th faer, bu mestur sent mi baak au rdurz [I was to ha’ tooken them beas-s to th’ fair, bu’ mester sent me back-orders]. Der. (H.R.)
Figure 2: Source codes in Wright’s EDD (example 1)
The source code s.Chs.1 indicates that the subsequent citation was taken from a book identified in the reference list as The Folk-Speech of South Cheshire by Th. Darlington. The source code is more precise than the county label Chs. in the head and shows that areal dialect information can vary between different parts of the same entry. This graded precision of dialect areas reflects the fact that Wright uses the head to give the general picture and to summarize information mentioned in other parts of the entry. Figure 3 provides another example of how information in the head can diverge from dialectal source codes, since the latter attest the existence of a specific sense rather than of the entry at large.
AGAIN, prep. Var. dial. uses in Sc. Irel. and Eng. Also written agaan, agean, agen, agin, agyen. See below. [agien, age’n, egin.] Used for against, in most of its mod. meanings. I. Of position. 1. Near, beside. n.Yks. Just ageean t’pleeace where Ah wur bred, Broad Yks. (1885) 27 ; n.Yks.2 i.e.Yks.1 Oor spot ligs agaan Helmsla. e.Yks.1 w.Yks. Nelly always sits again John (F.P.T.) ; Poor Bill, he wur leynd ageean t’wall, PRESTON
Poems, &c. (1864) 24. Lan.1 Agenvth’ heawseeend wur a little cloof o’ full o brids and fleawrs. Chs.1 He lives agen th’ chapel; [...]
Figure 3: Source codes in Wright’s EDD (example 2)
From a user perspective, this shows that dialect information in an entry can be both general and specific. While information in the head commonly has scope over the whole entry, source codes in citations are more closely bound to their individual usage examples and may characterise only specific meanings. This variance in the identification of dialect areas has led us to link areal abbreviations and source codes in the search application, so that a user gets the full picture and can retrieve entries for a particular county even if there is no explicit county label in the entry head.
3.3 Specific dialect attribution: correspondent codes
Another type of specific dialect information that can be inferred from the citations concerns oral sources, usually given by the initials of the correspondents in combination with a dialectal abbreviation. In Figure 4, E.S.F. stands for Rev. E. S. Fox from West Yorkshire (w.Yks.), and J.W.P. for J. W. Partridge from Worcestershire (Wor.).
BACK-LANE, sb. Yks. Lin. Rut. Lei. War. Wor. [bak-len, Yks. ba’k-loin.] A narrow, unfrequented street, gen. a by-way leading from the main thoroughfare. w.Yks. The side street in Snaith running parallel to the High Street is usually called Back Lane (E.S.F.). Lin. I tooke to my heels as hard as I could runne and got my selfe into a back-lane, BERNARD Terence (1629) 156. n.Lin.1 Thaay’re buildin’ a sight o’new hooses agean As’by back-laane fer th’ iron-stoan men to live in. Rut.1, Lei.1 War.3 When there is more than one road through a village, the least important is generally known as the back lane. Wor. (J.W.P.)
Figure 4: Correspondent codes in Wright’s EDD
As with source codes, the combination of correspondent codes and dialect labels indicates where a word or phrase was used. Since this piece of information can be very helpful in cases where there is no precise dialect label in the entry head, we intend to link dialect areas and oral sources. This also implies that the informational value of source codes and correspondent codes is twofold. On the one hand, they explicitly mark specific sources and will be retrievable as such on the search interface. On the other hand, they implicitly mark geographical
distribution and will thus be automatically included in queries for specific dialect areas.
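The twofold value of source codes and correspondent codes can be sketched as follows. The two regular expressions are assumptions about how the codes appear once the superscript numeral has been flattened to plain text (e.g. s.Chs.1); they are not the patterns used in the project, and the function names are hypothetical.

```python
import re

# Assumed plain-text shape of a source code: optional direction prefix ("s.", "ne."),
# a county abbreviation ending in a full stop, and the formerly superscript numeral.
SOURCE_CODE = re.compile(
    r"^(?:(?P<dir>[nsewm]{1,2})\.)?(?P<county>[A-Z][A-Za-z]*\.)(?P<num>\d+)$"
)

# Assumed shape of a correspondent code: initials in parentheses, e.g. "(E.S.F.)".
CORRESPONDENT = re.compile(r"^\((?:[A-Z]\.){2,}\)$")


def county_of_source_code(code):
    """Return the county abbreviation implied by a source code, or None."""
    match = SOURCE_CODE.match(code)
    return match.group("county") if match else None


def is_correspondent(token):
    """True if the token looks like a correspondent's initials."""
    return bool(CORRESPONDENT.match(token))


print(county_of_source_code("s.Chs.1"))
print(county_of_source_code("N.I.1"))
print(is_correspondent("(E.S.F.)"))
```

A non-mnemonic code such as N.I.1 does not fit the pattern and yields None, so codes of this kind would need an explicit lookup table.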
3.4 Some selected features and problems of dialect attribution in the EDD
One of the greatest merits of the electronic EDD (eEDD) will be the automated cross-linking between all counties, regions and nations, allowing the user to retrieve a significantly greater number of relevant results. More precisely, we have tried to allocate all counties to the regions to which they belong geographically, and the same has been done with regard to nations. For example, a search for the abbreviation of Wales (Wal.) yields merely 134 lemmas. This number will be multiplied by an inclusive query that uncovers not only the explicitly marked entries but also those referring to the regions of North Wales and South Wales and to all the Welsh counties, such as Anglesea, Merionethshire and Flintshire. The result list for this inclusive query will be arranged hierarchically so that the most relevant results are displayed on top. Thus, the hits for Wales as a nation will be listed first, followed by the Welsh regions and, finally, by the lemmas of individual Welsh counties. This feature will prove helpful for dialectologists who require a vast amount of data and do not mind the graded precision of the results (see note 6).
Despite this feature, the most apparent difficulty for the dictionary user is the heterogeneous positioning of dialect areas in the EDD and the varying modes of referring to dialect information. As mentioned, general dialect labels occur in the entry head right after the word class labels. But they may also refer to variants or phonetic transcriptions. Very often, dialect information is included in the citations, specifically in connection with Wright’s written and oral sources. Dialect information may further be provided in the comments section or, more exceptionally, in the definitions. This scattered mode of marking dialect areas needs to be taken into account when devising queries on the search interface (cf. chapter 5.2.1). To add to the complexity, the EDD also contains fuzzy dialect indications, e.g. in some parts of England, in other parts of Scotland.
The problem of separating such fuzzy dialect information from more specific reference demands a layered structure of dialect areas on the search interface (cf. 5.2.1). Furthermore, counties, regions and nations are sometimes referred to negatively, i.e. they are explicitly excluded as dialect areas (e.g. not in Sc., or not in gloss. of s.Chs. and Shr.). Such negated codes were kept apart as a separate subset of dialect information and excluded from further consideration during the parsing process.
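The hierarchically ordered result list of an inclusive query, as described for Wales at the beginning of this section, might be sketched as follows. Only the labels named in the text are used; apart from Wal., the strings are placeholders rather than Wright's actual abbreviations, and HIERARCHY and ranked_inclusive_search are hypothetical names.

```python
# Hypothetical hierarchy for Wales, restricted to the labels mentioned in the text.
HIERARCHY = {
    "Wal.": {
        "regions": ["North Wales", "South Wales"],
        "counties": ["Anglesea", "Merionethshire", "Flintshire"],
    }
}


def ranked_inclusive_search(nation, entries):
    """Return (lemma, level) pairs: nation hits first, then regions, then counties."""
    levels = HIERARCHY[nation]
    rank = {nation: 0}
    rank.update({region: 1 for region in levels["regions"]})
    rank.update({county: 2 for county in levels["counties"]})
    hits = []
    for lemma, areas in entries:
        matched = [rank[area] for area in areas if area in rank]
        if matched:
            # An entry is ranked by its most general matching label.
            hits.append((min(matched), lemma))
    hits.sort(key=lambda pair: pair[0])  # stable: input order is kept within a level
    names = ["nation", "region", "county"]
    return [(lemma, names[level]) for level, lemma in hits]


# Toy entries (invented): a lemma plus the area labels found in its entry.
entries = [
    ("lemma-x", ["Flintshire"]),
    ("lemma-y", ["Wal."]),
    ("lemma-z", ["South Wales", "Anglesea"]),
    ("lemma-q", ["Dev."]),
]
print(ranked_inclusive_search("Wal.", entries))
```

Negated references such as not in Sc. are not handled here; in line with the text, they would have to be filtered out before the area labels are collected.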
4. The parameter of definition
As far as the meaning of the lemmas is concerned, Wright gives their definitions in various layouts, depending on the individual entry. Moreover, definitions can be shaped by aspects of word formation, syntax and semantics. On
the other hand, elements of meaning can also be gleaned from parts of the entries other than the definitions proper.
4.1 Layout
The simplest case of layout is demonstrated in Figure 1: the definition is given in the paragraph of the head. However, with polysemous headwords or in the case of a phrase formed with the entry word, meanings are listed in separate paragraphs. Figure 5 shows an excerpt of the complex definition of the verb ACT (definition 6):
Figure 5: Listed meanings in definition 6 of the entry ACT
It is clear from this example that Wright does not list meanings like pearls on a string, but mixes them with phrases and, in this case, with derivations (Acting/Action). They are also interrupted by citations. This type of layout has caused special problems for our parser, which we have partly solved by classifying definitions manually. In other cases the definition slot is not bewilderingly complex, but empty (Figure 6):
Figure 6: Empty definition slot
Here Wright has not given the definition of acant in his own words, but silently refers the reader to his citations. The parser, of course, misses such silent references unless we mark them manually.
In yet other cases the definition is indicated by a cross-reference, as in Figure 7, definition 3:
Figure 7: Entry of DUMMOCK with cross-reference (see below) in definition 3
The entry in Figure 7 also demonstrates that the layout of meaning can take different shapes within one entry: the first meaning, that of the noun, is appended to the first paragraph; the second meaning is that of the verb; and the third meaning, that of the phrase, is not explicitly given but made clear by cross-reference. In sum, layout is helpful for the computational identification of definitions only up to a point.
4.2 Patterns of word formation and syntax
The diverse layout correlates with patterns of word formation and syntax. Specifically, Hence (see Figure 5) serves as a marker introducing derivations, with the definitions given after the part of speech labels. Figure 8 provides another example of this function of Hence:
Figure 8: Hence, marking derivations, at the beginning of a paragraph
Compounds, by contrast, can be traced by the introductory marker comp. (see Figure 9) and phrases (see Figure 7) by the initial string phr.:
Figure 9: Use of comp. in entry ADDLE (lit. ‘stagnant water’)
In addition to phr. and comp., Wright has sometimes used the tag comb. (‘combination’), apparently in cases where he found it hard to decide whether a word cluster should be classified as a compound or as a syntactic group (phr.). We have decided to retain this intermediate category so as to respect Wright’s expertise and the historical importance of the text. In other cases, however, Wright has summarised a paragraph of word clusters under the word pair comp. and phr., as in Figure 10. While the expressions concerned can be manually disentangled and attributed to either of the two types, we will, for the time being, keep the mixed bag comp. and phr. untouched as an autonomous category.
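The role of the introductory markers Hence, comp., comb., phr. and comp. and phr. can be illustrated with a small classifier. This is a sketch under stated assumptions: markers are taken to stand at the very beginning of a paragraph, optionally after a sense number, and the sample strings in the usage lines are illustrative rather than quoted from the EDD.

```python
import re

# Longer markers come first, so that "Comp. and phr." is not read as plain "Comp.".
MARKERS = [
    (re.compile(r"comp\.\s+and\s+phr\."), "compound or phrase"),
    (re.compile(r"hence\b"), "derivation"),
    (re.compile(r"comp\."), "compound"),
    (re.compile(r"comb\."), "combination"),
    (re.compile(r"phr\."), "phrase"),
]


def classify_paragraph(text):
    """Classify an entry paragraph by its introductory marker (case-insensitive)."""
    # Drop a leading sense number such as "4." before looking for the marker.
    stripped = text.lstrip("0123456789. ").lower()
    for pattern, label in MARKERS:
        if pattern.match(stripped):
            return label
    return "other"


print(classify_paragraph("4.Comp. (1) Beef-balks, a shelf or beam for storing beef"))
print(classify_paragraph("Hence Acting, sb."))
print(classify_paragraph("Comp. and phr. (1) All-a-bits"))
```

Keeping comb. and comp. and phr. as labels of their own mirrors the decision to leave Wright's intermediate categories untouched.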
4.3 Semantic patterns
Given that meanings are parsed appropriately as definitions, the question arises how this information can be retrieved. In the EDD, meaning stands for many different things, depending on whether we are talking about Wright’s explanation of the letter A, the definition of a flower term by its Latin name, or the grammatical explanation of a function word. Moreover, dialect terms can be tied to specific phrasal expressions. In this case, the definition of the lemma consists of the actual phrase, i.e. the definition is given by its idiomatic use. This leads to a structural overlap between the search options in phrases and definitions (see Figure 11), which will have to be corrected manually. Figure 10 (lemma ALL) demonstrates a further problem in classifying definitions, caused by the interspersed reference of meaning, compounds and phrasal elements. Because such definitions are parsed as fragments of single words, the whole definition and its appropriate designation (e.g. All-a-bits ‘in pieces or rags’; Figure 10) cannot be retrieved immediately; a keyword search in definitions will nevertheless guide the user close to the defined phrase or compound. In answer to these restrictions, we plan to make the phrases explicitly available for analysing dialectal variation. In view of the fact that the EDD contains the
abbreviation phr. 9,818 times and the related label comb., as a marker of syntactic groups, 3,756 times, the definitions of these lexical units are a most promising field of dialectal research, the more so since every single marker may introduce a dozen or more examples, as in the case of ALL in Figure 10.
Figure 10: Variable form of definition
4.4 Semantic elements outside definitions
Elements of meaning can also be found in Wright’s EDD outside the definitions proper, e.g. among usage labels, citations and comments. The latter are given in square brackets at the end of entries (see note 7). Suffice it to give one example of such implicit meaning: the marker diminutive, which appears 200 times in total, mostly occurs in comments but sometimes also in other entry parts (e.g. head and definition). The use of diminutive suffixes with an emotional connotation is a striking feature of English dialects. This example is meant to remind us that dialectal features may also be couched in morphological productivity and are, thus, more than just the nomenclature affiliated with country life, from hay to horses (see note 8). Wherever features of meaning are hidden in an entry, we try to trace them as comprehensively as possible and to make them retrievable via the search interface.
5. Outlining the search interface
In general, the EDD is very consistent in its lexicographic macrostructure and typesetting (cf. Markus 2007, Thompson 2006: 12-14), which makes it possible to postulate formal rules for automatically parsing the various structural units (head, definition, citation and comment) to a sufficient extent (see note 9). This parsing is taken as the basis for retrieving information from the eEDD. For the conception of the search interface our aims are twofold: on the one hand, we aim to map the eEDD as closely to the original as possible. This involves retaining its entry structure, highlighting the various
subsections in the entry, and providing the possibility of comparing the digital output with an image of the original dictionary entry. On the other hand, the search interface should allow sophisticated access to the information contained in the dictionary. This means that users should be able to tailor their queries to certain entry parts and to search for specific dialectal information (e.g. to find all verbs that occur in Yorkshire, Northumberland and Cumberland, or all nouns in Scotland that are etymologically marked as deriving from Old Norse).
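The rule-based parsing of structural units mentioned at the start of this section can be illustrated with the head-boundary rules summarised in note 9. The sketch below is an approximation, not the project's parser: the list of boundary words contains only the examples given in the note, treating the transcription itself as part of the head is an assumption, and the second sample entry is invented.

```python
import re

# Function words taken from the examples in note 9; the real list is presumably longer.
BOUNDARY_WORDS = {"a", "the", "of", "having"}


def split_head(entry_text):
    """Split an entry into (head, rest) using the two rules described in note 9."""
    # Rule 1: the head ends with the phonetic transcription in square brackets.
    bracket = re.search(r"\[[^\]]*\]", entry_text)
    if bracket:
        return entry_text[:bracket.end()].strip(), entry_text[bracket.end():].strip()
    # Rule 2: otherwise the first boundary word is taken to start the definition.
    words = entry_text.split()
    for i, word in enumerate(words):
        if i > 0 and word.lower() in BOUNDARY_WORDS:
            return " ".join(words[:i]), " ".join(words[i:])
    return entry_text.strip(), ""


head, rest = split_head(
    "BACK-ORDER, sb. Chs. Der. [ba.k.oda(r).] A countermand, a reversal of a previous command."
)
print(head)
print(rest)
print(split_head("FOO, sb. Yks. A kind of thing"))  # invented entry without transcription
```

Since square brackets also enclose glosses within citations, this naive version would misfire on entries that lack a transcription but contain a bracketed gloss, which is one reason why such rules need manual proofreading.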
5.1 Functional outline
In line with these aims, we have devised a search interface that strives to provide flexibility in accessing the dictionary and to offer high granularity of information retrieval. Figure 11 provides a schematic overview of the search interface.
Figure 11: Schematic outline of the search interface
Basically, the user will have two main options for accessing the dictionary. First of all, a headword search will be available as the default mode: any string (including truncation symbols) can be entered in the field Search for and the respective hits will be retrieved from the database. This word (or string) search can be limited in scope to the various structural units in the entries (full text, heads, definitions, citations, comments, variants, compounds, derivations and phrases).
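A string search with truncation symbols that is limited to selected entry parts might look like the following sketch. It assumes that each entry has already been parsed into named parts; the field names and the function search are hypothetical, and the standard-library function fnmatchcase stands in for whatever matching the actual interface uses. The sample texts are taken from the entries BEEF and BACK-LANE quoted in this article.

```python
from fnmatch import fnmatchcase

# Two entries split into structural units; the field names are assumptions.
ENTRIES = [
    {
        "head": "BEEF",
        "definitions": "An ox or cow intended for slaughter.",
        "citations": "",
        "comments": "",
    },
    {
        "head": "BACK-LANE",
        "definitions": "A narrow, unfrequented street, gen. a by-way leading "
                       "from the main thoroughfare.",
        "citations": "The side street in Snaith running parallel to the High "
                     "Street is usually called Back Lane (E.S.F.).",
        "comments": "",
    },
]


def search(pattern, entries, parts=("head",)):
    """Return the heads of entries in which a word of the selected parts matches.

    '*' serves as the truncation symbol, as in the query cow*.
    """
    pattern = pattern.lower()
    hits = []
    for entry in entries:
        words = " ".join(entry[part] for part in parts).lower().split()
        # Strip surrounding punctuation so that "cow," still matches "cow*".
        words = [word.strip(".,;:()[]'\"") for word in words]
        if any(fnmatchcase(word, pattern) for word in words):
            hits.append(entry["head"])
    return hits


print(search("cow*", ENTRIES, parts=("definitions",)))
print(search("back*", ENTRIES))
```

Passing several parts at once, e.g. parts=("definitions", "citations", "comments"), widens the scope of the same query.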
The second basic option is to tap the information in the dictionary by selecting Search filters (dialect area, usage label, part of speech, source, phonetic, morphemic, etymology and time span). This will allow a user, for example, to choose Scotland from the list of dialect area (nation) and find all the entries that bear reference to Scotland (for further details on dialect areal queries see 5.2.1.). Similarly, a user can search for entries containing specific usage labels (e.g. frequently, obsolete, slang, etc), etymological information (e.g. Middle High German, Old Norse, Middle Dutch, etc), part of speech categories (e.g. noun, adjective, etc; this also includes basic morphological and syntactic markers, e.g. diminutive, intransitive, etc) and source information (e.g. printed sources, correspondent codes and works from the general bibliography). These filter options are for the most part based on information explicitly provided in the dictionary. Thus, the standard abbreviations for counties, regions and nations, the part of speech labels, etymological references and the abbreviations for correspondents, dialect glossaries and general reference works are extracted from Wright’s lists of abbreviations, correspondents and unpublished works, as well as from the general bibliography at the end of volume 6. Usage label appears as a hybrid category in the sense that abbreviations employed by Joseph Wright are mixed with various keywords (e.g. inanimate, synonym, cant, derogatory, etc). We have defined the latter as separate labels from the dictionary according to their regular occurrence in the entries and for their relevance from a current lexicographic perspective (cf. Béjoint 2000). The remaining options among Search filters incorporate functions going beyond Wright’s basic lexicographic markers as given in lists of abbreviations and sources. 
Thus, the phonetic search (envisaged for a later stage of the project) will feature a separate search window, allowing queries for phonetic symbols. The category morphemic will offer a list of derivational morphemes, and time span will create the possibility of searching dictionary entries per year and period. This means that a user can search for dialect words for any given year and by periods of ten years between 1700 and 1900. In addition, we also intend to implement a search for the earliest date of mention of entries in the EDD. Since dates are based on bibliographic information, they merely provide indirect evidence. Nevertheless, they can facilitate investigating the use of dialect words in literary works cited in the entries. In addition to these features, we are planning to implement parts of the database, in particular the citations, as self-contained corpora, with our software allowing work indexes and concordances. Apart from offering searches in specific entry sections or providing a selection of parameters in scroll-down menus among search filters, the interface will turn into a fully fledged query engine when the user combines search filters, specific string searches and queries in selected entry parts. If users, for example, are interested in terms relating to cow in England, they can type in cow*, search in definitions, and select the nation label England from the search filters. In addition, basic Boolean operators provide the option of combining the categories of search filters and also of selecting several parameters among the different search filters. While cross-categorial combinations between dialect area, usage
label, part of speech, etymology and source are restricted to AND queries, selections within categories allow for AND and OR logic. Consequently, highly complex queries become possible, e.g. (Scotland OR England OR Wales) AND (Substantive AND Plural) AND (Slang) AND (Old Norse). These can basically be run in the various entry parts (cf. Figure 11; see note 10). In order to maintain an overview of one’s filter selections, a search protocol will automatically be generated, and the last 10 search commands will be stored for immediate retrieval. As the last example indicates, the search engine will have the capacity to tap the deepest wells of the EDD and thus to bring to light both its potential and its limitations. In its options for combining query parameters, the search interface might at times appear too powerful and even dig beyond the substance of dialectal information in the EDD.
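The Boolean logic of the search filters (AND across categories, AND or OR within a category) can be made concrete with a short sketch. The data model, i.e. one set of labels per filter category, and the function name matches are assumptions for illustration; the query is the example given in the text, and the entry is invented.

```python
def matches(entry, query):
    """AND across filter categories; 'any' (OR) or 'all' (AND) within a category.

    entry: dict mapping a category to the set of labels found in the entry
    query: dict mapping a category to a (mode, set of wanted labels) pair
    """
    for category, (mode, wanted) in query.items():
        found = entry.get(category, set())
        if mode == "any":
            if not (wanted & found):
                return False
        elif mode == "all":
            if not (wanted <= found):
                return False
        else:
            raise ValueError(f"unknown mode: {mode}")
    return True


# (Scotland OR England OR Wales) AND (Substantive AND Plural) AND (Slang) AND (Old Norse)
query = {
    "dialect area": ("any", {"Scotland", "England", "Wales"}),
    "part of speech": ("all", {"Substantive", "Plural"}),
    "usage label": ("all", {"Slang"}),
    "etymology": ("all", {"Old Norse"}),
}

# Invented entry, for illustration only.
entry = {
    "dialect area": {"Scotland"},
    "part of speech": {"Substantive", "Plural"},
    "usage label": {"Slang"},
    "etymology": {"Old Norse"},
}
print(matches(entry, query))
```

Every additional category narrows the result set, which is why highly specific combinations may return few or no hits.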
5.2 Retrieving dialect areas and searching in definitions
After the presentation of the basic structure of the search interface, this section will touch upon issues raised in the preceding chapters and show their implementation and their repercussions on query design. The multi-layered reference to dialect areas (cf. chapter 3) calls for a nested, albeit more complicated, interface structure. The vicissitudes of meaning appear at first sight as limiting factors as far as retrieving information from the entry part definitions is concerned. With the proper scope of the queries, however, the initial limitations can be compensated for.
5.2.1 Structure and representation of dialect areas
In line with the tripartite classification of dialect areas into nations, regions and counties and further sub-classifications into cardinal directions and fuzzy phrasal expressions (cf. chapter 3.1), the search interface offers the possibility of selecting areal abbreviations among these categories. In addition, we have decided to combine the levels of counties and regions with nations. Accordingly, selecting England will not only yield the entries that contain the explicit abbreviation Eng. but also all the lemmas that are attributed to any of the English counties or regions. On the other hand, users interested in investigating general dialectal terms of England will have to be able to search for the explicit label Eng. only. These considerations lead to a layering of filter options on the search interface. Figure 12 provides an overview of the types of dialect information. While directional and fuzzy information occurs among nations, regions and counties, Wright did not comprehensively apply these modes of reference to all basic abbreviations. In the case of Yorkshire (Yks.), where he drew on a large pool of informants and other sources, he made more subtle distinctions between n.Yks., s.Yks., e.Yks., w.Yks., m.Yks. (mid Yorkshire), ne.Yks., nw.Yks., se.Yks. and sw.Yks.
Counties for which he lacked substantial information are devoid of further subdivisions. This is particularly evident in Ireland, for which only 5 out
of 32 counties include directional specification. Furthermore, in the majority of Scottish (25 of 44) and Welsh counties (7 of 11), Wright abstained from providing directional specification.
[Figure 12 diagram; recoverable labels: Dialect Area; Nation, Region, County; explicit, directional, fuzzy]
Figure 12: Overview of areal dialect information in the EDD
As regards fuzzy areal information, merely 35 out of 126 counties are referred to by means of vague phrasal expressions, such as in some parts of Lancashire or in many parts of Cheshire. Nevertheless, this information of restricted value is incorporated in the search interface. Figure 13 shows the varying types of dialect specification for a few English counties.
Figure 13: Excerpt of search interface: dialect area – county
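The three types of areal reference distinguished in this section (explicit, directional and fuzzy) can be told apart by simple pattern matching, as in the sketch below. The patterns are assumptions derived from the examples in the text. Since regional labels such as n.Cy. have the same shape as directional county labels, they are listed separately; the set contains only the regions named in this article.

```python
import re

# Regional labels named in the text; they look like directional county labels
# and must therefore be recognised before the directional pattern is tried.
REGIONS = {"n.Cy.", "e.Cy.", "e.An."}

DIRECTIONAL = re.compile(
    r"^(?P<dir>ne|nw|se|sw|n|s|e|w|m)\.(?P<area>[A-Z][A-Za-z]*\.)$"
)
FUZZY = re.compile(r"^in (?:some|many|other) parts of (?P<area>.+)$")


def classify_reference(ref):
    """Return (kind, area), where kind is 'explicit', 'directional' or 'fuzzy'."""
    fuzzy = FUZZY.match(ref)
    if fuzzy:
        return "fuzzy", fuzzy.group("area")
    if ref in REGIONS:
        return "explicit", ref
    directional = DIRECTIONAL.match(ref)
    if directional:
        return "directional", directional.group("area")
    return "explicit", ref


for ref in ["Yks.", "ne.Yks.", "n.Cy.", "in some parts of Lancashire"]:
    print(ref, classify_reference(ref))
```

A layered filter could then offer, for each county, the explicit label alone or the explicit label together with its directional and fuzzy variants.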
5.2.2 The elusiveness of meaning
The discussion in chapter 4 has demonstrated that the explication of meaning is far from homogeneous in the dictionary. From the perspective of the eEDD, the different facets of meaning representation raise the questions of how far definitions can serve as elements of investigation and which search strategies can be employed in order to retrieve meaning adequately. Preliminary investigations indicate that searching the lemma definitions of the eEDD will prove interesting in terms of lexical fields. Dialects are generally rich in lexical variation for concepts pertaining to speakers’ everyday concerns (cf. Chambers and Trudgill 1998). As the EDD covers the period of Late Modern English (see note 11), speakers’ language use is likely to be shaped by an interesting clash of the traditional and the modern, i.e. the meeting of rural values of agricultural life, on the one hand, with concepts arising from the heyday of the industrial revolution, on the other (see note 12). Probing the traditional side, terms of the lexical field of domesticated animals were presumably popular among rural dialect speakers in the 18th and 19th centuries. Indeed, the number of hits found in the dictionary for terms like cattle (TF 1,564; see note 13), cow (TF 1,777), sheep (TF 2,674), pig (TF 1,062), dog (TF 1,496) and horse (TF 3,259) hints at the prolific documentation of dialectal terms in this lexical field. How can a user be guided in finding dialect words by such semantic keyword searches? The primary home of meaning in the eEDD is the entry part classified as definitions (cf. Figure 11). Restricting the search to this part of the dictionary entries will cover the standard case of meaning explication for monosemous and polysemous lemmas. The way this works may be seen from the definition in the entry BEEF. The first four senses are the following:
1. An ox or cow intended for slaughter.
2. A fibrous carbonate of lime, with a texture resembling fossil wood.
3. Riming slang for ‘stop thief!’
4. Comp. (1) Beef-balks, a shelf or beam for storing beef; (2) -ball, a beef-dumpling; (3) -brewis, beef-broth; (4) -case, a ladder-shaped frame, hung horizontally under the ceiling near the fire, on which beef was placed to dry; (5) -eater, see below; (6) -head, a blockhead, fool; (7) -heart, a cow’s heart ready for cooking; (8) -steak rock, (9) -tree, see below.
Applying keyword queries, a search for the term cow* will uncover definition 1, which reflects the metonymical extension of the anthropocentrically driven functionalisation of the animal for its meat. A user interested in finding dialectal terms for thief will be able to retrieve the meaning of beef as “Riming slang for ‘stop thief!’” from definition 3. The fourth sense of BEEF is an example of the mixture of compounds and their respective meanings under the larger structural umbrella of the definition of a lemma. At the stage of automated parsing such mixed bags
of compounds and meanings were parsed as definitions. In order to reduce the lexical paraphernalia in this entry part, manual classification of the actual meanings would be necessary. The completion of this task is envisaged for the second phase of our research project SPEED in 2009 and after. At that stage, we will also try to minimise another shortcoming of searching in definitions, namely that a keyword search in definitions alone currently neglects instances of meaning present in other entry parts (e.g. citations). To avoid losing this information, it is advisable to extend the semantic keyword search to citations and comments. At present, this is a slight limitation in terms of user-friendliness and accessibility of the data in the dictionary.
6. Conclusion
As the remodelling of the EDD into an electronic version is still under way, this article provides an overview of some of its most important lexicographic parameters and discusses their implementation on the search interface. The attribution of dialectal terms to specific areas is one of the characteristic features of the EDD, as Wright painstakingly tried to provide comprehensive information on where a dialect expression was used. Accordingly, reference to dialect area is encoded in the dictionary in multiple ways: in abbreviations pertaining to national, regional and county nomenclatures as well as in references to dialect glossaries and correspondent codes. Furthermore, Wright employed cardinal directions to fine-tune areal differences and, in a few instances, provided areal information in the form of fuzzy phrasal expressions. To adequately reflect these layers of reference to dialect areas, the electronic version will allow inclusive and combined searches of these parameters while still retaining the flexibility to search for explicit and individual dialect areas. The issue of lexical meaning looms as a further crucial and complex matter, since meaning can be incorporated at various locations in an entry (e.g. in citations, in cross-references, and immediately following compounds, derivations and phrasal expressions). This structural heterogeneity renders the automated recognition of meaning elements a highly difficult undertaking and has forced us to focus our present attention on the standard case of meaning in monosemous and polysemous entries. For the time being, searching among definitions in the electronic version will bear the marks of these structural restrictions. They can, however, be compensated for by more integrative Search in selections.
Overall, the possible combinations of filter parameters and entry parts will allow a comprehensive grasp of the information contained in the EDD and, thus, inspire new insights into the dialectal landscape of Late Modern English.
Notes
1 The research team of the government-funded project SPEED (Spoken English in Early Dialects) includes Manfred Markus (director), Reinhard
Heuberger (co-director), Alexander Onysko (project manager), Raphael Unterweger (software developer), Christian Peer and Christoph Praxmarer (junior researchers).
2 A tentative query in the OED 2 on CD-ROM for dial* provides merely 7,942 results, and specific dialect areas such as counties are rarely mentioned. Searching for Worcestershire, for example, yields eleven entries, four of which refer to the word “Worcester(shire)” itself. The current online version of the OED, however, is in the process of integrating entries of the English Dialect Dictionary (Philip Durkin, personal communication).
3 The term dialect is used here in Wright’s non-modern sense of dialect as regional variation (cf. Quirk et al. 1985: 16).
4 No distinction is made between Northern Ireland and the Republic of Ireland, in line with the political reality towards the end of the 19th century.
5 Wright refers to America as a cover term for the U.S.A. and Canada. In some entries, e.g. POINTER, the abbreviations U.S.A. and Can. are listed individually, whereas there seem to be no entries where Amer. is combined with either of the other dialect markers.
6 The geographical precision of the hits is reduced for dialect areas that appear lower in the result list, as they extend the literal, i.e. narrow, interpretation of the search item. The benefit of more inclusive searches, however, becomes evident as soon as further search filters (e.g. adj., slang or figurative) are selected, since they tend to demand a reasonably sized pool of data to deliver results.
7 Another way of tracing meaningful elements in the various slots is by searching for words such as meaning and sense. These items occur 422 and 754 times in the dictionary, respectively.
8 Cf. the keyword lists in various lexical fields as listed by Francis (1983: 54-60) in line with previous research.
9 Manual proofreading shows an error rate of about 15% for automated structural parsing. The head, for example, was specified as the area between the lemma and the phonetic transcription, formally marked by square brackets. For the fairly numerous entries that lack phonetic transcription, a set of function words and auxiliary verbs (e.g. a, the, of, having) was taken as boundary markers, since they typically initiate definitions. Apart from the main structural units, however, the proper recognition of variants, phrases, compounds and derivations demands
plenty of manual post-editing as these entry segments fail to provide clear boundary markers.
10
For complex queries, however, it is advisable to keep the Search in options fairly broad to obtain results. Furthermore, a few filter parameters are tied to specific entry parts such as etymological abbreviations, which are exclusively given in the comments section.
11
Cf. Beal (2004) for a comprehensive overview of this period.
12
Joseph Wright himself grew up under harsh circumstances during the late industrial revolution. He had to work in a cotton mill already at age 7 and learned how to read and write only in his teenage years, attending evening classes and Sunday school (cf. Holder 2004: 229-34).
13
Token Frequency was ascertained with Wordsmith Tools for the full text of all dictionary entries.
References
Beal, J.C. (2004), English in Modern Times. London: Arnold.
Béjoint, H. (2000), Modern Lexicography. Oxford: OUP.
Chambers, J.K. & P. Trudgill (1998), Dialectology. 2nd ed. Cambridge: CUP.
Darlington, Th. (1887), The Folk-Speech of South Cheshire. English Dialect Society.
Ellis, A.J. (1968 [1889]), On Early English Pronunciation. Reprint. New York: Greenwood Press.
Francis, W.N. (1983), Dialectology. An Introduction. New York: Longman.
Holder, R.W. (2004), The dictionary men: their lives and times. Bath: Bath UP.
Markus, M. (2007), ‘Wright’s English Dialect Dictionary Computerised: Towards a New Source of Information’, Online publication VARIENG Helsinki.
Markus, M. (2007), ‘Wright’s EDD Computerised: Architecture and Retrieval Routine’, Online publication Conference Dagstuhl Dec. 2006.
Markus, M. and R. Heuberger (2007), ‘The Architecture of Joseph Wright’s English Dialect Dictionary: Preparing the Computerised Version’, International Journal of Lexicography 20 (4), 355-68.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive grammar of the English language. London: Longman.
Scott, M. (2004), Wordsmith Tools. 4th ed. Oxford: OUP.
Thompson, A. (2006), Joseph Wright’s slips: a linguistic evaluation of the making of the English Dialect Dictionary. Unpublished M.A. Thesis: Leeds.
Wright, J. (1898-1905), The English Dialect Dictionary. 6 vols. Oxford: Henry Frowde.
Wright, J. (1905), The English Dialect Grammar. Oxford: Henry Frowde, [repr. 1968].
How representative are the ‘Philosophical Transactions of the Royal Society’ of 17th-century scientific writing?
Lilo Moessner
RWTH Aachen, Germany
Abstract
The focus of the paper is on the notion of representativeness. It is approached from three different angles. In the first section, representativeness as a (desirable? possible?) property of linguistic corpora is discussed. Then the point of view is narrowed down to the R (for ‘representative’) in ARCHER, and here in particular to the register ‘science’. In the following empirical part, a multidimensional analysis of English science texts of the 17th century is presented. It is based on a corpus which comes in equal parts from ARCHER and from other sources. The comparative analysis reveals major differences between the sub-corpora. They are interpreted in section 4 as different degrees of representativeness. The last section contains a summary and the conclusion that the linguistic structure of English science texts of the 17th century is not fully represented by a random sample of texts from the Philosophical Transactions.
1. Representativeness as a property of linguistic corpora
When the pioneers of sociolinguistics applied sociological methods in linguistic studies, they were very explicit about the requirement of representativeness and about the sampling methods to be used in order to achieve this goal: “There is only one way to ensure that the results obtained in an incomplete survey of this kind can legitimately be said to apply to the population as a whole: the section of the population which is to be studied must be selected by ‘accepted statistical methods’ (...). The informants, that is, must constitute a genuine representative sample of the city’s population.” (Trudgill 1974: 21) The ‘accepted statistical methods’, which are described in much detail (Trudgill 1974: 21-25), produced what handbooks of empirical social sciences refer to as quasi-random or stratified samples. They were not completely random, because not every member of the population had the same chance of being selected. In Trudgill’s study on the social stratification of Norwich English it was important that members of all social strata were included in the sample, and this would not have been guaranteed by complete random sampling. The compilers of the Brown Corpus probably aimed at representativeness, too, since Francis’s understanding of a corpus was “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to
be used for linguistic analysis” (Francis 1979: 110). It is, however, doubtful if representativeness was achieved. One problem lies in the determination of the universe of texts from which the samples are to be taken, and another in the decision on the number of text categories as well as the number and size of the texts to be included in each category. For the Brown Corpus the universe of texts was defined as “edited English prose printed in the United States during the calendar year 1961” (Brown Corpus Manual). For practical reasons, most texts were taken from the holdings of Brown University Library, which narrows down the scope of the textual universe. The list of the text categories and the number and size of the texts to be included in each category were set up at a conference at Brown University in 1963. After these initial decisions were taken, the actual sampling process could start, and here the rules of random sampling were followed. These shortcomings concerning representativeness are explicitly mentioned in the manual to the LOB Corpus: “... the present corpus is not representative in a strict statistical sense..... The true ‘representativeness’ of the present corpus arises from the deliberate attempt to include relevant categories and subcategories of texts rather than from blind statistical choice.” (Johansson 1978: 14). Since the FLOB and Frown corpora were planned as exact counterparts to LOB and Brown, the same compilation principles were followed. Necessary modifications in the press section of FLOB are described in Sand/Siemund (1992). It is generally agreed that the compilation of historical corpora is an even more daunting enterprise, especially when early periods are to be included (cf. Biber et al. 1998: 251-53; Meyer 2002: 37f.). I need not point out the problems arising from the limited amount of data and their poor quality. 
What is more relevant in the present context is the classification of the textual universe, be it ever so limited and so poor. A lot of work lies behind the division of the texts in the Helsinki Corpus into 33 text types and 7 prototypical text categories, respectively (cf. Kytö/Rissanen 1993: 10-14; Rissanen 1994: 76f.; Kytö 1996: 46). But the principles of classification are as subjective as in the corpora mentioned before, and the universe of texts from which the text selection was made is even less precisely specified. I would like to stress that I have no intention of playing down the merits of the Helsinki Corpus; like so many others I am very grateful that we have it. But we should be aware of its limitations, when we want to generalize on the basis of results obtained from its data. ARCHER is the long-term diachronic English corpus which complements the Helsinki Corpus with texts from the middle of the 17th to the end of the 20th century. Its name documents that it was intended as A Representative Corpus of Historical English Registers. This aim is explicitly stated in the description of the corpus: “It was our aim ... to collect a representative corpus of texts in each of the several registers” (Biber/Finegan/Atkinson 1994: 4). The compilers of ARCHER were aware of the problems ensuing from their ambitious aim, and the guidelines which they followed in the compilation process allow us to assess the amount of representativeness they could hope to attain. Ideally the textual universe would
have been derived from an exhaustive list of all texts written between 1650 and 1999. For practical reasons, the starting-point was “the major research libraries of the University of Southern California, the University of California at Los Angeles, and the Huntington Library in San Marino” (ibid.: 5). This is basically the same procedure as that adopted for the Brown Corpus. The selection of the texts and their assignment to 10 categories were probably as much a result of careful deliberation and agreement among the research team as in the compilation of the Brown Corpus. Some categories of the Brown Corpus have counterparts in ARCHER, some Brown categories are missing from ARCHER, and some ARCHER categories are absent from the Brown Corpus.1 Special decisions were taken for the compilation of the register ‘science’, and these will be dealt with separately. After these initial considerations it seems that representativeness is an unachievable goal and therefore not worthwhile aiming at any longer. The first part of this claim was explicitly supported by Mukherjee in his review article of three recent introductions to corpus linguistics,2 and it was implicitly admitted by the expression “the holy grail of representativeness” in Leech (2007: 134).3 This pessimistic attitude is easy to understand if we ask which conditions must be fulfilled for a corpus to be representative. A representative corpus is a random sample of a population, which exactly maps the structure of the population. This, however, can only be guaranteed when we know so much about the structure of the population that sampling becomes superfluous.4 This consequence is not drawn by either of the linguists mentioned before, nor does it correspond to the research reality in corpus linguistics.
This more optimistic point of view is based on the conviction that representativeness is not an either/or property, but that there are degrees of representativeness and that there are methods helping to approach the ideal of maximum representativeness. These methods involve the use of the competence of expert corpus linguists in determining the number and relative importance of genres, the appropriate degree of their subclassification, and the size of the text samples to be included. Leech (2007: 140f.) contrasts two proposals for achieving a still higher level of representativeness. His own consists in making ACEs (Atomic Communicative Events) the yardstick for the inclusion of texts into a corpus. This means that the proportion of texts of a special category is not measured by the number of texts produced, but by the number of text recipients. One of the consequences of this approach would be that in a corpus of PDE, tabloid papers and everyday conversations would be represented by more texts than broadsheet papers and academic lectures or sermons. The other method was proposed by Biber (1993) and taken up in Biber et al. (1998: 250). It consists in a cyclic procedure, starting with the compilation of a pilot corpus. The results of a multidimensional analysis of this corpus reveal those registers in which the linguistic variation of the whole corpus is insufficiently evidenced. More material is added to them, and the cycle starts again with the enlarged corpus. It is repeated again and again, until stable variation is achieved. A similar line of research is also envisaged for the compilation of a web-based corpus in Biber et al. (2007: 126).
2. The register ‘science’ in ARCHER
In the context of representativeness, three registers of ARCHER deserve special attention; these are ‘legal opinion’, ‘medicine’, and ‘science’. The universes from which their texts were chosen were narrowed down to “appellate and Supreme Court decisions of the Commonwealth of Pennsylvania” (Biber et al. 1994: 6), to articles published in the Edinburgh Medical Journal, and to those in the Philosophical Transactions of the Royal Society (PTRS). These restrictions may have been motivated by the research aim to investigate the evolution of “representative journals” (ibid.: 2),5 but it is equally probable that more practical reasons dictated this strategy. That this was definitely so and that this led to even greater restrictions for the register ‘science’, is openly admitted: “as an expedient in the face of diminishing resources, we targeted volumes representing central years within each period rather than all volumes throughout the period” (ibid.: 6). This means that the first 50-year period of the register ‘science’ is represented by articles of the PTRS of the years 1674 and 1675. They deal with very different topics, ranging from “Brief Directions on How to Tan Leather”, “A Phytological Observation concerning Orenges and Limons”, “Advertisements ... upon Frosts in some parts of Scotland” to “Considerations Touching the Compression of the Air”. They come in 10 files of about 2,000 words each. As a consequence of random sampling and due to the fact that scientific articles of the 17th century were often much shorter than those of modern times, the majority of files contains more than one text. The text passages of the 13 identifiable authors range from a little more than 100 to just over 2,000 words. The size of the texts whose authors are not indicated or are marked ‘anonymous’ amounts to nearly 40%. This structure of the science sub-corpus of ARCHER casts some doubt on its representativeness.
A more serious problem lies in the choice of the textual universe from which the samples were taken. Although the leading role of the Royal Society in 17th century science is undisputed and although its language policy had a shaping influence on scientific English,6 the PTRS were not its only, perhaps not even its most important, voice. Papers read during the meetings of the Royal Society could be included in its History, in its Register-book, or be published as monographs. Following Biber (1988: 13), who argues that there is a systematic relation between extralinguistically defined registers, in this case science, and their linguistic structure, it will be assumed that different publishing formats of 17th century science texts resulted in different variation patterns. If this hypothesis can be empirically supported, the representativeness of the sub-corpus ‘science’ of ARCHER is not yet perfect. In the next section I will present results from a comparative analysis of the 17th century sub-corpus ‘science’ of ARCHER and of a matching corpus of science texts written by scholars connected with the Royal Society and published in book form.
3. A multidimensional analysis of 17th century science texts
3.1 The data7
The texts of the 10 science files from ARCHER were originally sent to Henry Oldenburg, the editor of the PTRS, some of them as responses to queries from the Royal Society. Others reached him through intermediaries who wished to bring them to the attention of a wider scientific public. It depended mainly on Oldenburg if an article was included in the PTRS or not. All ARCHER files are prefixed by an identification label which specifies the date of publication and the author of the first text of the file. In this study, the files will be referred to by abbreviations of the text authors.8 The control corpus also contains 10 files of about 2,000 words each, but each file consists of just one passage from one text. The files were produced as transcripts from microfilm versions of the original texts9 or from facsimile editions accessible through Early English Books Online. Emendations were kept at a minimum; only obvious spelling errors were corrected. Care was taken to include passages from different parts of the texts. The files will also be referred to by abbreviations of the names of their authors. The following texts, which date from between 1661 and 1691, were used:
Nehemiah Grew: Experiments in Consort of the Luctation arising from the Affusion of several Menstruums upon all sorts of Bodies (1678) [= Grew]
Robert Hooke: An attempt for the explication of phaenomena (1661) [= Hook]
Henry Power: Experimental Philosophy (1663) [= Pow]
Robert Boyle: A Continuation of New Experiments (1669) [= Boyle]
Sir Kenelm Digby: Chymical Secrets (1683) [= Dig]10
George Sinclair: Hydrostatical Experiments (1672) [= Sincl]
Hugh Gregg: Curiosities in Chymistry (1691) [= Gregg]
Thibaut, P.: The art of chymistry (1675) [= Thib]
John Wallis: Discourse of Gravity and Gravitation (1675) [= Wall]
John Evelyn: A Philosophical Discourse of Earth (1676) [= Eve]
All texts were published as book-length studies, and they deal with topics of the fields of physics and chemistry.
Their authors had close connections with the Royal Society (RS); this is why the files will be referred to as RS-files. Most authors were fellows of the RS. Sinclair was an active correspondent, whose observations can also be found in the PTRS. Thibaut’s book was originally written in French, but the English translation was made by a fellow of the RS. The whole corpus contains about 41,000 words, 20,751 words coming from the ARCHER sub-corpus and 21,019 words from the RS-subcorpus. Its structure is shown in Table 1.
Table 1: Structure and size of the corpus

ARCHER file   Size (words)   RS file   Size (words)
Ano1          2,054          Grew      2,028
Leew          2,127          Hook      2,026
A.I.          2,042          Pow       2,050
Ano2          2,083          Boyle     2,012
Beal          2,056          Dig       2,024
B.R.          2,057          Sincl     2,168
Hook          1,899          Gregg     2,198
Hugy          2,001          Thib      2,281
Leib          2,181          Wall      2,082
Ray           2,251          Eve       2,150

3.2 Research method and linguistic features
The corpus was analysed with the method of multidimensional analysis (MD analysis); it assumes that
- texts are characterized not by one, but by a combination of several communicative functions,
- these functions can be described as dimensions of variation,
- the dimensions of variation can be derived from the co-occurrence patterns of linguistic features,
- the co-occurrence patterns can be automatically produced by a statistical program as the output of a factor analysis.
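What ‘co-occurrence patterns’ means in practice can be illustrated with a minimal sketch. It is not a factor analysis; it only computes the pairwise correlations between feature frequencies across texts, which is the kind of information a factor analysis condenses into dimensions. The feature names and the four rows of per-1,000-word rates are hypothetical toy values, not data from the corpus.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of rates."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical rates per 1,000 words in four texts (toy values).
rates = {
    "past_tense":           [40.0, 35.0, 12.0, 10.0],
    "third_person_pronoun": [30.0, 28.0,  9.0, 11.0],
    "present_tense":        [15.0, 20.0, 45.0, 50.0],
}

# Features that rise and fall together correlate positively; features
# that are in complementary distribution correlate negatively.
corr = {(a, b): pearson(rates[a], rates[b]) for a in rates for b in rates}
```

In this toy example past tense and third person pronouns pattern together and are both negatively related to present tense, which is the kind of grouping that a narrative dimension would capture.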
The input for the factor analysis are frequency counts of the features which are assumed to determine the functional profile of the texts under consideration. Strictly speaking, each new MD analysis should start with a factor analysis. In practice, however, apart from a couple of noteworthy exceptions (Taavitsainen 1993, Biber 2001), it has become customary to take over the dimensions of variation established in Biber (1988) for PDE, but the linguistic features which are counted are adapted to the research interests of the respective linguists and to practical necessities (Biber/Finegan 1997, González-Álvarez/Pérez-Guerra 1998, Atkinson 1999). This is the strategy followed here, too. The linguistic features were counted in all 20 files, and the absolute frequencies were normalised on the basis of 1,000 words to avoid skewing of the results by the different sizes of the individual files. The mean frequencies of all features were calculated, and their standard deviations in both sub-corpora, i.e. in the ARCHER-files and in the RS-texts, were established. Then all frequencies were standardized to a mean of 0.0 and a standard deviation of 1.0. Text dimension scores were computed on all dimensions as the sums of the standardized frequencies of the features, and their means yielded the genre dimension scores for the two sub-corpora.11 The inventory of linguistic features which were counted comprises the following 17 elements:
- verb forms: present tense, past tense, finite perfect aspect forms, be as main verb, passive verbal syntagms;
- modal auxiliaries: possibility, prediction, and necessity modals;
- pronouns: first, second, and third person forms of personal pronouns, reflexive pronouns and possessive determiners;
- nominal postmodifiers: relative clauses, past participle clauses;
- adverbial subordinate clauses: conditional clauses, all other subordinate clauses.
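The computational steps described before the feature list (normalisation per 1,000 words, standardisation, summation per dimension, averaging per sub-corpus) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the reference means and standard deviations, and the counts are hypothetical, and the reference values are simply passed in, since the choice of reference population is not fixed by this sketch.

```python
from statistics import mean

def per_thousand(count, n_words):
    """Normalise an absolute feature count to a rate per 1,000 words."""
    return 1000.0 * count / n_words

def z_score(rate, ref_mean, ref_sd):
    """Standardise a rate against a reference mean and standard deviation."""
    return (rate - ref_mean) / ref_sd

def text_dimension_score(rates, reference, features):
    """Sum the standardised rates of all features counted on one dimension."""
    return sum(z_score(rates[f], *reference[f]) for f in features)

def genre_dimension_score(text_scores):
    """The genre dimension score is the mean of the text dimension scores."""
    return mean(text_scores)

# Features counted on dimension 2 (narrativity) in this study.
DIM2 = ("past_tense", "perfect_aspect", "third_person_pronoun")

# Hypothetical reference values: feature -> (mean, standard deviation).
reference = {
    "past_tense": (30.0, 10.0),
    "perfect_aspect": (8.0, 4.0),
    "third_person_pronoun": (25.0, 5.0),
}

# Hypothetical absolute counts in one file of 2,000 words.
counts = {"past_tense": 80, "perfect_aspect": 16, "third_person_pronoun": 60}
rates = {f: per_thousand(c, 2000) for f, c in counts.items()}

score = text_dimension_score(rates, reference, DIM2)    # 1.0 + 0.0 + 1.0
genre = genre_dimension_score([score, -1.0, 0.5])       # mean of three toy texts
```

Because every file is first reduced to rates per 1,000 words, files of slightly different length (1,899 to 2,281 words in the present corpus) contribute on the same scale.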
Since the texts of the corpus were written in the 17th century, the realisation of the investigated features was not necessarily identical with that of the corresponding features in PDE (e.g. the feature ‘second person pronoun’ is realised by thou and by you, perfect aspect forms contain the auxiliary have or the auxiliary be). EModE relative constructions required a different treatment from PDE relative constructions. In Biber’s PDE model, relative clauses introduced by that figure on the dimension which marks an overt expression of persuasion, whereas relative clauses introduced by wh-pronouns are treated as markers of elaborate reference. In the 17th century the distribution of these two types of relative markers did not yet follow the PDE rules, and therefore all relative clauses were interpreted as markers of elaborate reference.
3.3 The results
On dimension 1, which measures the degree of involvement and interaction of a text, the following features were counted: present tense verb, second person pronoun, first person pronoun, main verb be, and possibility modal. Table 2 contrasts text and genre dimension scores of the sub-corpora.

Table 2: Text and genre dimension scores of the sub-corpora on dimension 1

ARCHER file   Text dim.   RS file     Text dim.
Ano1          -0.51       Grew        -3.16
Leew          -2.84       Hook         0.06
A.I.           2.20       Pow         -1.79
Ano2          -3.60       Boyle       -2.41
Beal           3.20       Dig         -2.11
B.R.           0.97       Sincl        0.48
Hook           2.05       Gregg       -1.52
Hugy          -2.42       Thib         0.15
Leib          -1.50       Wall        -0.26
Ray           -0.71       Eve         -0.32
Genre dim.    -0.32       Genre dim.  -1.09
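The summary figures for dimension 1 can be recomputed from the text dimension scores of Table 2. The following sketch (the variable and function names are merely illustrative) takes the genre dimension score as the mean of the ten text dimension scores and the spread of a sub-corpus as the difference between its highest and lowest text dimension score.

```python
from statistics import mean

# Text dimension scores on dimension 1, copied from Table 2.
archer_dim1 = {"Ano1": -0.51, "Leew": -2.84, "A.I.": 2.20, "Ano2": -3.60,
               "Beal": 3.20, "B.R.": 0.97, "Hook": 2.05, "Hugy": -2.42,
               "Leib": -1.50, "Ray": -0.71}
rs_dim1 = {"Grew": -3.16, "Hook": 0.06, "Pow": -1.79, "Boyle": -2.41,
           "Dig": -2.11, "Sincl": 0.48, "Gregg": -1.52, "Thib": 0.15,
           "Wall": -0.26, "Eve": -0.32}

def genre_score(scores):
    """Mean of the text dimension scores, rounded to two decimals."""
    return round(mean(list(scores.values())), 2)

def score_range(scores):
    """Difference between the highest and the lowest text dimension score."""
    values = list(scores.values())
    return round(max(values) - min(values), 2)

archer_genre, rs_genre = genre_score(archer_dim1), genre_score(rs_dim1)
archer_range, rs_range = score_range(archer_dim1), score_range(rs_dim1)
negative_archer = sum(1 for v in archer_dim1.values() if v < 0)
negative_rs = sum(1 for v in rs_dim1.values() if v < 0)
```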
Six of the ARCHER-files have negative dimension scores, and they are not counterbalanced by high positive dimension scores of the other files. Consequently, the average dimension score of -0.32, which characterizes these texts as a whole and is usually referred to as ‘genre dimension score’, situates them slightly below the dividing-line between involved and informational. Since even 7 RS-texts have negative dimension scores, it is to be expected that the
absolute value of the genre dimension score is higher. The genre dimension score of -1.09 indicates that the RS-texts are more informational than the others. The individual text dimension scores also reveal that the RS-texts are linguistically much more homogeneous than the ARCHER-files. Homogeneity is given when the range of text dimension scores, i.e. the difference between the highest and lowest text dimension score, is low. On this dimension, the range of the ARCHER-files is 6.80, and this contrasts sharply with 3.64 for the RS-texts. Biber (1988: 171) suggests two interpretations of high ranges in a genre; either the genre contains several sub-genres (e.g. the category ‘academic prose’ in his PDE corpus, which contains the sub-genres natural science, medical, mathematics, social science, politics/education, humanities, and technology/engineering), or it is not well-defined (e.g. the category ‘conversation’ in his PDE corpus). A comparison of the extreme values among the text dimension scores yields a further interesting result. The lowest score of an ARCHER-file (Ano2 = -3.60) is lower than the lowest score of an RS-text (Grew = -3.16), and the highest score of an ARCHER-file (Beal = 3.20) is higher than the highest score of an RS-text (Sincl = 0.48). This constellation means that the ARCHER-subcorpus covers more variation patterns than the RS-subcorpus. On dimension 2, which measures the degree of narrativity, the features past tense verb, perfective verb (perfect aspect forms of verbs), and third person pronoun were counted. Text and genre dimension scores are entered in Table 3.

Table 3: Text and genre dimension scores of the sub-corpora on dimension 2

ARCHER file   Text dim.   RS file     Text dim.
Ano1          -1.05       Grew        -5.25
Leew          -1.27       Hook        -0.94
A.I.           0.93       Pow         -3.11
Ano2           4.41       Boyle       -3.63
Beal           2.72       Dig         -0.95
B.R.           3.61       Sincl       -6.38
Hook          -1.09       Gregg       -2.64
Hugy           4.91       Thib        -3.90
Leib           0.71       Wall        -4.65
Ray            0.45       Eve         -4.11
Genre dim.     1.43       Genre dim.  -3.56
Here the difference between the two genre dimension scores is even bigger. The value 1.43 for the ARCHER-files indicates a small degree of narrativity, whereas the genre dimension score of -3.56 situates the RS-texts considerably below the dividing-line between narrative and non-narrative texts. All RS-texts have a negative text dimension score. The range of text dimension scores is again bigger for the ARCHER-files (6.18) than for the RS-texts (5.44). As on dimension 1, the ARCHER-files cover more variation patterns than the RS-texts. But it is important to note that on dimension 2 the range of the RS-texts does not lie within that of the ARCHER-files, but that the ranges overlap. The lowest text dimension score of all files (-6.38) comes from the RS-text Sincl, the highest (4.91) from the ARCHER-file Hugy. This yields an overall range of 11.29. The linguistic features which were counted on dimension 3 as indicators of the degree of explicit/elaborate reference were relative clauses of three types:
relative clauses with the relative marker in subject position, relative clauses with the relative marker in object position, and pied piping constructions. Zero-introduced relative clauses were left out of consideration. Table 4 contains the text and genre dimension scores of the sub-corpora.

Table 4: Text and genre dimension scores of the sub-corpora on dimension 3

ARCHER file   Text dim.   RS file     Text dim.
Ano1          -0.41       Grew        -2.88
Leew          -2.03       Hook        -0.35
A.I.           0.84       Pow         -1.05
Ano2           3.88       Boyle       -1.12
Beal           0.75       Dig          2.64
B.R.           3.56       Sincl        0.13
Hook          -1.73       Gregg       -0.90
Hugy          -1.37       Thib        -2.16
Leib           0.16       Wall        -3.94
Ray            0.60       Eve          1.67
Genre dim.     0.42       Genre dim.  -0.80
The text dimension scores of the individual texts yield a positive genre dimension score for the ARCHER-subcorpus (0.42) and a negative value for the RS-subcorpus (-0.80). Consequently, the ARCHER-texts are marked by a small degree of elaborate/explicit reference, whereas the reference system in the RS-texts is more situation-dependent. With respect to their reference system, the degree of heterogeneity does not differ much in the two sub-corpora; the range in the ARCHER-subcorpus is 5.91, that in the RS-subcorpus is 6.58. But for the first time it is the RS-subcorpus which covers more variation patterns than the ARCHER-subcorpus. As on dimension 2, the two ranges overlap. The lowest value (-3.94) comes from the RS-text Wall, the highest (3.88) from the ARCHER-file Ano2. Dimension 4 measures the degree of open persuasion. Here the following features were counted: prediction modal (the modal auxiliaries will, shall, would), conditional subordination (conditional clauses), and necessity modal (the modal auxiliaries must and should). The usual figures are given in Table 5.

Table 5: Text and genre dimension scores of the sub-corpora on dimension 4

ARCHER file   Text dim.   RS file     Text dim.
Ano1          -1.50       Grew        -2.28
Leew           0.28       Hook         1.16
A.I.           0.26       Pow          2.59
Ano2          -0.72       Boyle       -0.38
Beal          -4.34       Dig          1.55
B.R.          -1.82       Sincl        1.25
Hook          -2.24       Gregg        0.38
Hugy          -3.67       Thib         2.29
Leib          -2.72       Wall         3.93
Ray            0.41       Eve         -2.39
Genre dim.    -1.61       Genre dim.   0.81
The genre dimension score of -1.61 for the ARCHER-files is to be interpreted as a very low degree of open persuasion, whereas their genre dimension score of 0.81 situates the RS-texts slightly above the baseline of the scale of persuasion. As on dimension 3, the range of the text dimension scores is smaller in the ARCHER-subcorpus (4.75) than in the RS-subcorpus (6.32). Consequently, the RS-texts cover more variation patterns. The ranges overlap as on dimensions 2 and 3; the lowest value (-4.34) comes from the ARCHER-file Beal, the highest (3.93) from the RS-text Wall.
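The relative positions of the two ranges on dimensions 1 to 4 can be summarised mechanically. The sketch below uses only the extreme text dimension scores reported above; the function `relation` and the labels it returns are illustrative names, not terminology of the MD approach.

```python
def relation(first, second):
    """Classify how two closed intervals (lowest, highest) relate."""
    (a_lo, a_hi), (b_lo, b_hi) = first, second
    if a_lo <= b_lo and b_hi <= a_hi:
        return "first contains second"
    if b_lo <= a_lo and a_hi <= b_hi:
        return "second contains first"
    if a_hi < b_lo or b_hi < a_lo:
        return "disjoint"
    return "partial overlap"

# (lowest, highest) text dimension score: dimension -> (ARCHER, RS).
extremes = {
    1: ((-3.60, 3.20), (-3.16, 0.48)),
    2: ((-1.27, 4.91), (-6.38, -0.94)),
    3: ((-2.03, 3.88), (-3.94, 2.64)),
    4: ((-4.34, 0.41), (-2.39, 3.93)),
}

summary = {}
for dim, (archer, rs) in extremes.items():
    summary[dim] = {
        "archer_range": round(archer[1] - archer[0], 2),
        "rs_range": round(rs[1] - rs[0], 2),
        "relation": relation(archer, rs),
        # Span covered by both sub-corpora taken together.
        "overall_range": round(max(archer[1], rs[1]) - min(archer[0], rs[0]), 2),
    }
```

On dimension 1 the ARCHER range contains the RS range, whereas on dimensions 2 to 4 the two ranges only partially overlap, so that each sub-corpus contributes scores which the other lacks.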
Dimension 5 measures the degree of abstractness or impersonality. On this dimension the following features were counted: passive verbal syntagms, nominal postmodifiers realized by past participle constructions, and adverbial subordinate clauses except conditional clauses (they are considered a feature of dimension 4). Table 6 contains the relevant figures.

Table 6: Text and genre dimension scores of the sub-corpora on dimension 5

ARCHER file   Text dim.   RS file     Text dim.
Ano1           1.67       Grew        -0.18
Leew          -1.20       Hook        -1.14
A.I.          -1.67       Pow         -1.31
Ano2          -0.03       Boyle        1.50
Beal          -1.30       Dig          0.31
B.R.           0.28       Sincl        0.84
Hook           0.98       Gregg        3.85
Hugy           1.10       Thib         1.94
Leib          -1.42       Wall         0.88
Ray           -2.42       Eve         -3.44
Genre dim.    -0.40       Genre dim.   0.32
The genre dimension score of the ARCHER-files (-0.40) situates them just below the dividing-line between abstract and non-abstract texts, whereas the value for the RS-texts (0.32) situates them slightly above the dividing-line on the abstract side. As on dimensions 3 and 4, the RS-texts show a greater range of text dimension scores (7.29) than the ARCHER-files (4.09). A comparison of the respective positions of the two ranges yields a mirror image of dimension 1. On dimension 5, both the lowest value (-3.44) and the highest value (3.85) come from RS-texts.
4. Interpretation of the results of the analysis
The results of the analysis presented in the preceding section show that the two sub-corpora differ in the following four parameters: the extreme values of the text dimension scores, the position of the range of these values on the dimension scales, the genre dimension scores and their position on the dimension scales. These parameters are illustrated in Figure 1. On dimension 1, all text dimension scores of the RS-texts lie within the range of the ARCHER-subcorpus, and both sub-corpora have genre dimension scores which mark them as more informational than involved. Consequently, the control corpus did not yield additional variation patterns, and the ARCHER-subcorpus with its wider range represents the genre ‘science’ better than the RS-subcorpus.
Figure 1: Ranges of text dimension scores and genre dimension scores
It is interesting to note that the files with the lowest and the highest text dimension scores, Ano2 and Beal, represent different sub-genres. Both files contain one text only; Ano2 is an account of discoveries made during expeditions in search of a north passage to China and Japan, whereas Beal is a letter to the editor of the PTRS, in which the author describes his observations about frosts in Scotland. In the last paragraph of this letter, which does not form part of the file, the author describes his own language like this: “But you must expect no other language, or composure, than what comes first to a running pen, and agrees with rusticities; for which I have more affections, than spare minutes to offer to you.” (PTRS vol. 10, p. 367) The text of the letter is in line with this description; it is a first person report with many possibility modals, mostly written in present tense. This is the beginning of the letter: “It may seem, by the curious Remarks sent to you from Scotland that we are yet to seek out the Causes and original Source, as well as the Principles and Nature, of Frosts. I wish, I were able to name all circumstances that may be causative of Frosts, Heats, Winds, and Tempests. I know by experience, that the situation of the place is considerable for some of these; but after much diligence and troublesome researches, I cannot define the proximity or distance, not all the requisites, that ought to be concurrent for all the strange effects I have observ’d in them.” First and second person pronouns as well as possibility modals are absent from the beginning of Ano2. The prevalent tenses are past and present perfect.
“It is sufficiently known to those who have made any inspection into the Navigation of this and the former Age, how studiously and sollicitously the Lords of the United Netherlands have, for these eighty years and more, laboured to encourage those that should first discover a more compendious and shorter passage by the North to China, Japon, and other Oriental countries. But those who first adventured upon this Enterprize, found by sad experience, that the success answered not their expectation and hopes: whose calamitous encounters I shall not go about to recite, since their own Narratives have run through most hands.”
The situation on dimension 5 is the mirror image of dimension 1. All text dimension scores of the ARCHER-files lie within the range of the RS-subcorpus. Since additionally the genre dimension scores of the two sub-corpora are situated on different sides of the dividing-line between abstract and non-abstract texts, the control corpus proved particularly helpful in establishing the linguistic structure of science texts. It represents the genre more adequately than the ARCHER sub-corpus. Unlike on dimension 1, the texts with the lowest and the highest text dimension scores cannot be attributed to different sub-genres of scientific writing. Gregg is an account of the chemical properties of natural substances, and Eve describes the consistency of several geological layers. The following extract from Gregg shows that the high score for abstractness is mainly due to the large number of passive constructions:
“If you pour Spirit of Salt, by degrees, upon a Lee of Salt of Tartar, (or of any other Alcalisate Salt, ) ‘till it be almost satiated, (which is known by the abating of the Effervescence, ) you shall observe a kind of Earth precipitate out of the fixt Salt, (namely because, upon the mutual conflict, between an Acid and an Alcali, whatsoever heterogeneous substance is contained in either of them uses to precipitate.) The Earthy part of the Salt of Tartar being thus separated, the saline part is thereby render’d Volatile, and would actually fly away, were it not for the Acid that fixes it anew: and if you separate this Acid, by the addition of new Salt of Tartar, it will by this means be set at liberty, and strike your Nostrils with an Urinous odour.”
In Eve, by contrast, the properties of the individual substances are foregrounded, and passive constructions are very rare.
This results in a less abstract style, as is illustrated by the following extract:
“But all Sand does easily admit of Heat and Moisture, and yet for that not much the better; for either it dismisses and lets them pass too soon, and so contracts no ligature; or retains it too long; especially where the bottom is of Clay, by which it parches, or chills, producing nothing but Moss, and disposes to Cancerous infirmities: But if, as sometimes it fortunes, that the Sand have a surface of more genial mould, and a fund of Gravel or loose stone; though it do not long maintain the virtue it receives from Heaven; yet it produces as forward springing, and is parent of sweet Grass, which, though soon burnt up in dry weather, is as soon recover’d, with the first rain that falls.”
The interpretation of the results on the other three dimensions is less straightforward. The ranges on these dimensions overlap, and where the genre dimension scores are positive for one sub-corpus they are negative for the other and vice versa. The obvious conclusion is that on these dimensions neither subcorpus is sufficiently representative of English scientific writing in the 17th century. The combination of the two sub-corpora would be a promising move in the direction towards a more representative corpus of 17th century scientific writing. Its degree of representativeness would then have to be tested by comparing it with still other texts of the same register, complete representativeness being reached when no new variation patterns are discovered.
5. Summary and conclusion
In this paper I argued that although the existing multi-purpose corpora cannot be considered representative samples, representativeness should nevertheless be an aim of corpus builders. This claim rests on the assumption that representativeness is not an either/or quality, but that corpora can be placed on a scale of representativeness. Then I tested the hypothesis that the 17th century PTRS texts of the register ‘science’ in ARCHER are representative of scientific writing of that period. I analysed the 10 ARCHER-files and a control corpus of the same size with the method of MD-analysis. The range of text dimension scores and their position on the dimension scales as well as the genre dimension scores and their position on the respective scales were established. These parameters were used to measure the degree of representativeness of the two sub-corpora. On dimension 1, the ARCHER sub-corpus proved more representative of English scientific writing in the 17th century, whereas on dimension 5 the control corpus showed a higher degree of representativeness. Only on dimension 1 did the lowest and highest text dimension scores correlate with different sub-genres of the genre science. On dimensions 2-4 neither subcorpus reached a sufficient degree of representativeness. The results of the present study strongly suggest the compilation of a more representative corpus of 17th century scientific writing, which should contain texts from diverse sources. It seems that the situational parameter ‘format’ with the dichotomy “published vs non-published and various formats within ‘published’” (Biber 1993: 245) is of particular relevance here.
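The comparison summarised above rests on two quantities per sub-corpus and dimension: the range of text dimension scores and the genre dimension score. A minimal Python sketch of such a comparison is given here. It assumes, following the procedure of Biber (1988), that the genre dimension score is the mean of the text dimension scores; the function names and all score values are hypothetical illustrations, not figures from this study.

```python
from statistics import mean


def summarise(scores):
    """Return (lowest score, highest score, genre score) for one sub-corpus."""
    # Assumption: the genre dimension score is the mean of the text scores.
    return min(scores), max(scores), mean(scores)


def range_within(inner, outer):
    """True if every score in `inner` lies inside the range spanned by `outer`."""
    return min(outer) <= min(inner) and max(inner) <= max(outer)


def opposite_sides(scores_a, scores_b):
    """True if the two genre scores fall on different sides of zero."""
    return mean(scores_a) * mean(scores_b) < 0


# Hypothetical text dimension scores on one dimension (not data from the study).
archer = [-1.5, 0.5, 2.5]
control = [-4.0, -1.0, 3.5]

print(summarise(archer))
print(range_within(archer, control))
print(opposite_sides(archer, control))
```

With values of this kind, one sub-corpus can be checked against the other dimension by dimension before the two are combined.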
Notes
1 The Brown Corpus categories A-C (Press: Reportage, Press: Editorial, Press: Reviews) correspond to the ARCHER category ‘News’, the Brown Corpus category J (Learned) is split up in ARCHER into the categories ‘Legal opinion’, ‘Medicine’, and ‘Science’. The Brown Corpus categories E-G (Skills and Hobbies; Popular Lore; Belles Lettres, Biography, Memoirs, etc.) have no clearly recognizable counterparts in ARCHER, and the registers ‘Journals/Diaries’ and ‘Letters’ are absent from the Brown Corpus.
2 “Absolute representativeness is an unattainable ideal” (2004: 214).
3 Leech is more explicit towards the end of the same article when he writes: “... the absolute goal of representativeness is not attainable in practical circumstances” (2007: 140). It is interesting though that he argues in terms of practical circumstances, not on the basis of theoretical considerations.
4 This position was taken in Váradi (2001) and reported by Leech (2007: 136); cf. also Rieger (1979).
5 This is a problematic goal in itself, because the PTRS were the first and for about 200 years the only scientific journal in England (Kronick 1962: 6, Lambert 1985: 9).
6 cf. Moessner, forthcoming.
7 I wish to thank Christian Mair (Freiburg) for letting me have access to the ARCHER-files.
8 The following abbreviations are used: Ano1 (anonymous, PTRS 9), Leew (Antony van Leewenhoeck), A.I. (initials of an unidentifiable author), Ano2 (anonymous, PTRS 10), Beal (John Beal), B.R. (initials of an unidentifiable author), Hook (Robert Hooke), Hugy (Christian Huygens), Leib (Gottfried Wilhelm Leibniz), Ray (John Ray).
9 When the Boyle file was produced, the Hunter/Davis edition of Boyle’s works was not accessible. In the meantime the file has been collated with the edited text.
10 Digby’s book was published posthumously by his assistant George Hartman.
11 For a detailed description of these procedural steps cf. Biber 1988: 93-97.
Primary sources
ARCHER. A Representative Corpus of Historical English Registers.
Boyle, Robert (1669), A Continuation of New Experiments Physico-Mechanical, Touching the Spring and Weight of the Air, and their Effects. Oxford: Henry Hall.
Digby, Sir Kenelm (1683), Chymical Secrets, and rare Experiments in Physick and Philosophy. London: Will. Cooper.
Evelyn, John (1676), A Philosophical Discourse of Earth, Relating to the Culture and Improvement of it for Vegetation and the Propagation of Plants, etc. as it was presented to the Royal Society, April 29, 1675. London: John Martyn.
Gregg, Hugh (1691), Curiosities in Chymistry: Being new Experiments and Observations Concerning the Principles of Natural Bodies. London: Stafford Anson.
Grew, Jeremiah (1678), Experiments in Consort of the Luctation arising from the Affusion of several Menstruums upon all sorts of Bodies. London: John Martyn.
Hall, Marie Boas (ed.) (1966), Experimental Philosophy, in Three Books: Containing New Experiments Microscopical, Mercurial, Magnetical, by Henry Power. New York/London: Johnson Reprint Corporation.
Hooke, Robert (1661), An Attempt for the Explication of the Phænomena, Observable in an Experiment Published by the Honourable Robert Boyle. London: Sam. Thomson.
Sinclair, George (1672), The hydrostaticks, or, The weight, force, and pressure of fluid bodies, made evident by physical, and sensible experiments by G.S. Edinburgh: George Swintoun, James Glen, and Thomas Brown.
Thibaut, P. (1675), The Art of Chymistry: as it is now Practised. Written in French By P. Thibaut, Chymist to the French King. And now Translated into English, By A Fellow of the Royal Society. London: John Starkey.
Wallis, John (1675), A Discourse of Gravity and Gravitation, Grounded on Experimental Observations: Presented to the Royal Society, November 12, 1674. London: John Martyn.
References
Atkinson, D. (1999), Scientific Discourse in Sociohistorical Context. The Philosophical Transactions of the Royal Society of London, 1675-1975. London/Mahwah, NJ: Lawrence Erlbaum.
Biber, D. (2006), University Language: A Corpus-based study of spoken and written registers. Amsterdam/Philadelphia: John Benjamins.
Biber, D. (2001), ‘Dimensions of variation among 18th-century speech-based and written registers’, in: H. Diller and M. Görlach (eds.), Towards a History of English as a History of Genres. Heidelberg: Winter. 89-109.
Biber, D. (1993), ‘Representativeness in Corpus Design’. Literary and Linguistic Computing 8: 241-57.
Biber, D. (1988), Variation across speech and writing. Cambridge: CUP.
Biber, D. and J. Kurjian (2007), ‘Towards a taxonomy of web registers and text types: a multidimensional analysis’, in: M. Hundt, N. Nesselhauf and C. Biewer (eds.), Corpus Linguistics and the Web. Amsterdam/New York, NY: Rodopi. 109-31.
Biber, D., S. Conrad and R. Reppen (1998), Corpus linguistics. Investigating language structure and use. Cambridge: CUP.
Biber, D. and E. Finegan (1997), ‘Diachronic Relations among Speech-Based and Written Registers in English’, in: T. Nevalainen and L. Kahlas-Tarkka (eds.), To Explain the Present: Studies in the Changing English Language in Honour of Matti Rissanen. Helsinki: Société Néophilologique. 253-76.
Biber, D., E. Finegan and D. Atkinson (1994), ‘ARCHER and its challenges: Compiling and exploring a representative corpus of historical English registers’, in: U. Fries, G. Tottie and P. Schneider (eds.), Creating and Using English Language Corpora. Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zürich 1993. Amsterdam/Atlanta, GA: Rodopi. 1-13.
Birch, T. (1756 [1968]), The History of the Royal Society of London for Improving of Natural Knowledge, 4 vols. London: Millar [facsimile reprint Hildesheim: Olms].
Brown Corpus Manual http://khnt.hit.uib.no/icame/manuals/brown
Francis, W.N. (1979), ‘Problems of assembling and computerizing large corpora’, in: H. Bergenholtz and B. Schaeder (eds.), Empirische Textwissenschaft: Ausbau und Auswertung von Text-Corpora. Königstein: Scriptor. 110-23.
González-Álvarez, D. and J. Pérez-Guerra (1998), ‘Texting the written evidence: On register analysis in late Middle English and early Modern English’. Text 18 (3): 321-48.
Gotti, M. (2006), ‘Disseminating Early Modern Science: Specialized News Discourse in the Philosophical Transactions’, in: N. Brownlees (ed.), News Discourse in Early Modern Britain. Selected Papers from CHINED 2004. Bern: Peter Lang. 41-70.
Johansson, S. (in collaboration with G. Leech and H. Goodluck) (1978), Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English for use with digital computers. Oslo: University of Oslo, Department of English.
Kronick, D.A. (1962), A History of Scientific and Technical Periodicals. New York: The Scarecrow Press.
Kytö, M. (comp.) (1996), Manual to the diachronic part of the Helsinki Corpus of English Texts. 3rd ed. Helsinki: University of Helsinki, Department of English.
Kytö, M. and M. Rissanen (1993), ‘General Introduction’, in: M. Rissanen, M. Kytö and M. Palander-Collin (eds.), Early English in the Computer Age. Explorations through the Helsinki Corpus. Berlin/New York: Mouton de Gruyter. 1-17.
Labov, W. (1966), The Social Stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.
Lambert, J. (1985), Scientific and Technical Journals. London: Clive Bingley.
Leech, G. (2007), ‘New resources, or just better old ones? The Holy Grail of representativeness’, in: M. Hundt, N. Nesselhauf and C. Biewer (eds.), Corpus Linguistics and the Web. Amsterdam/New York, NY: Rodopi. 133-49.
Meyer, C. (2002), English Corpus Linguistics. An introduction. Cambridge: CUP.
Moessner, L. (2009), ‘The Influence of the Royal Society on 17th-Century Scientific Writing’. ICAME Journal 33.
Mukherjee, J. (2004), ‘The state of the art in corpus linguistics: three book-length perspectives’, English Language and Linguistics 8 (1): 103-19.
Rieger, B. (1979), ‘Repräsentativität: von der Unangemessenheit eines Begriffs zur Kennzeichnung eines Problems linguistischer Korpusbildung’, in: H. Bergenholtz and B. Schaeder (eds.), Empirische Textwissenschaft: Ausbau und Auswertung von Text-Corpora. Königstein: Scriptor. 52-70.
Rissanen, M. (1994), ‘The Helsinki Corpus of English Texts’, in: M. Kytö, M. Rissanen and S. Wright (eds.), Corpora Across the Centuries. Proceedings of the First International Colloquium on English Diachronic Corpora. St. Catharine’s College Cambridge, 25-27 March 1993. Amsterdam/Atlanta, GA: Rodopi. 73-79.
Sand, A. and R. Siemund (1992), ‘LOB - 30 years on ...’. ICAME Journal 16: 119-22.
Taavitsainen, I. (1993), ‘Genre/subgenre styles in Late Middle English?’, in: M. Rissanen, M. Kytö and M. Palander-Collin (eds.), Early English in the Computer Age: Explorations through the Helsinki Corpus. Berlin/New York: Mouton de Gruyter. 171-99.
The Royal Society of London. Philosophical Transactions, Volume 10 (1675) [Facsimile reprint 1963]. New York: Johnson Reprint Corporation and Kraus Reprint Corporation.
Trudgill, P. (1974), The Social Differentiation of English in Norwich. Cambridge: University Press.
Váradi, T. (2001), ‘The linguistic relevance of corpus linguistics’, in: P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds.), Proceedings of the Corpus Linguistics 2001 Conference. Lancaster University: UCREL Technical Papers 13. 587-93.
A multi-dimensional analysis of a learner corpus
Bertus van Rooy and Lize Terblanche
North-West University, Vaal Triangle Campus, South Africa
Abstract
The present study reports on a multi-dimensional analysis (Biber, 1988) of the Tswana Learner English (TLE) corpus, together with the Louvain Corpus of Native English Essays (LOCNESS). A new multidimensional model is extracted, since the similarities between nativeness and non-nativeness mask differences between linguistic features to such an extent that it is not possible to come to a complete understanding of such differences using the standard 1988 model. A basic five factor model was extracted. Dimension 1 can be taken to capture advanced literacy, specifically as far as complex noun phrase structure is concerned, with the function of expressing information densely. Dimension 2 can be regarded as an indication of transparency and Dimension 3 captures a range of informal style features. The features that group together as Dimension 4 represent a style of writing that is more nuanced and precise and as a provisional label, we propose contextualisation of information. Dimension 5 can be regarded as the persuasive dimension in student writing, a feature that has been identified as a very important characteristic by Biber and Grabe (1987), and also in our own study of student writing. The most striking differences between the two corpora are on Dimensions 1 and 4. LOCNESS shows more advanced literacy than the TLE, and also contextualises information more extensively than the TLE. On the other dimensions, both corpora contain essays that display the various different styles available, showing that as a register, student writing allows for some internal stylistic variation independent of whether the writers are native or non-native speakers of English. The results confirm the usefulness of the multidimensional model, particularly to the extent that a new model is extracted. Substantial overlap between some of the dimensions in this study and dimensions in other models indicates that multidimensional models are sensitive to particular kinds of feature groupings, which should be taken as evidence in favour of the general validity of this kind of approach.
1. Introduction
Previous research on English second language writing usually either focuses on ‘errors’ and non-standard features, treating New Varieties of English as imperfect versions of standard metropolitan varieties, or examines specific individual linguistic features. Relatively few studies examine broader patterns of co-occurrence of features, with notable exceptions being Nkemleke’s (2006) study of expository writing in Cameroon English and Mesthrie’s (2006) work on non-deletions in Black South African English.
To remedy this, we propose to adopt a multidimensional approach (Biber 1988) and apply it to corpora of student writing. Biber (1988) originally developed the model in an attempt to characterise differences between various spoken and written registers of English on the basis of the distribution of 67 different linguistic features. He extracted the frequencies of all these features from each of the texts that made up his corpus, before submitting the values to a factor analysis. The purpose of such an analysis is to find patterns of co-occurrence of features and to group them in factors that are assumed to reflect meaningful underlying dimensions. Biber (2006: 181) argues that the 1988 model can be applied successfully to new discourse domains. However, he acknowledges that using the 1988 model may make it impossible to describe the dimensions that are most important in a particular domain of use. This means that linguistic features can occur in particular ways in different discourse domains and that these features reflect the specialized properties of the domains (Biber, 2006: 181). Biber (2006: 181) proposes a completely new MD analysis to identify the co-occurrence patterns in a corpus when analysing a new discourse domain with many different text categories. This approach is proposed when comparing various registers. However, Biber’s (2006) argument also suggests that for any new register, it is possible that the application of the 1988 model may overlook certain unique functions of particular linguistic features. Van Rooy & Terblanche (2006) compared native and non-native student writing on Dimension 1 of Biber’s 1988 model. The differences between the two student corpora were so slight that they all but disappeared once they were compared to other registers (Van Rooy & Terblanche 2006: 178).
In an application of the entire original MD model, Van Rooy (2008) finds once again that the construct student writing is a much stronger determinant of the emerging patterns in the data than any differences between native and non-native writing. Furthermore, his analysis shows that certain dimension scores present a misleading characterisation of the non-native data. Therefore, the motivation for a new MD model is that the similarities between L1 and L2 due to both corpora representing the register student writing mask the differences to such an extent that it is not possible to come to a complete understanding of the linguistic differences between native and non-native student writing. As opposed to comparing various registers, the present study looks at the differences between native and non-native writing, while keeping the variable register constant. In this study, a new factor analysis of the data is conducted to uncover dimensions that are able to distinguish more clearly between the two corpora. We attempt to present a comprehensive characterisation of differences and similarities between second language writing and native speaker writing, serving as benchmark for future microscopic investigations into more specific aspects of such differences and similarities. The methodology is presented in the next section, followed by the results and discussion, before conclusions are offered in the final section of the article.
2. Methodology
2.1 Corpora
In order to enable comparisons between the analyses performed using Biber’s original factor analyses (reported by Van Rooy 2008) and the new MD/MF model, the same corpora are used in both studies. A corpus of student writing produced by native speakers of Setswana from South Africa and Botswana, the Tswana Learner English Corpus (TLE), is analysed and compared to a corpus of native speaker writing, the Louvain Corpus of Native English Speaking Students (LOCNESS). Both these corpora are from the International Corpus of Learner English project. In the case of the TLE, the present study analysed 383 essays with an average length of 392 words (S.D. 135). A total of 188 essays from LOCNESS were analysed, with an average length of 1075 words (S.D. 620). The essays were written on a wide range of topics. A list of suggested topics is provided by the ICLE guidelines¹, which was adjusted for the South African context in the case of the TLE. The topics for LOCNESS were more diverse, with literary and philosophical topics included alongside those more similar to the TLE². The TLE data contain essays that were written in class, but under relatively unconstrained contexts. Students had between one and two hours to plan and write. None complained of not having enough time. No reference tools were used, though. The LOCNESS essays include a proportion of exam essays that were written under strict time controls, as well as some untimed essays. While this results in a less homogeneous sample, it is the closest available as a basis for comparison with the TLE. No similar corpus of student essays from South African native speakers of English is available, and there is relatively limited contact between the students who wrote for the TLE and native speakers, thus there is very little chance of influence between the two segments of the population.
2.2 Feature extraction
All 67 features originally analysed by Biber (1988) were included in the present study. All spelling errors in the learner data were corrected before further analyses were undertaken. Subsequently, both corpora were part-of-speech tagged with the Support Vector Machine-tagger developed by Giménez and Màrquez (2004) using the Penn Treebank tagset. Due to the non-standard features frequently encountered in learner data, the majority of features were extracted using a combination of manual and automatic procedures. Features that could be extracted using lexical items only, such as the various types of adverbial subordinators, were extracted fully automatically using the TextMiner function in Statistica. In the case of all features that required part-of-speech tags, the data were inspected manually in WordSmith Tools, and only valid cases were accepted. Wherever possibilities for part-of-speech tag confusion existed, related part-of-speech tags were also extracted and inspected, in an attempt not only to ensure high precision in the classification, but also to maximise recall (in the standard information retrieval senses of the terms, see Van Rijsbergen, 1979: 144-150). Frequencies were normalised to a relative frequency per 1000 words, with the exception of the type-token ratio and average word length. In the case of the type-token ratio, the 1000-word normalisation was not feasible, because most essays contained fewer than 1000 words. Consequently, the type-token ratio was normalised to 200 words for both corpora.
2.3 Statistical analysis
Biber (1988) used factor analysis to cluster the linguistic features together in factors, which were interpreted as functional dimensions underlying the statistical factors. He extended the analysis by computing factor scores on the basis of the dimensions. The linguistic features with absolute factor loadings higher than |0.35| were included in this calculation, with the additional proviso that each feature was included only once, in the factor where it had the highest absolute loading. When new factor analyses have been undertaken since Biber’s 1988 study, e.g. by Reynolds (2005) or Biber (2006), the data are standardised in their own terms, to a mean of 0 and standard deviation of 1 for each variable. Feature 62, split infinitives, was excluded from the present study on account of the almost complete absence of data in both corpora. On the basis of the standardised data, we extracted a new factor model, using Promax rotation as is customary in the MD approach. The solution with five factors was chosen as optimal; beyond the fifth factor, it became increasingly difficult to assign a meaningful functional interpretation to the data, and the number of variables that loaded onto a factor became very small. After the factors were identified, dimension scores on the new factors were calculated for both corpora. Possible differences in means between the two corpora were assessed using Cohen’s (1969) d-statistic, which is calculated here by dividing the absolute difference in means by the larger of the two standard deviations of the two corpora. Cohen (1969: 18-25) provides guidelines for interpreting the effect size obtained in this manner. For the purposes of this paper, where the focus is on the most salient differences between the corpora, the interpretation of the data will concentrate on large effect sizes (d>0.8). Cohen (1969: 25) indicates that these differences correspond to ‘grossly perceptible’ differences, such as the difference in height between 13- and 18-year old girls, or IQ differences between PhD graduates and typical first year university students.
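The effect-size measure described above can be sketched in a few lines of Python. The sketch follows the definition given in this section (absolute difference in means divided by the larger of the two standard deviations), which is not the pooled-standard-deviation form in which Cohen's d is often presented. The function name, the use of the sample standard deviation, and the score values are assumptions made for illustration only.

```python
from statistics import mean, stdev

LARGE_EFFECT = 0.8  # threshold for a large effect size, as used in this paper


def effect_size_d(scores_a, scores_b):
    """Absolute difference in means divided by the larger standard deviation."""
    difference = abs(mean(scores_a) - mean(scores_b))
    # Assumption: sample standard deviation; the text does not specify which.
    larger_sd = max(stdev(scores_a), stdev(scores_b))
    return difference / larger_sd


# Hypothetical dimension scores for two corpora (not data from the study).
corpus_a = [1.0, 2.0, 3.0]  # mean 2, standard deviation 1
corpus_b = [4.0, 6.0, 8.0]  # mean 6, standard deviation 2

d = effect_size_d(corpus_a, corpus_b)
print(d, d > LARGE_EFFECT)
```

Dividing by the larger standard deviation gives a more conservative value than dividing by the smaller one, so a difference that passes the 0.8 threshold here would also pass it under that alternative.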
3. Results
3.1 The factor model
The basic five factor model, with the linguistic features loading onto each factor, is presented in Table 1. Only variables with absolute loadings of 0.3 and higher are included, and variables are only included in the factor where they have the highest absolute value. These dimensions can be interpreted as follows. Dimension 1 can be taken to capture advanced literacy. It includes two of the three typical feature clusters Biber (2006: 186) associates with literacy in contrast with orality: complex structures in noun phrases and information density. Differences in grammatical complexity are a well-known finding from research on second language acquisition (Grant & Ginther 2000; Hinkel, 2002, 2005; Reynolds, 2005). However, most previous studies focus on individual features and cannot give an overview of second language writing as a whole. As second language speakers develop as writers, they increase their use of more complex grammatical structures, such as nominalizations, subordination and passives (Grant & Ginther, 2000: 140). Apart from the two very general features of informational density (type/token ratio and word length), several noun phrase specific structures can be identified: nominalisations (V14), gerunds (V15), total other nouns (V16), attributive adjectives (V40), and predicative adjectives (V41). However, unlike Biber’s (2006) finding that literacy contrasts in the first dimension of MD models with orality, we have no corresponding set of features encoding the orality dimension in our dimension 1. This is not surprising, since we compare two written corpora. In previous research, we showed that while there are some minor correspondences between the TLE and certain spoken registers, these are not substantial enough to show up in multidimensional models (Van Rooy & Terblanche 2006, Van Rooy 2008).
Thus, we propose to regard the data only in terms of the literacy dimension, but then postulate that high positive dimension scores will be indicative of advanced literacy, in contrast to lower literacy levels. This dimension overlaps substantially with the negative features of the first dimension of Biber’s (2006) model of university language, where he terms this collection of features literate discourse. It also includes all negative features of the 1988 model, where they are labelled informational production. Likewise, in Reppen’s (2001) study on language for and by children, a first dimension with positive features overlapping with our features was identified and labelled as edited information discourse, and found in the school textbooks written for but not by children.
Table 1: New factorial pattern
Dimension 1
V40_attr Adjectives          0.73
V16_noun all                 0.67
V64_phras coord              0.64
V39_prep. Phrase             0.62
V44_word length              0.55
V14_nominalisation           0.51
V27_past part WHdel          0.45
V43_TTR                      0.45
V15_gerund                   0.42
V28_pres part relatives      0.40
V41_pred Adjectives          0.40
Dimension 2
V3_present tense             0.61
V8_3p pronoun                0.59
V31_WH rel subj              0.59
V35_causal subord            0.58
V19_be main verb             0.44
V18_passive by               0.40
V38_Adv subord               0.39
V52_mod possibility          0.39
V24_infinitive               0.38
V11_indef pronoun            0.37
V66_synt negation            0.34
V10_dem pronoun              0.32
Dimension 3
V59_contractions             0.56
V7_2p pronoun                0.54
V6_1p pronoun                0.53
V49_emphatic                 0.45
V12_do pro-verb              0.40
V13_dir WH-question          0.37
V50_discourse part           0.30
Dimension 4
V1_past tense                0.61
V55_public verbs             0.52
V2_perfect                   0.44
V5_time adverbials           0.43
V23_wh clause                0.42
V46_down toner               0.35
V21_that verb-comp           0.32
V36_concessive               0.31
V33_pied piping              0.31
V20_exthere                 -0.38
V4_place adverbials         -0.53
Dimension 5
V63_split auxiliaries        0.75
V53_mod necessity            0.70
V17_pass agentless           0.64
V54_mod prediction           0.61
V37_conditional              0.55
V67_anal negation            0.42
V57_suasive verbs            0.39
V42_adverbs                  0.38
V61_stranded prep            0.30
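Dimension scores are derived from a factorial pattern such as Table 1 by combining standardised feature frequencies. The following Python sketch assumes the procedure of Biber (1988): each feature is standardised to a mean of 0 and a standard deviation of 1 across texts, after which the features with positive loadings are added and those with negative loadings are subtracted. The three loadings are Dimension 4 values from Table 1; the per-text frequencies, the function names, and the choice of the sample standard deviation are hypothetical.

```python
from statistics import mean, stdev


def standardise(values):
    """Rescale values to mean 0 and standard deviation 1 (sample SD assumed)."""
    centre, spread = mean(values), stdev(values)
    return [(value - centre) / spread for value in values]


def dimension_scores(rates, loadings):
    """Add z-scores of positive-loading features, subtract negative-loading ones."""
    n_texts = len(next(iter(rates.values())))
    scores = [0.0] * n_texts
    for feature, loading in loadings.items():
        sign = 1.0 if loading > 0 else -1.0
        for index, z_score in enumerate(standardise(rates[feature])):
            scores[index] += sign * z_score
    return scores


# Three of the Dimension 4 loadings listed in Table 1.
dimension_4 = {
    "V1_past tense": 0.61,
    "V20_exthere": -0.38,
    "V4_place adverbials": -0.53,
}

# Hypothetical frequencies per 1000 words for three texts (not study data).
rates = {
    "V1_past tense": [10.0, 20.0, 30.0],
    "V20_exthere": [3.0, 2.0, 1.0],
    "V4_place adverbials": [6.0, 4.0, 2.0],
}

print(dimension_scores(rates, dimension_4))
```

Only the sign of each loading enters the score in this sketch; the size of the loading decides whether a feature is included at all, which Table 1 has already settled.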
Passage 1 illustrates the very high frequency of nouns, nominalisations and attributive adjectives that occur in the LOCNESS corpus. The frequency of nominalisations is much higher in the native speaker corpus than the TLE.
Overall, passage 1 is a text that uses grammatically complex linguistic features visibly more than passage 2 from the TLE corpus, and as a consequence, information is presented much more densely in passage 1 than in passage 2. While spelling mistakes were corrected before analysing the data, the sample passages below are all from the raw, unedited corpora.
(1) Word count: 90 words; nouns (33/100 words); nominalisations (4/100 words); attributive adjectives (9/100 words)
Alcoholism is a growing problem in the United States today that affects all ages. Too many students fight alcoholism in high school and college environment. This problem could easily be curtailed by lowering the drinking age from twenty-one to eighteen. Changing the drinking age from twenty-one to eighteen would lower the amount of crimes among young adults, encourage a more responsible approach to alcohol in the United States and improve the health of the nation. Allowing alcohol consumption at age would change the way America viewed alcohol-use as a society.
(2) Word count: 196 words; nouns (15/100 words); nominalisations (1/100 words); attributive adjectives (2/100 words)
Poverty is the cause caeus people in Africa are very poor to can surpport themselves and their families so some of those people in order for them to survave they go to the street and just sell their body so that they can get the money and buy food for their famillies. Because of poverty some of us can not go to school and study so that we can get a better jobs and make an hounest living, that is why some of us go out there and sell our selfs. And at the end one endup getting HIV/AIDS because of it - not only can we get HIV/AIDS by selling our bodies. Some of the people do not have places to stay and because it is cold outside and they dont have food they just go to some strangers and ask for some help, so a stranger will take an advantage of that poor person. On the other hand our government is giving out free condoms that are not even 100% safe so people just go for those condoms because they can not afford to by that ones that are bein sold at the camisty.
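The per-100-word figures given with passages 1 and 2 use the same kind of normalisation that Section 2.2 applies per 1000 words. A small Python sketch of that calculation, together with a type-token ratio restricted to a fixed window, is given here. The function names, the whitespace tokenisation, the raw count of 30, and the reading of 'normalised to 200 words' as 'computed over the first 200 tokens' are all assumptions made for illustration.

```python
def rate_per(count, text_length, basis=1000):
    """Normalise a raw frequency to a rate per `basis` words."""
    return count / text_length * basis


def type_token_ratio(tokens, window=200):
    """Distinct word forms divided by tokens, over the first `window` tokens."""
    sample = [token.lower() for token in tokens[:window]]
    return len(set(sample)) / len(sample)


# A hypothetical count of 30 nouns in a 90-word passage, expressed per 100 words.
print(round(rate_per(30, 90, basis=100)))

# Hypothetical token sequence; real essays would be tokenised more carefully.
tokens = "the cat saw the dog and the dog saw the cat".split()
print(type_token_ratio(tokens))
```

Fixing the window matters because the type-token ratio falls as texts get longer, so ratios computed over essays of 392 and 1075 words on average would not be comparable.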
Dimension 2 can be regarded as an indication of transparency. It overlaps with six of the features on the positive side of Dimension 1 in the Biber (1988) model, labelled involvement. The features that occur in both models are present tense verbs, causative subordination, BE as main verb, adverbial subordinators (which have a higher loading on factor 5), possibility modals, indefinite pronouns and demonstrative pronouns. It likewise shows a degree of overlap with the positive features of the first dimension, oral discourse, in Biber (2006). This can be seen through the use of present tense verbs which describe actions in the immediate context of interaction (Biber 1988: 105). Overt cohesive devices such as causal and other subordinators are used. Various pronominal forms such as third person, indefinite and demonstrative pronouns occur as grammatical means to achieve reference cohesion.
Biber’s (1988) model contains a large number of features loading on Dimension 1. The features with a negative loading are associated with high informational density in a text. However, the interpretation of the positive features is more complex. Biber (1988: 105-107) describes these features as representing an interactive focus on the one hand and the effects of real-time planning constraints on the other hand. Our model helps to shed some light on the complex set of positive Dimension 1 features in the Biber (1988) model. Dimension 2 in the present study overlaps with the part of the original dimension that selects fairly plain wording and grammatical structures, and is much more verbal than nominal in focus. Another subset of features from Biber’s Dimension 1 overlaps with our Dimension 3, which can be interpreted in terms of a different style choice.
Our Dimension 2 features tend to overlap more with features that show evidence of real-time constraints, resulting in generalised lexical choice and sequentially structured, non-integrated information, combined with very explicit marking of particular cohesive relations. Timed student essays may well have this effect on occasion, where students start writing before planning adequately, and therefore present fragmented (rather than integrated and dense) information. Passage 3 is an excerpt from the TLE corpus that contains a high frequency of third person pronouns, as well as causal and adverbial subordinators. These features are all typical examples of a plain and direct style. The fourth passage shows that Dimension 2 reflects a choice of style rather than a limited access to grammatical features, since it shows an example where another TLE student avoids using these features: (3)
Word count: 138 adverbial subordinators (1.4/100 words) causal subordinators (1.4/100 words) third person pronouns (4.3/100 words)
One can describe being poor as having no many. Most of people in Africa have no money to survive so they find different ways of find money example of prostitution. There are cases whereby ladies trade
sex for money in order to have money with different number of people. They usually practice unsafe sex because their costumer cannot pay for a protected sex so there is high risk of getting HIV/Aids through this practice though they get money. Since Africa is not devoloped there is poor health. Most of poor people end up eating unheath food because they do not have money to boy heath food. Africa does not have good and many heath faciliticies which its people can get medicines cheap to cures sexual transmitted diseases, tuberculosis and other HIV/aids related diseases before they develop to Aids. (4)
Word count: 150 words adverbial subordinators 0 causal subordinators 0 third person pronouns (2.7/100 words)
Many countries of Africa are poor and this means that the population also is poor. Most of these countries are over populated, this means that not everyone in the country will be able to get a job even if they are educated and these people who are not working are the ones who are involved in some activities like being prostitudes in order to make a living. Unprotected sex can be dangerous as it spreads an uncurable disease called HIV/AIDS. When more people get infected, the country have to buy or import expensive medicine from other continents in order to cure people. People should be given condoms to reduce the spread of AIDS. People should also be advised by the social workers and nurses on dangers of engaging themselves in unprotected sex. Students should also be taught about the dangers of involving themselves on sex whilst they are still young.
Dimension 3 seems relatively straight-forward to interpret; it captures a range of very typical informal style features and overlaps fully with a subset of the positive features in Biber’s Dimension 1 in the 1988 model. The overlapping features are: contractions, second person pronouns, first person pronouns, emphatics, do as pro-verb, direct WH-questions and discourse particles. This means that all of the features that load on our Dimension 3 were originally on Biber’s Dimension 1. High loadings on these factors reflect an informal writing style, typical of texts with a high degree of involvement. Likewise, in Reppen’s (2001) study of children’s language, a third dimension, labelled involved personal discourse was identified, which overlaps to an extent with our Dimension 3. As noted earlier, the split between Dimensions 2 and 3 draws apart two different aspects of the positive features on the first dimension identified by Biber
(1988). Our Dimension 2 reflects a style choice between presenting information in a planned manner and presenting it in a more fragmented manner, under real-time planning constraints. On the other hand, Dimension 3 is a purer type of style dimension, where greater involvement of the writer and more informal style choices correspond with a high dimension score. Passage 5 is an example from LOCNESS where the frequent use of first person pronouns, contractions and emphatics reflects an informal writing style. This indicates that Dimensions 2 and 3 reflect a choice of style, since some native speakers evidently choose to write in a more informal manner. Passage 6 is from the TLE and contains only one first person pronoun and none of the other features: (5)
Word count: 157 first person pronouns (8/100 words) contractions (4/100 words) emphatics (2/100 words)
Upon entering college I didn’t know I would still have a curfew. Nor did I know I would be treated as if I were age thirteen. I thought if I had a male guest, friend, brother, or cousin, they could spend the night. I guess if I were a resident of one of the “special” dorms A could, co-ed. if some dorms can have overnight visitation all of them should. Just because a dorm is co-ed doesn’t mean overnight visitation is allowed. They still have a 2 a.m. curfew. A friend of mine that’s a Bates House resident just has another resident of the opposite sex to sign her mate guest in and he spends the night with her. She’s not the only resident doing it. Students in universities and colleges should not have to sneak around just to spend quality time with someone. We’re not at home we don’t have certain luxuries anymore like a car. (6)
Word count: 236 Bold=first person pronouns (0.4/100 words) Italics=contractions (0) Underlined=emphatics (0)
In South-Africa, North-west is one of the best tourists attraction. The proble is that the industry is still growing. It is not like in other country like United State of America were the tourism industry there is very big. They can even see the cannon that was used by the Barolong to defeat the British. The Taung skull heretage sites is also very attractive to the tourist because it is known all over the world. That place is very known because of the skull that was found in a cave at Taung. That
skull scientics they were disagree that it was not a human skull but they end up agree that maybe it was an ape skull. Those animal they are more or less the same as human being. The were working straight like human but their back was to a beat carve. The tourists can go to that place see and the community can benefit. By selling food and some of African pottery and dressing. The built environment can also attract tourist. Places like museum. let us take Mafikang Meseum as example. The Museum is a place were thing that have been used in the past were stored. The tourists can found information of the history of the place and anything that was from the past. The history of the war of Boroling and the British, the warren fought, war between the british and Barolong.
The features that group together as Dimension 4 constitute a slightly less coherent set. A number of them deal with marked forms within the verb phrase, but at least downtoners, concessive adverbials and pied piping constructions do not fall in this category. It is also the only dimension with negative features, the existential there and place adverbials. The positive features overlap in part with the positive features of the third dimension in Biber’s 2006 model, where he regards the features as indicative of a reconstructed account of events. These features are that verb complements and past tense verbs. There is also some overlap with the positive features of Dimension 2 in Reppen (2001), which she terms lexically elaborate narrative. One way of analysing these features is that they represent a style of writing that is more nuanced and precise. Events are properly situated in time through the use of the past tense, perfect aspect and/or time adverbials, suitably hedged by means of downtoners and concessive adverbials, and attributed to appropriate sources of origin through public verbs with that-clause complements or WH-clause complements.
For example, the concessive adverbial subordinators are used to introduce background information or for discourse framing (Biber 1988: 236). On the negative side, the use of the existential there and place adverbials serves to highlight and particularise information, without necessarily presenting it in a more subtle manner. As a provisional label, we propose contextualisation of information for Dimension 4. The use of past tense verbs and the perfect aspect gives a reconstructed account of events in passage 7. The use of concessive adverbial subordinators and public verbs reflects a precise and nuanced text which contains subtle contextualisation, for example through the use of public verbs to specify the acknowledged sources in the text. This contextualisation of information is absent in passage 8 from the TLE, which is emphatic and forceful: (7)
Word count: 182 Concessive adverbial subordinators (1.6/100 words) Public verbs (1.6/100 words) Past tense and perfect aspect (4.9/100 words)
His optimism is however renewed on his arrival in South America. The naïve Candide remarks on how the sea and climate are much better here than in Europe and so decides that this must certainly be ‘le meilleur des mondes possibles’. When Candide remarks that although Pangloss said everything was for the best, he noticed that things always went badly in Westphalia. But this is not a complete rejection of the philosophy of optimism. It is not until his meeting with Cacambo that Candide realises how naïve Pangloss’s views were, and also how restricted they were. He decides that the views of a person can be changed by travel such as has happened to him. At the end of the ‘comte’ we see Candide and Pangloss much more resigned to their fate. Although the thing which Candide has been pursuing all through the novel, that is Cunégonde does not quite turn out as he expected. Although this work would appear to be light hearted, it does contain a very real condemnation of the attitudes of society and the naïve philosophy of optimism. (8)
Word count: 182 Concessive adverbial subordinators 0 Public verbs 0 Past tense and perfect aspect 0
My friend I would so much wish to advise you to open a saving account at ABSA bank, because at ABSA banker are provided with first preverance. The staff of ABSA a well training in serving bankers. They are aware that bankers are the people who brings in money in their bank otherwise the bank would be closed. The warmth, love and the way they welcome you is realy impressive. You will wish to have all your banking with them. Their service is really excellent. You feel so welcome to ask as much questions as you wish. Even if you want to see the manager you are allowed to. The ABSA bank is truelly secured. At the door there is a securityguard always. There are too many people coming to bank and some enquiring about savings, fixed deposit, loans and withdrawings. The que is running so fast that you don’t spend too much time in the bank, and the ABSA bank has many branches. Even in one two town you get too many banks. Their interest are higher than any other banks.
The fifth dimension in our model overlaps largely with the fourth dimension, overt expression of persuasion, in the Biber (1988) model. The five linguistic features that occur in both models are prediction modals, suasive verbs,
conditional adverbial subordinators, necessity modals and split auxiliaries. The only feature that occurs in Biber’s (1988) model, but not in this study, is infinitives, which load on our Dimension 2. Dimension 5 in the present model goes even further by incorporating other features, and can simply be regarded as the persuasive dimension in student writing, a characteristic that has been identified as very important by Grabe and Biber (1987). Of course, the topics given to the students invited argumentative writing, so the persuasiveness is not unexpected. What is unexpected in student writing, however, is the extent to which such features are used, outscoring political speeches and newspaper editorials in terms of persuasiveness. The final passage is an example of persuasive writing from the TLE corpus, a style which is typical of student writing in general. The most obvious marker for this dimension is suasive verbs, but necessity and predictive modals, various adverbs, as well as the conditional adverbial subordinator if all reflect a persuasive text: (9)
Word count: 353 words suasive verbs (1.1/100 words) necessity and predictive modals (3.7/100 words) conditional adverbial subordinators (1.4/100 words) adverbs (3.4/100 words)
I fully agree with the topic that poverty is the cause of the HIV/AIDS epidemic in Africa. I think that if it was not for poverty or if everybody was rich in Africa then this HIV/AIDS epidemic would not be spreading so rapidly in our beloved country. Today you will find young people leaving their homes saying that they are going to look for jobs only to find out that there aren’t jobs out there. They end up on the streets and the only way to survive will be to get boyfriends so that you may get some sort of income. You are definitely going to go for the cash thinking that it is only for that time it will pass and at least you’ve got money to buy food and clothes to get going. Everywhere you will hear people say nasty things about prostitutes. The honest fact is those people did not ask to be what they are now. If everyone was rich, the world would be a better place to live on. Everyone will be concentrating on his or her belongings. No one will be short of anything that will make her or him to end up in the street. Now rich people know that they can go out there hunting for those in need and asking for the impossible from them. Now because the HIV/AIDS epidemic does not have its own people or only specific type of people you would not tell if one has it or not. What I think can be done is our government can give us free education and not ask for the so called experience so that we can all get jobs and be able to maintain our families and that way will be
fighting a lot of things. Shooting two birds with one stone is a great thing to do. If our adults can afford then this prostitution and being charmed by people who can afford will come to an end. If we started that way then the HIV/AIDS epidemic would also be stopped. People will now not have a reason for being prostitutes.
3.2 Dimension scores
Table 2: Mean dimension scores for the two corpora, together with standard deviations and Cohen’s d-value. Large effect sizes are indicated by an asterisk.
Dim  Label              Mean LOC  Std.Dev. LOC  Mean TLE  Std.Dev. TLE  d-value
1    Advanced literacy      7.44          4.42     -3.65          5.82    1.91*
2    Transparency           1.48          3.17     -0.73          6.80    0.33
3    Informal style        -0.04          3.62      0.02          3.53    0.02
4    Contextualisation      5.00          3.98     -2.45          3.40    1.87*
5    Persuasion             2.24          3.93     -1.10          5.21   -0.64
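As a rough check on Table 2, the reported d-values can be recomputed from the tabulated means and standard deviations. The sketch below assumes that d is the absolute difference between the two means divided by the larger of the two standard deviations. That formula is not stated in this section; it is an assumption, chosen because it agrees with the tabulated magnitudes to two decimal places. The function names and the 0.8 cut-off for a large effect (Cohen's conventional benchmark) are illustrative choices and are not taken from the authors' procedure.

```python
# Rows copied from Table 2: (label, mean LOC, sd LOC, mean TLE, sd TLE).
TABLE_2 = [
    ("Advanced literacy", 7.44, 4.42, -3.65, 5.82),
    ("Transparency", 1.48, 3.17, -0.73, 6.80),
    ("Informal style", -0.04, 3.62, 0.02, 3.53),
    ("Contextualisation", 5.00, 3.98, -2.45, 3.40),
    ("Persuasion", 2.24, 3.93, -1.10, 5.21),
]

# Conventional benchmark for a "large" effect (assumed here, not from the text).
LARGE_EFFECT = 0.8


def effect_size(mean_a, sd_a, mean_b, sd_b):
    """Absolute mean difference divided by the larger standard deviation.

    ASSUMPTION: this variant of Cohen's d is inferred from the table values,
    not documented in the section itself.
    """
    return abs(mean_a - mean_b) / max(sd_a, sd_b)


def d_values(rows=TABLE_2):
    """Map each dimension label to its (unrounded) effect size."""
    return {
        label: effect_size(mean_loc, sd_loc, mean_tle, sd_tle)
        for label, mean_loc, sd_loc, mean_tle, sd_tle in rows
    }


if __name__ == "__main__":
    for label, d in d_values().items():
        marker = "*" if d >= LARGE_EFFECT else ""
        print(f"{label:<18} {d:.2f}{marker}")
```

Under this assumption only Dimensions 1 and 4 exceed the large-effect benchmark, which matches the two asterisks in the table. The sketch returns magnitudes only, so it does not reproduce the negative sign printed for Dimension 5.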
The dimension scores for the five new dimensions are reported in Table 2, alongside their standard deviations and a d-value, which evaluates the difference in means between the two corpora. The comparison makes it clear that there are major differences in the dimensions that incorporate grammatical resources that play a role in information transfer (Dimensions 1 and 4). By contrast, for the purer style dimensions (2, 3 and 5), the results are much closer together. It seems as if the style dimensions, particularly 2 and 3, can be interpreted in terms of choices between more transparent, and informal, or more longwinded, and formal. Both corpora contain essays that have higher positive and higher negative scores on these dimensions, as is clear from the relatively high values for the standard deviations, given the mean values. As far as persuasiveness is concerned, LOCNESS makes more frequent use of the relevant linguistic resources than the TLE. However, compared to other registers analysed by Biber (1988), even the TLE makes substantial use of the resources of persuasion. If dimension scores were calculated in terms of the original Biber dimension, using Biber’s standardisation algorithm, a positive score of 1.4 would be obtained for the TLE. While lower than the 4.5 of LOCNESS, this is higher than the vast majority of registers examined. The situation is very different for Dimensions 1 and 4 in our model. On these two dimensions, the TLE has strong negative scores, and LOCNESS has strong positive scores. It should be clear that the grammatical resources for information packaging and for conveying subtle senses about the information are not as readily available to the TLE writers. The overall effect is perhaps best illustrated by a comparison of extracts 7 and 8. In extract 7, more subtle
argumentation is presented about the information, contextualised in its historical context, with concession to other views. In extract 8, a very forceful argument is presented in the present tense, drawing on the general truth sense of the tense, with little concession to other views. While both passages have many nouns, the density is higher in passage 7, as is the density of adjectives, particularly attributive ones. This means that the expressive flexibility of the TLE writers is constrained by the availability of the relevant grammatical features.
4. Conclusions
Firstly, by extracting a new multidimensional model, it has been possible to detect that there are more grammatical differences than differences relating to a particular writing style. The dimensions that highlight grammatical complexity are Dimensions 1 and 4 on our model. These dimensions illustrate that the TLE writers do not enjoy the same access to linguistic features associated with the kind of grammatical complexity that allows for integrated, yet subtle presentation of information, as opposed to the native speaker writers who regularly incorporate these features into their writing. The scores on Dimensions 2, 3 and 5 are much closer together and do not distinguish between the TLE and LOCNESS to the same extent. This signifies that these dimensions reflect a certain style of writing rather than grammatical complexity. However, LOCNESS uses more of the features that are associated with persuasiveness. Both native and non-native speakers decide to use the linguistic features associated with style to a greater or a lesser degree. Thus, some TLE students write in a direct and plain style, while others write in more elaborate or ornamental ways. Likewise, some LOCNESS students make use of the direct and plain style, but others do not.
Secondly, the results validate the decision to extract a new multidimensional model, since it was possible to gain deeper insights into student writing. These insights would have been impossible if the study had focused on isolated linguistic features, because the conspiracy between different features to achieve functional effects would have been lost. Likewise, extracting a new feature model rather than using the dimensions of the original Biber (1988) model enabled us to separate style dimensions from grammar and information presentation dimensions in a way that the original model did not allow.
A final conclusion is that general dimension patterns exist, which emerge across different multidimensional models.
There are three basic patterns that can be isolated: firstly, a dimension with a dense informational/nominal structure, secondly a dimension with a strong oral and informal style and lastly a dimension that reflects the intensely persuasive nature of student writing. The L2 data in the present study differ much more from Standard English than any data used in previous projects. Therefore, the finding of similarities across very different multidimensional studies is strong support for a claim that certain dimensions are invariantly present across registers and varieties of English.
Notes
1. http://cecl.fltr.ucl.ac.be/Cecl-Projects/Icle/icle.htm#heading5
2. http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/locness1.htm
References
Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (2006), University Language: A corpus-based study of spoken and written registers. Amsterdam/Philadelphia: Benjamins.
Cohen, J. (1969), Statistical Power Analysis for the Behavioral Sciences. New York/London: Academic Press.
Conrad, S., & Biber, D. (eds.) (2001), Variation in English: Multi-Dimensional Studies. Harlow: Longman.
Giménez, J., & Márquez, L. (2004), ‘SVMTool: A general POS tagger generator based on Support Vector Machines’, Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC ’04). Lisbon, Portugal.
Grabe, W., & Biber, D. (1987), ‘Freshman student writing and the contrastive rhetoric hypothesis’, Paper presented at SLRF7, University of Southern California.
Grabe, W. & Kaplan, R.B. (1996), Theory and practice of writing: an applied linguistic perspective. London/New York: Longman.
Grant, L. & Ginther, A. (2002), ‘Using computer-tagged linguistic features to describe L2 writing differences’, Journal of Second Language Writing, 2: 123-145.
Hinkel, E. (2002), Second language writers’ text: Linguistic and rhetorical features. Mahwah: Lawrence Erlbaum Associates.
Hinkel, E. (2005), ‘Analyses of second language text and what can be learnt from them’, in: E. Hinkel (ed.) Handbook of Research in Second Language Teaching and Learning. Mahwah, N.J.: Lawrence Erlbaum. 615-628.
Mesthrie, R. (2006), ‘Anti-deletions in an L2 grammar: A study of Black South African English mesolect’, English World-Wide, 27: 111-145.
Nkemleke, D.A. (2006), ‘Some characteristics of expository writing in Cameroon English’, English World-Wide, 27: 25-44.
Reppen, R. (2001), ‘Register variation in student and adult speech and writing’, in S. Conrad & D. Biber (eds.) Variation in English: Multi-Dimensional Studies. London: Longman. 187-199.
Reynolds, D.W. (2005), ‘Linguistic correlates of second language literacy development: evidence from middle-grade learner essays’, Journal of Second Language Writing 14: 19-45.
Van Rijsbergen, C.J. (1979), Information retrieval. London: Butterworths.
Van Rooy, B. (2008), ‘A multidimensional analysis of student writing in Black South African English’, English World-Wide, 29 (3): 268-305.
Van Rooy, B. & Terblanche, L. (2006), ‘A corpus-based analysis of involved aspects of student writing’, Language Matters, 37 (2): 160-182.
Weaving web data into a diachronic corpus patchwork
Andrew Kehoe and Matt Gee
Research & Development Unit for English Studies, Birmingham City University
Abstract
This paper offers a reassessment of the role of web data in diachronic linguistic analysis. We introduce the diachronic search facilities provided by the WebCorp Linguist’s Search Engine, including the use of a new ‘heat map’ graph for the analysis of changes in collocational patterns over time. We illustrate how web data can be used to supplement data from standard corpora in lexicological studies. Our focus is on the vogue phrase credit crunch and the paper compares examples from standard corpora (BNC, Brown, LOB, Frown, FLOB) with those found in web-accessible newspaper texts. Contrary to previous studies, we do not rely on the web solely for the most up-to-date usage examples. Instead, we show how web-accessible texts dating back to the beginning of the 20th Century can be used to fill gaps in and sharpen the picture provided by standard corpora.
1. Introduction
The original WebCorp project (Kehoe & Renouf 2002, Renouf 2003) was an experiment to see whether we could develop a system to extract linguistic data from web text efficiently and present this to the linguist in as usable a fashion as it is presented in traditional corpora. The system (http://www.webcorp.org.uk) receives a word or phrase and other requirements from the user, passes these to a commercial search engine (Google, AltaVista, etc), and extracts the ‘hit’ pages from the search engine results. Each page is accessed and processed and the extracted concordances are presented to the user in a choice of formats. The WebCorp tool established that web text, though problematic, is nevertheless a resource that can complement corpus evidence with examples of usage that is rare, re-emergent, new or productive.
The WebCorp Linguist’s Search Engine (WebCorpLSE) is designed to bypass the commercial search engines upon which WebCorp relied as gatekeepers to the web.¹ WebCorpLSE is crawling and processing the web to build a 10 billion word (or 7 terabyte) text corpus, including a multi-terabyte ‘mini-web’, designed to act as a microcosm of the web itself (Kehoe & Gee 2007). In addition to the mini-web, WebCorpLSE has built a newspaper subcorpus, containing daily issues of UK broadsheets from 1984-present and recent issues of other UK and international newspapers. We have also worked with our university colleagues to build collections to assist in their research and teaching, including sub-corpora of blogs, science fiction and major English literary works. All collections are searchable via linguistically-tailored front-ends.
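The last stage of the pipeline described above, turning processed page text into concordance lines, can be illustrated with a minimal keyword-in-context (KWIC) sketch. This is not the WebCorp implementation: it leaves out the search-engine query and the page fetching entirely and works on text that is already in memory. The function name `kwic` and the `width` parameter are hypothetical, and the sample sentences are two of the BNC examples of credit crunch quoted in this paper.

```python
import re


def kwic(text, phrase, width=40):
    """Return (left context, node, right context) for each match of `phrase`.

    Matching is case-insensitive and anchored at word boundaries; `width`
    is the number of characters of context kept on each side.
    Hypothetical helper for illustration only.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


if __name__ == "__main__":
    sample = (
        "A credit crunch is the name economists give to a sudden reluctance "
        "among banks to lend money. That would cause a severe credit crunch."
    )
    for left, node, right in kwic(sample, "credit crunch", width=30):
        # Right-align the left context so the node words line up in a column.
        print(f"{left:>30} [{node}] {right}")
```

Keeping the left context, node and right context as separate fields is a design choice that makes it straightforward to sort or align the lines on the node, which is the presentation concordancers conventionally use.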
It is now generally accepted that web data are of value in supplementing evidence from traditional, or ‘standard’, corpora when examining linguistic change over time. Previous work has tended to turn to the web as a source of evidence of the very latest trends in language use and of new coinages not found in standard corpora. Mair, for example, in a study of change and variation in present day English, states that the best way to ‘minimise the risk’ of relying on the web as a corpus is to use it not as a stand-alone source of data, but in conjunction with tried and tested closed corpora. In diachronic work, such corpora are positively indispensable because they add the necessary element of time depth to the web. (Mair 2007: 236)
The approach described by Mair is, in part, necessitated by the bias in commercial search engines, like Google, toward the most recently updated pages and the difficulty in extracting older data from the web through these search engines (cf. Kehoe 2006). In this paper, we shall describe the corpus search tools available in WebCorpLSE and the new possibilities which these open up for diachronic linguistic study. We shall illustrate that carefully selected web data can, in fact, provide the necessary ‘time depth’ by overlapping with and filling gaps in the data provided by standard corpora. The web data can, thus, sharpen the diachronic picture presented by standard corpora rather than simply widening it at the most recent end of the timeline.
Our analysis will focus on a phrase which, given media preoccupations at the time of writing, may initially seem to be a perfect example of the kind of vogue construction for which linguists have, thus far, turned to the web for evidence: credit crunch. The phrase does not appear in the Oxford English Dictionary (OED) and was named as the Oxford University Press ‘Word’ of the Year for 2008², an honour frequently bestowed on new coinages.
One may therefore assume from this that credit crunch will not be found in standard corpora but, as we will show in the next section, there are examples of the phrase in corpora. In fact, the corpus-based Cambridge Advanced Learner’s Dictionary (CALD) includes credit crunch in its entry for another phrase, credit squeeze:
credit squeeze UK noun [C] (US credit crunch) INFORMAL a period of economic difficulty when it is difficult to borrow money from banks
(http://dictionary.cambridge.org/define.asp?key=18170&dict=CALD)
This dictionary (latest printed version 2008) is based on the Cambridge International Corpus: just over 240 million words of UK and US writing and speech, with an emphasis on business, legal, academic and financial English. It is likely that the supposed distinction between US credit crunch and UK credit squeeze was drawn from the last of these, a ‘collection of books, journals, newspaper articles relating to economics and finance’³. This distinction between UK and US usage is one which will be investigated in this paper. We shall use
standard corpora of British and American English and data from the web to examine usage patterns over time and determine to what extent the US/UK distinction in the CALD entry holds true. Initially, our analysis will focus on credit crunch and we will return to credit squeeze in section 4.
2. Evidence from standard corpora
The phrase credit crunch appears seventeen times in the British National Corpus (BNC) in texts from 1991-3, shown in Figure 1.
The Economist, 1991
1 The Federal Reserve is struggling to allay fears of a “credit crunch” – when banks are reluctant to lend except to the most creditworthy borrowers. (ABD 81)
2 Fears of a “credit crunch” have prompted policy changes at the Federal Reserve in recent months. (ABD 2335)
3 A credit crunch is the name economists give to a sudden reluctance among banks to lend money. (ABD 2339)
4 Typically, a credit crunch happens when banks start to worry about the creditworthiness of their borrowers. (ABD 2341)
5 A credit crunch – mild, as yet – is undoubtedly under way in America. (ABD 2347)
6 There is a risk, though, that the supply of credit will start to fall faster than the demand; in other words, that a credit crunch will start to drive the process of credit contraction. (ABD 2361)
7 The Bank of England, responding to fears of a credit crunch, has asked banks to think twice before turning away would-be corporate borrowers. (ABD 2367)
8 Such frightening costs undermine the credibility of the FDIC, because, if a banking crisis were to start, the government might find itself facing a credit crunch of its own. (ABD 2381)
9 This demand on the international capital markets raises interest rates, aggravating the problems of debt and credit crunch. (ABD 2386)
10 The Federal Reserve’s Alan Greenspan said the Fed would do what it could to ease America’s credit crunch. (ABG 3211)
11 Yet they, too, complain of aches and pains, of being squeezed by a “credit crunch” under which borrowing has become harder even while interest rates have been falling. (ABJ 3178)
12 There is no generalised credit crunch in Japan, but particular firms are being hurt. (ABJ 3982)
13 That suggests that a credit crunch is taking place, especially since banks are still under orders from the central bank not to increase lending to property companies beyond the overall rate of loan growth. (ABK 2395)
Daily Telegraph, electronic edition of 15/04/1992 (AKJ 453)
14 That would cause a severe credit crunch.
Unigram X, 1993 (CTG 399)
15 Debt-laden Tustin, California-based business systems supplier MAI Systems Corp appears to have hit a credit crunch according to the German weekly Computerwoche.
Keesings Contemporary Archives, Longman, 1991 (HLC 632)
16 It also took measures to ease the so-called “credit crunch “, mainly by relaxing regulatory pressures in order to encourage bank lending.
The Scotsman: Business section, unknown date (K59 3187)
17 Writing in the February issue of the Lloyds Bank Economic Bulletin, he says: “The restoration of financial balance will mean that, far from there being a credit crunch, banks are likely to continue to find very little net demand for loans from companies.”
Figure 1: All examples of credit crunch from the BNC4
It is clear from the BNC concordances that the phrase was current in British English in 1991, and also that a credit crunch was underway in the United States at that time and was in danger of occurring in the United Kingdom. However, the fact that the phrase occurs in double quotes, complete with a full gloss, in three articles from The Economist (whose readers may be expected to be more familiar with economic terms than readers of general audience newspapers) indicates that the phrase was still new and unfamiliar to the majority of UK readers. The BNC data seem to indicate that the phrase credit crunch, like the economic situation it describes, first occurred in the US, thus confirming the CALD definition.
This opens up the possibility of turning to another set of standard corpora: the Brown family, ‘corpora equivalently sampled from the language, though different in temporal as well as geographical provenance – as a means of identifying rather precisely how the use of the language developed over a period’ (Leech and Smith, this volume). The 1961 Brown and LOB corpora, with 1 million words each of written American English (AmE) and British English (BrE) respectively, contain crunch only in a literal sense. FLOB, the 1 million word BrE corpus from 1991, does not contain any instances of crunch (though it does include literal crunching and crunchiness). However, Frown, the 1 million word AmE corpus from 1992, includes three instances of crunch, all of which are used in a metaphorical sense to refer to financial situations, including one occurrence of credit crunch (Figure 2).
1. Hallinan introduced the legislation following an Examiner story that revealed that some city bureaucrats were commuting in style at taxpayer expense despite a severe budget crunch that has required reduction of some vital health services. (A25 15-18)
2. For all of Mr. Kornbluth’s cultural observations, the book is not yet written that closely tracks [US financier Michael] Milken’s persecution with the credit crunch and recession. (C12 89-91)
3. They lend legitimacy to the racist and misogynist stereotypes so popular with conservative politicians and disgruntled taxpayers who feel an economic crunch and are looking for someone to blame. (G23 160-163)
Figure 2: All examples of crunch from Frown corpus5
The limitations of the Brown family for lexical rather than grammatical studies, as pointed out by Leech and Smith (this volume), are clear from these results. A study of crunch based solely on Frown would have little choice but to conclude that the word is used only to refer to negative financial situations (the semantic prosody of severe, disgruntled and blame is clear). It is worth noting, however, that the authors of these AmE texts from Frown, unlike the authors of BrE texts from the same period in the BNC, do not feel it necessary to provide a gloss for credit crunch and use crunch in a wider sense to refer to a variety of financial situations. Indeed, the example of credit crunch in this corpus is mentioned in passing in an article which focuses on a different topic; it is given rather than new information.
3. Turning to the web
Nesselhauf (2007) makes the distinction between two types of web-based diachronic linguistic analysis. The first is the approach taken by us with the original WebCorp system: the analysis of short term changes in texts produced specifically for the web (Kehoe 2006). The second is the analysis of changes in ‘larger and/or earlier time-spans based on texts written for other media and later made available on the internet’ (Nesselhauf 2007: 287). WebCorpLSE moves us toward this second approach to web-based diachronic analysis. As outlined in section 1, the system provides access, via the web, to a variety of sub-corpora, many of which were compiled from web-accessible text collections such as Project Gutenberg. In this paper we focus on the WebCorpLSE newspaper sub-corpus. With regard to this text-type, Nesselhauf’s distinction between the two kinds of diachronic analysis becomes somewhat blurred in that modern newspaper articles are not produced ‘specifically for the web’ but nor are they made available on the web only at a later date. For the past decade, printed newspaper texts have been made available simultaneously on the web. We shall return to this point in section 3.2, which provides details of our newspaper sub-corpus and the kinds of diachronic search possible using WebCorpLSE. Before looking at this, we outline in 3.1 the restricted (though useful) provision for diachronic linguistic analysis in the web-based Google newspaper archive. Throughout our analyses in sections 3.1 and 3.2, we will attempt to confirm the accuracy of the CALD definition of credit crunch by examining:
i) what web data can tell us about credit crunch in AmE, including first occurrence
ii) what web data can tell us about the introduction of credit crunch into BrE
3.1 Google News
Google News (http://news.google.com) is a ‘news aggregator’: a website that collates, from multiple sources, news stories which may be of interest to an individual user and presents these on a single page. In addition, the Google News
site contains an archive of major international newspapers and magazines dating back over 200 years. More specifically, Google News provides a master index to several existing newspaper archives (New York Times, Washington Post, etc) and has begun to digitise print newspapers which were not previously available in electronic form.6 Google is working with publishers to make ‘millions of pages of news archives’ available, in facsimile and in a form searchable by keyword. The Google News Archive is not a corpus in the sense used by linguists. Accurate word frequency information is not available and only very limited word contexts are provided, as we shall show in the examples below. However, Google News does allow us to pinpoint when a particular word or phrase entered the lexicon of newspapers in the English-speaking world.7 By default, the Google News Archive search interface8 shows results in ‘relevance’ order, in a similar manner to a standard Google search. A secondary ‘timeline’ option allows the results to be viewed in date order, as shown in Figure 3 for the phrase credit crunch.
Figure 3: ‘Timeline’ results from Google News Archive for credit crunch
Figure 3 would initially seem to indicate that there are examples of credit crunch dating back to 1906. However, this output highlights a severe limitation of
Google News for linguistic search. Many of the years associated with articles in the results list are not the year the article was written but the year in which the event being discussed took place. For example, the first result in Figure 3 (listed as 1906) is actually from a book published in 2002, and the second result (1926) is from an article dated May 29th 2008 in the New Zealand newspaper Timaru Herald.9 The fundamental difference between the dates required for informational search and for linguistic search (cf. Kehoe 2006) makes Google News an inadequate search interface for the latter. It is undoubtedly useful to know that there was a credit crunch in 1906 but it is also clear that the term itself was not used at that time. The last example in Figure 3 encapsulates this as it is a genuine example from the Chicago Tribune of November 16th 1967, which states that there was a credit crunch the previous year. The point is that this 1966 credit crunch appears to have been referred to as such only in retrospect. Finding the earliest occurrence of a term with Google News is a rather laborious process. After finding the earliest genuine occurrence on the timeline by experimenting with different date ranges, it is necessary to switch back to the default view to determine if there are any earlier occurrences. As the default view does not show results in date order, all results must be examined. 
By carrying out this procedure, we found the earliest examples of credit crunch to be not the November example from the Chicago Tribune but the examples from earlier in 1967 shown in Figure 4.10
New York Times, June 4 1967
avoid a repetition of last year’s credit crunch
Washington Post, June 26 1967
highest interest rate since the 1920s - even a little higher than the rates late last summer during the credit crunch
Washington Post, June 29 1967
Is the Nation heading into another credit “crunch” like last year’s, with soaring interest rates, competition for savers’ funds, and a new slump in the housing industry?
New York Times, June 30 1967
danger that we will be moving toward another “credit crunch”. To avoid this, we urgently need greater fiscal restraint by the Federal Government
Hartford Courant, July 2 1967
Five change in federal housing laws, designed to prevent a credit crunch of the 1966 type, were proposed last week to a Senate committee by the National Assn. of Real Estate
New York Times, July 2 1967
Interest rates, the ‘topic and concern of the and financial these days, have been climbing steadily and fears of a new credit crunch similar to last summer...
Figure 4: Earliest examples of credit crunch in Google News, extracted manually
All the examples in Figure 4 refer to the credit crunch as something which happened the previous year. We cannot say so conclusively but, given that the Google archive contains editions of these and other newspapers from 1966 yet returns no hits from that year, it seems likely that the term did not appear in the public domain in the United States until 1967. In the next section, we shall outline how WebCorpLSE, running on a combination of offline newspaper archives and newspaper data extracted from the web, can be used to trace the introduction of the term credit crunch into the UK.
3.2 Newspaper corpora accessible via WebCorpLSE
We know from the BNC that the phrase credit crunch was used in the UK in 1991 but was not widespread and required explanation. Using the diachronic search facility in WebCorpLSE, we are able to trace the use of the phrase across a 25 year continuous span of UK broadsheet newspapers, segmented into months. The corpus contains 950 million tokens and consists of:11
i) a complete archive of The Guardian (1984-88)
ii) a complete archive of The Independent (1989-99)
iii) The Guardian, downloaded from the web (2000-08)12
This corpus combines the two kinds of web-based diachronic analysis outlined by Nesselhauf (2007). On the one hand, the Guardian articles from 2000 onwards are pre-existing web texts. On the other, the early Guardian and Independent articles are off-line resources, being made available online in a form suitable for linguistic study by WebCorpLSE.13
Figure 5: Frequency of credit#crunch across time in the WebCorpLSE newspaper archive (per million words) 14
The graph in Figure 5 shows the frequency of credit crunch across time in the WebCorpLSE UK newspaper corpus. All frequencies are normalised to account for the varying size of the monthly segments across the years. The dotted line is the normalised monthly frequency and the solid line is a 12 month moving average. We have been examining such graphs for several years but have never seen a case as extreme as this, where the frequency increases from fewer than 1 occurrence per million words to almost 120 per million words within a single year. One of the earliest occurrences of credit crunch in the newspaper corpus is in a sentence from an August 1988 Guardian article, which includes a definition of the term and an indication of its origin: Indeed there is a possibility of a US-style credit crunch, where interest rates are pushed up hard for a short period. However, the phrase is used only 22 times in the 7 years before 1991 and the monthly frequency never rises above 1 per million words. The two noticeable ‘blips’ in Figure 5, prior to the massive upward trend in 2007-8, are accounted for by the concordances in Figures 6 and 7. These concordances were produced by WebCorpLSE, with sentence span selected and the results sorted by date, from earliest to most recent.
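The normalisation and smoothing behind a graph of this kind can be sketched as follows. This is a minimal illustration rather than the WebCorpLSE implementation: the function names and the monthly figures are hypothetical, and we assume a trailing 12 month window, since the text does not say whether the moving average is trailing or centred.

```python
def per_million(hits, tokens):
    """Normalise a raw count to occurrences per million words."""
    return hits * 1_000_000 / tokens if tokens else 0.0


def moving_average(values, window=12):
    """Trailing moving average (an assumption; a centred window is also possible).

    Early points are averaged over however many months are available so far.
    """
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical monthly segments: (month, hits for the phrase, tokens in segment).
# These figures are invented for illustration and are not taken from the corpus.
segments = [
    ("2007-05", 1, 4_000_000),
    ("2007-06", 2, 4_000_000),
    ("2007-07", 12, 3_000_000),
    ("2007-08", 90, 3_000_000),
]

monthly = [per_million(hits, tokens) for _, hits, tokens in segments]
smoothed = moving_average(monthly, window=12)

for (month, _, _), raw, avg in zip(segments, monthly, smoothed):
    print(f"{month}  {raw:8.2f} per million   moving average {avg:8.2f}")
```

Dividing by the size of each monthly segment is what makes months of different sizes comparable; the moving average then damps single-month spikes such as the two ‘blips’ discussed above.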
Figure 6: WebCorpLSE concordances for credit#crunch from The Independent, January-February 1991 (case insensitive, sentence span)
Figure 6 shows the occurrences of credit crunch in The Independent in early 1991 which were responsible for the increase in frequency to a peak of 5.4 per million words in February of that year. Again we see some occurrences in quotes, complete with glosses (lines 24, 27, 31, 35 and 37) and other lexical signals such as ‘so called’ (cf. Renouf & Bauer 2001 on ‘contextual clues’). These concordances are contemporary with those from the BNC and it is clear from Figure 5 that, by chance, the BNC compilers captured a phrase, associated with a particular news story, which was at a peak of popularity in BrE.15 This again highlights the limitations of short time-span synchronic corpora in lexical studies. A study of credit crunch based on data from the BNC may overestimate the significance of the phrase in late 20th Century BrE. In order to trace the development of a word or phrase fully, it is necessary to use a larger monitor corpus like the newspaper sub-corpus in WebCorpLSE.
Figure 7: WebCorpLSE concordances for credit#crunch from The Independent, September-November 1998 (case insensitive, sentence span)
After 1991, credit crunch appeared rarely (fewer than 50 occurrences in 6½ years) until late 1998 when it appeared 67 times in 3 months, including the cases shown in Figure 7. This 1998 peak in the frequency of the phrase appears to have
been sparked by comments from the chief executive of Barclays Bank (mentioned by name in lines 138-141 and 146). As in 1991, this peak in credit crunch was fleeting and the frequency of occurrence had fallen back below 1 per million words by December 1998. It then remained at that level until July 2007, when the massive increase in frequency began. Turning to concordances from July 2007 (Figure 8), one is struck initially by the lack of quotation marks around credit crunch and lack of any explanation of the term.16 It may appear that, by this point, the phrase has entered the lexicon of the newspaper to the extent that the journalists no longer feel it necessary to provide an explanation when using it. However, if we then look at a selection of concordances from later in 2007 and into 2008 (Figure 9), with the frequency of credit crunch continuing to rise, we find further examples where credit crunch is defined by the writer. We also see early evidence of the increasing trend for metalinguistic discussion of the phrase credit crunch and its meaning.17
Figure 8: WebCorpLSE concordances for credit#crunch from The Guardian, July 2007 (case insensitive, sentence span)
Figure 9: Filtered WebCorpLSE concordances for credit#crunch from The Guardian, 2007-8 (case insensitive, sentence span)
A possible explanation for this lies in Figure 10, which shows the proportion of occurrences of credit crunch which appeared in each sub-section of The Guardian.
Figure 10: Proportion of occurrences of credit#crunch across sections of The Guardian, 2007-818
In the early months of 2007, the phrase appeared only in the ‘Business’ section. By July 2007 it was also appearing in the ‘Money’ and ‘Comment’ sections, and by August it had spread to ‘Media’ and ‘Life’. Eventually, in December 2008, credit crunch was appearing in all sections of the newspaper, including ‘Sport’, ‘Education’ and ‘Culture’, thus confirming the notion in Figure 9, concordance 11 that ‘the esoteric “credit crunch” has moved out of the so-called “interbank money markets” and into the consciousness and pockets of the British people’.
3.3 Collocational analyses
Although the filtering options in WebCorpLSE can be used to make manual data analysis a more manageable task, the number of results can be prohibitively large when dealing with frequent lexis. In her 1987 study of ‘lexical resolution’, using a corpus of 13 million words, Renouf concluded that ‘eventually a point may be reached in corpus development where all word forms in which there is a lexicological interest are sufficiently exemplified’ (Renouf 1987: 130). It could be argued that we have now gone beyond this point, to a situation where corpora are so large that, for all but the rarest word forms, we are presented with more concordance data than can be analysed manually. As a result, statistical analyses have become increasingly important. One way to examine the growth of credit crunch over time is to produce span 1 collocational statistics for one or both of the words which constitute the phrase. We have chosen to take credit as this is the more frequent of the two words in our corpus and we felt that an analysis of its collocates may provide more information about squeeze and other related words. Figure 11 shows the span 1 collocates of credit for all months up to and including December 1988 (with a stopword filter switched on), whilst Figure 12 shows the same information but with the time period extended 20 years to the end of the corpus (December 2008). A z-score calculation is used to compare the expected frequency of collocation (based on the frequencies of each word) with the actual, observed frequency. Such collocational statistics are now standard in corpus linguistics and they are undoubtedly useful, as in this case where they reveal that crunch, which did not appear as a statistically significant collocate of credit in 1988, had become its most significant collocate by 2008. (In fact, viewed from the opposite perspective, credit accounts for 90% of the significant collocates of crunch in L1 – immediate left – position in the corpus as a whole.) 
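The comparison of expected and observed collocation frequency can be illustrated with a short sketch. The exact formula used by WebCorpLSE is not given here, so the code below assumes one widely used formulation of the collocational z-score (often attributed to Berry-Rogghe); the function name, parameter names and the counts in the usage example are hypothetical.

```python
import math


def collocation_z_score(observed, node_freq, collocate_freq, corpus_size,
                        span_positions=2):
    """Z-score comparing observed co-occurrence with the frequency expected by chance.

    Assumed formulation (not confirmed as the one used by WebCorpLSE):
        p        = collocate_freq / (corpus_size - node_freq)
        expected = p * node_freq * span_positions
        z        = (observed - expected) / sqrt(expected * (1 - p))
    span_positions is the number of slots examined around each occurrence of
    the node word: 2 for span 1 (one word to the left, one to the right).
    """
    p = collocate_freq / (corpus_size - node_freq)
    expected = p * node_freq * span_positions
    return (observed - expected) / math.sqrt(expected * (1 - p))


# Invented counts for illustration only: a node word occurring 100 times in a
# corpus of 1,000,100 tokens, with a collocate that occurs 1,000 times overall
# and is found next to the node 10 times.
z = collocation_z_score(observed=10, node_freq=100, collocate_freq=1000,
                        corpus_size=1_000_100)
print(round(z, 2))
```

Under this formulation a rare word that occurs next to the node even a handful of times receives a high score, which is consistent with the observation in the text that some collocates are classed as significant partly because of their own newness and rarity.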
WebCorpLSE provides an enhanced collocation tool which allows the tracking of changes in collocational patterns across time. We refer to this as a collocational ‘heat map’, where heat is used as a metaphor for collocational strength. To generate a heat map, WebCorpLSE ranks all collocates of the target word in the whole corpus by z-score, and selects the top 200 significant collocates for further analysis. These are then broken down into groups by month and year to create a diachronic table of collocation frequency. The monthly z-scores are used to plot the strength of collocation on a graph by translating them into shades of red.
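The steps just described (rank by z-score, keep the top 200, break down by month, translate into shades of red) can be sketched as below. This is an illustrative reconstruction, not the WebCorpLSE code: the function names are hypothetical, the z-scores are assumed to have been computed already, the linear white-to-red scale is our assumption, and the monthly values in the usage example are invented.

```python
def top_collocates(overall_z, limit=200):
    """Rank collocates by whole-corpus z-score and keep the strongest `limit`."""
    ranked = sorted(overall_z.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:limit]]


def shade_of_red(z, z_max):
    """Map a z-score in [0, z_max] to a hex colour from white to pure red.

    A linear scale is assumed here; the real system may scale differently.
    """
    if z_max <= 0:
        return "#ffffff"
    strength = min(max(z / z_max, 0.0), 1.0)
    other = round(255 * (1 - strength))  # green and blue channels fade out
    return f"#ff{other:02x}{other:02x}"


def heat_map(overall_z, monthly_z, limit=200):
    """Build {collocate: {month: colour}} for the top-ranked collocates."""
    words = top_collocates(overall_z, limit)
    z_max = max(
        (z for scores in monthly_z.values()
         for word, z in scores.items() if word in words),
        default=0.0,
    )
    return {
        word: {month: shade_of_red(scores.get(word, 0.0), z_max)
               for month, scores in sorted(monthly_z.items())}
        for word in words
    }


# Whole-corpus scores (from Figure 12) and invented monthly scores.
overall = {"crunch": 4149.97, "card": 4010.65, "squeeze": 233.01}
monthly = {
    "1991-02": {"crunch": 10.0, "card": 40.0},
    "2008-10": {"crunch": 80.0, "card": 20.0},
}
grid = heat_map(overall, monthly, limit=2)
for word, row in grid.items():
    print(word, row)
```

A collocate absent in a given month is treated as having zero strength and is drawn in white, which corresponds to a word ‘fading’ from the map.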
Collocate      L1    TOT     R1    Z-score
card            1    807    806     619.54
cards           6    507    501     419.92
consumer      225    225      -     183.95
Suisse          -    104    104     101.68
Family        102    102      -      92.72
family        193    194      1      87.76
boom            -     90     90      80.21
Consumer       80     80      -      76.48
rating          -     82     82      75.50
export         83     83      -      75.42
Export         72     73      1      71.41
Guarantee       -     64     64      62.78
Act             -     67     67      51.52
bank           71     74      3      46.36
Lyonnais        -     44     44      42.67
reference       -     48     48      41.36
scoring         1     47     46      40.94
controls        -     40     40      34.97
facilities      -     41     41      34.52
tax            62     65      3      33.65
lines           -     42     42      31.65
limit           2     35     33      28.89
insurance       1     37     36      26.85
balances        -     26     26      24.45
unions          -     31     31      24.30
Figure 11: Significant span 1 colls. of credit, up to end of 1988
Collocate         L1    TOT     R1    Z-score
crunch             3   7031   7028    4149.97
card              32  17249  17217    4010.65
Suisse             1   3311   3310    2739.34
cards             54   8825   8771    2478.06
Lyonnais           2   1638   1636    1461.61
rating             8   2053   2045     964.25
Consumer        1183   1186      3     774.05
tax             4329   4333      4     597.99
consumer        1430   1435      5     389.45
Agricole           1    369    368     355.72
Tax              486    490      4     334.70
Counselling        2    349    347     325.30
reference          -    751    751     260.96
deserves         518    518      -     256.05
pension         1048   1050      2     248.81
ratings            2    551    549     239.53
squeeze            4    424    420     233.01
export           508    509      1     217.58
balances           2    266    264     196.68
Card               1    236    235     196.14
interest-free    209    209      -     185.52
unions             9    684    675     183.78
Family           429    431      2     181.24
facility           2    343    341     178.28
markets           15    856    841     174.02
Figure 12: Significant span 1 colls. of credit, up to end of 2008
Figure 13 is a heat map for the span 1 collocates of credit from 1985 to 2008.19 This output highlights the fine-grained approach to collocation provided by WebCorpLSE heat maps. We see Lyonnais, a strong span 1 collocate of credit for over 10 years, disappear from the map in 2003, at the point when the French bank Credit Lyonnais became known as LCL. We also see Family disappear and Tax appear in 1998-9, when the ‘Working Families' Tax Credit’ replaced ‘Family Credit’ in the UK welfare benefit system. These are not linguistically interesting examples in themselves but they indicate that the methodology is sound and allow us to draw more meaningful conclusions when, for example, reference, ratings, histories and limit become strong collocates of credit (relating to ‘debt worthiness’). Figure 13 also captures the cyclical nature of credit crunches, with crunch appearing as a significant collocate of credit for specific short periods (1991-2, 1998-9) before ‘fading’ out of use again. We also see squeeze appearing as a span 1 collocate of credit at similar, but not identical, points in time (appearing more gradually from 1988-91, and weakly in 1993-4 and 1998-9). We shall examine squeeze in section 4.
Figure 13: Top of ‘heat map’ for span 1 collocates of credit (case insensitive) in newspaper corpus 1985-2008 (left and right collocates)
Both crunch and squeeze re-emerge as strong collocates of credit in 2007-8 and it remains to be seen how long this particular event will last. Given that the phrase credit crunch is being used more frequently than ever before and that collocates indicating severity (global, crisis) also appear as strongly significant in 2007-8, it would seem that it will be much longer before this particular credit crunch fades from the heat map.
We should also note in our discussion of collocation that WebCorpLSE allows the generation of collocates for any search term and is not restricted to single-word searches. Figure 14 shows the span 4 collocates of the phrase credit crunch over time. Until 2007, the phrase had few statistically significant collocates, though banks first appeared in 1991 and global, fears and markets had appeared by 1999 (the time of the second ‘blip’ in Figure 5). By 2008, there is a long list of words describing the credit crunch, its causes and effects, some of which are classed as significant as a result of their own newness and rarity (subprime, write-downs). It will be interesting to monitor changes in the collocational profile of credit crunch in future years.
4. A brief discussion of credit squeeze
Space does not permit a full discussion of credit squeeze but we have conducted a diachronic analysis of the phrase and will summarise the main findings here. Unlike credit crunch, credit squeeze does appear in the OED, under the headword credit (Figure 15).
Figure 14: Top of ‘heat map’ for span 4 collocates of credit#crunch (case insensitive) in newspaper corpus 1991-2008 (left and right collocates)
14. attrib. and Comb.[...] credit squeeze, the restriction of financial credit facilities through banks etc.
1955 Times 18 July 15/1 As early as last February I applied a little of the curb-what is sometimes called the credit squeeze.
1957 Britannica Bk. of Year 511/2 A verb-form to credit-squeeze, to restrict investment or speculation by reducing financial credits.
1962 H. O. BEECHENO Introd. Bus. Stud. xiv. 138 ‘Credit squeezes’-i.e. making it more difficult to obtain loans from banks and, perhaps, restricting hire purchase business... This check can be applied selectively.
Figure 15: OED definition of credit squeeze
The whole phrase does not appear in the Brown family of corpora but there is one occurrence of squeeze in this sense in the BrE LOB corpus:
The big “squeeze” means that it is going to be more difficult to arrange a loan or overdraft. (A06 206-207; Daily Sketch, 4 August 1961)
The phrase is not quite as frequent as credit crunch in the BNC, appearing 13 times in texts from 1976-93 (Figure 16).
1 In 1974 his property and investment group also faced problems brought on by a credit squeeze and downturn in the building market. (AAS 11: Guardian Business section, 31/12/89)
2 The capital standards, negotiated through the Bank for International Settlements (BIS), are a natural scapegoat for the credit squeeze that is deepening the recessions in Britain and America and may provoke one in Japan. (ABE 159: Economist, 1991)
3 The higher interest rates and credit squeeze control used by the Conservatives did, however, slow down growth in the economy overall. (CRD 480: Engineers, managers and politicians, 1993)
4 The Conservatives had clearly let the economy overheat for electoral advantage in 1955, but as soon as the election was over, clamped down with a credit squeeze. (CRD 559: Engineers, managers and politicians, 1993)
5 Foreign business also has a more practical complaint: because of China’s credit squeeze, bills are no longer paid on time. (EDU 578: Marxism Today)
6 In Britain the apparently smooth growth during the long boom was marked by dramatic events that, at the time, seemed to be crises: for example, the 1957 credit squeeze and record interest rate jump (FA0 588: Restructuring Britain: the economy in question, 1988)
7 In a less obvious but equally influential manner, if a credit squeeze is applied as a macroeconomic policy, the resulting high interest rates will reduce the number of people able to take out mortgages. (FB2 719: Rural Britain: a social geography, 1985)
8 It won’t be affected by the credit squeeze ...? (G0F 1376: Sweet dreams, 1976)
9 This is true in that consumer demand has collapsed as a result of the credit squeeze (G38 485: Marketing Week, 17/01/92)
10 Britain therefore experienced a credit squeeze in the early 1990s during a period of recession in much the same way -- and for much the same reasons -- that she experienced a credit boom during the period of growth and “overheating” in the mid-1980s. (H91 296: A treaty too far, 1992)
11 The government responded to the payments crisis with a credit squeeze. (K8U 225: Capitalism since 1945, 1991)
12 This situation would occur in circumstances as in the late 1960s, when due to a credit squeeze, interest rates rose. (K8W 1292: UK financial institutions and markets, 1991)
13 Second and simultaneously, in order not to release a consumer credit squeeze that would second imports, they should introduce controls on the supply of credit (KRT 3495: Fox FM News: radio programme)
Figure 16: All examples of credit squeeze from the BNC
It is noticeable that credit squeeze appears far more in the BNC in books, discussing past events (sources underlined in Figure 16), than in news stories. These results are significantly different from those for credit crunch and may indicate that crunch was in the process of replacing squeeze in this context in BrE texts discussing current events. We cannot, of course, draw this conclusion purely from an analysis of the BNC or other standard corpora, for reasons outlined above. However, a diachronic analysis of our UK newspaper corpus using WebCorpLSE (Figure 17) does provide further evidence for this.
Figure 17: Frequency of credit#squeeze across time in the WebCorpLSE newspaper archive (per million words)
The phrase credit squeeze appears in the newspaper corpus earlier than credit crunch (1984 versus 1987) but there are only 422 occurrences of the former, compared with 7069 of the latter, and squeeze does not reach the same peaks in frequency reached by crunch. We also used Google News to extract the earliest occurrence of credit squeeze in newspapers, in the same way described above for credit crunch. This revealed the earliest occurrences to be in two New York Times articles from 26th March 1929 (complete with Google OCR errors):
alt of which have been recently b3 the stock market. threw out the intimation that a credit squeeze of major proportions was inevitable if the use of ...
The tightest credit squeeze in almost nine years tools place On the S:Ork Stock Exchan=a yesterday, when the call loan rate advanced to 74 per cent
These early occurrences of the phrase in AmE are contrary to the claim in the Cambridge Advanced Learner’s Dictionary (quoted in Section 1) that credit squeeze is a UK term, equivalent to the US credit crunch. It is, of course, conceivable that credit squeeze was once the preferred term in AmE and that, at some point after the coining of the phrase credit crunch in the US in 1967 and before the earliest articles in our newspaper corpus (1984), credit squeeze was still used more widely than credit crunch in BrE. What is certain from our analysis is that, given the recent global credit crunch and massive increase in usage of the phrase in UK newspapers, this distinction between UK and US usage no longer holds true. It is beyond the scope of the current paper to examine the semantics of the two phrases in depth, a task which would require economic as well as linguistic insight, but the phrases credit squeeze and credit crunch do not appear to be as synonymous as the CALD definition implies. It is clear from the OED citations and LOB and BNC concordances that, from the 1950s-1990s, a credit squeeze was a measure applied by a government as a deliberate economic policy. A credit crunch, in its most recent incarnation at least, is something over which governments seemingly have little control.
5. From the credit crunch to the crunch
As we have seen, the vast increase in use of the phrase credit crunch in mid-2007 was mirrored by an increase in the less used credit squeeze, with both phrases being used to describe the same event. During the same period, we have also noted an increase in the elliptical form the crunch and have examined this by using the date filter option in WebCorpLSE to view all occurrences of the phrase in The Guardian from 2007-8. These were then analysed manually and divided into five categories:
i) crunch as a premodifier (e.g. the crunch vote, the crunch game)
ii) the crunch referring to the credit crunch
iii) COME+to the crunch (including the crunch came, etc)
iv) literal crunch (the crunch of gravel)
v) other
A graph of the results (Figure 18) reveals that, whilst the other meanings have remained constant, the crunch as an abbreviated form of the credit crunch has increased in frequency following first occurrence in July 2007.
Figure 18: Frequency of the crunch, 2007-8 (per million words), differentiated by sense: 0=other, 1=premodifier, 2=credit crunch, 3=COME to crunch, 4=literal
The manual analysis also revealed some creative uses of the crunch, where two meanings have been conflated by journalists for effect, including:
1. Analysts at Evolution Securities said the worst was still to come, with the “crunch” for Greene King and other licensed retailers arriving this winter and next spring. (04/07/08)
2. When it comes to the crunch, price matters. (07/12/08)
3. The worry is that when it comes to the crunch multinationals will close overseas plants rather than domestic ones and overseas utilities will not pass on cost decreases arising from oil. (08/12/08)
Examples 2 and 3 here are from articles about the credit crunch and the use of the idiom ‘when it comes to the crunch’ appears to be a conscious decision by the writer, certainly so in 2, a sub-headline. The writer of example 1 uses the COME+to the crunch construction (and signals the play on words with the double quotes around crunch) but then selects arriving instead of coming. We would suggest that this was a deliberate choice by the journalist (or possibly a subeditor) to ensure that the ‘credit crunch’ meaning was not ‘lost’ in the idiom. There appear to be two factors driving the growth in the ‘shorthand’ form the crunch. Firstly, journalists tend to tire of ‘buzz’ phrases quickly and begin to look for ‘snappier’ alternatives. Secondly, the vast increase in usage of the phrase (the) credit crunch over a relatively short period of time has left it (and the associated concept) in the public consciousness to such an extent that the shorthand form the crunch is interpretable instantly, without a gloss.
6. Conclusion
In this paper, we have illustrated how the web can be used to supplement usage examples from standard corpora in diachronic linguistic analysis. When considering a recent linguistic phenomenon such as the rise of credit crunch, the web offers a solution to the restrictions posed by the ‘dearth of corpora of English spanning the whole of the twentieth century, or more particularly spanning the early part of it’ (Leech 2005: 85). We have shown that, through careful data selection and the use of advanced diachronic analysis tools in WebCorpLSE, it is possible to widen the focus and trace the development of a word or phrase across the twentieth century, in British and American English. Our analysis of credit crunch and associated phrases has highlighted the value of Google News as a repository of twentieth century texts, but has also revealed the limitations, for linguistic search, of the search software provided by Google. The ideal solution would be to access the Google News archive via WebCorpLSE or other similar interface, thus allowing full-scale diachronic linguistic search of twentieth century newspaper text. Of course, newspaper corpora are not an ideal data source for the analysis of all kinds of linguistic phenomena but, as Hundt and Mair (1999) point out, newspapers are usually at the forefront of linguistic change and are, thus, a valuable resource in the kind of linguistic analysis carried out in this paper. Our analysis has focussed on usage patterns rather than semantics but the work has allowed us to make some observations about the meaning and status of the phrase credit crunch and of crunch individually, as relates to squeeze. In fact, our analysis of the ‘shorthand’ form the crunch in The Guardian uncovered a meta-linguistic discussion of crunch to which we now refer in conclusion: What exactly is a crunch? Crunch in this context has two meanings, the first being “critical moment”, as in “coming to the crunch”. 
This is the older meaning of the two, almost certainly dating to Winston Churchill’s use of it in a 1939 Daily Telegraph interview. […] The second, more modern meaning is the sense of “squeeze”, arising from paucity – this is how we get “energy crunch”. […] Generally, the two meanings bisect, so the word conveys an urgent scarcity. […] But the two meanings have not yet coalesced entirely. (Zoe Williams, The Guardian, 7 January 2008)20 What this journalist refers to as the ‘more modern meaning’ is the wider AmE use of crunch which was already apparent in the 1992 Frown examples discussed in Section 2. This meaning has apparently made its way into BrE as a result of the massive surge in frequency of credit crunch. Prior to 2008, the ‘paucity’ example, energy crunch, had only appeared in our newspaper corpus 7 times, but there were then 13 occurrences in 2008 alone (3 of which appeared days before Zoe Williams’ comments and are apparently what sparked them). As we noted in
section 3.3, 90% of the immediate left collocates of crunch in our newspaper corpus are accounted for by case variants of credit, so there is little evidence for the wider use of crunch to refer to other kinds of ‘squeeze’ at present. Apart from energy crunch, we do note a handful of occurrences of other crunches in 2007-8 data (supply, pensions, housing, oil). In our analysis of the crunch, we also note two examples where crunch appears to fill a slot more commonly filled by the semantically related pinch:
1. Harriet Harman, has repeatedly and patronisingly said that “ordinary” families are feeling the crunch from rising fuel and food prices (06/05/08)
2. Budget hotels are raking it in as business people feel the crunch (05/10/08)
The second example here could be interpreted as ‘feel the effects of the credit crunch’, but the first is seemingly equivalent to ‘feeling the pinch’ (the use of from rather than through precluding the interpretation ‘feeling the credit crunch’). This use of crunch is reminiscent of its use in a wider financial sense in the AmE concordances from Frown (Figure 2), where it does indeed convey both scarcity and urgency.
This paper has traced the assimilation of the phrase credit crunch into BrE. During the 1990s, the phrase was used periodically but infrequently in UK newspaper texts, reflecting the cyclical nature of the economic phenomenon it describes. As a result, each time the phrase re-emerged, journalists found it necessary to provide a full gloss. Since mid-2007, however, credit crunch has increased in usage to such an extent that the elliptical form the crunch is now interpretable immediately by the UK public. In fact, as a result of the spread of credit crunch, the word crunch is itself beginning to take on new meanings, including some not linked directly to the financial domain. It thus seems unlikely that the phrase credit crunch will require a gloss if it is to re-emerge once again in future years.
Acknowledgement
Development of WebCorpLSE was in part funded by the UK Engineering and Physical Sciences Research Council (EPSRC), grant reference EP/E001300/1.
Notes
1
A recent upgrade to the original WebCorp system (and renaming to ‘WebCorp Live’) has increased processing speed, but the reliance on commercial search engines remains and the range of searches possible is thus still limited. We are maintaining the original WebCorp system for the benefit of those users who wish to conduct ‘live’ searches of the ‘whole’ web, as accessible through commercial search engines.
2
http://www.askoxford.com/worldofwords/wordfrom/wordsoftheyear2008
3
http://www.cambridge.org/elt/corpus/international_corpus.htm
4
For each occurrence, the BNC file and line number are given in parentheses. Concordance lines are grouped according to the publication and article from which they are taken (the latter extracted manually from the source files). The BNC was designed as a synchronic corpus and is not ideally structured for diachronic study. For example, the file ABD contains 9 occurrences of credit crunch but it is not immediately clear that the last 8 of these all occur in the same article. Nor is it clear on exactly which day each newspaper article was published and, in some cases there is no date information at all other than a wide range (e.g. the article from The Scotsman in figure 1: 1985-1994). Results are presented in figure 1 in BNC file order, which is not necessarily date order.
5
The Frown manual (http://khnt.hit.uib.no/icame/manuals/frown) reveals the sources of these examples to be: A25: Press: Reportage: San Francisco Examiner: ‘S.F. Supervisors Crack Down on Use of City Cars’ (06/10/92). C12: Press: Review: Wall Street Journal: ‘The Persecution of Milken’ (25/08/92). G23: Belles Lettres, Biographies, Essays: Ruth Conniff ‘The Culture of Cruelty’, The Progressive (09/92).
6
See http://googleblog.blogspot.com/2008/09/bringing-history-online-onenewspaper.html.
7
There are several limitations, some of which we go on to outline below. The main limitation is that, at present, the compilers of the Google News Archive are focussing their attention, for the earlier periods of history, on US newspapers. This is not so much of a problem in our case, since we are searching for a term which we believe to have originated in the US.
8
http://news.google.com/archivesearch. The searches discussed in this paper were carried out in January 2009.
9
The Google News results pages carry the disclaimer “Dates associated with search results are estimated and are determined automatically by a computer program”. Kehoe (2006) detailed the ways in which a computer program could estimate the authorship date of web texts for use in linguistic analysis, with a high accuracy rate. Newspaper articles contain far more reliable dating information than web pages, so it is unlikely that Google’s program is wildly inaccurate when estimating these dates. It is simply estimating dates for a different purpose.
10
Note that, in most cases, the full text of matching articles is not available. In some cases, a sentence context is available by following the link to the corresponding newspaper archive. In other cases the limited context on the Google News results page is all that is available. Figure 5 shows the
widest context available. There is an apparent OCR error in the last context shown.
11
Though the corpus comes from two different broadsheet newspapers, these are broadly comparable in terms of content, focus and style.
12
Including its Sunday sister newspaper The Observer.
13
Though The Guardian has an archive on its website, this is complete only from 1999 onwards. Only a selection of the 1984-88 articles in our corpus is available on the Guardian site and The Independent does not have a freely accessible archive at all. WebCorpLSE makes limited contexts available from these sources, to registered users only.
14
The # operator in WebCorpLSE matches the three variants ‘credit crunch’, ‘credit-crunch’ and ‘creditcrunch’, a useful option when searching for compounds. As it transpires, the last of these does not occur in our corpus. We use the credit#crunch query syntax throughout this paper. This particular search is also case insensitive.
15
The same is perhaps also true, to a lesser extent, for the Frown corpus, its 1992 AmE texts capturing a credit crunch in the US economy at that time.
16
Line 280 (‘By most definitions, that’s a credit crunch’) is a possible exception. However, we would not class the sentence immediately before this (‘Right now, big buy-outs are impossible: the debt markets are closed until the jam clears’) as a clear definition of the term. The concept of ‘credit crunch’ is not presented in this article as something which may be unfamiliar to the reader.
17
This concordance selection was made possible by the ‘filter’ option in WebCorpLSE, which allows manual removal of individual concordance lines, filtering by date, etc.
18
Some of the categories in this chart are composed of several sub-sections on The Guardian website: COMMENT: Comment, Letters; CULTURE: Artanddesign, Arts, Books, Culture, Film, Music, Stage; LIFE: Lifeandhealth, Lifeandstyle, Cars, Society, Travel, Weekend; NEWS: News, UK News; TECHNOLOGY: Science, Technology; WORLD: EU, Global, International, USA, World.
19
We have included both left and right span 1 collocates for illustrative purposes. WebCorpLSE allows the analysis of right and/or left collocates at spans 1-9 and sentence span. It is possible to conflate the frequencies of case variants, separate part-of-speech variants (e.g. separate entries for crunch_NN and crunch_VV) or view POS collocates only.
20
http://www.guardian.co.uk/business/2008/jan/07/creditcrunch.zoewilliams
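The behaviour described in note 14 for the # operator (matching the spaced, hyphenated and solid variants of a compound, case-insensitively) can be approximated with a regular expression. The sketch below is an assumption about how such matching might be emulated outside WebCorpLSE; the function name and the sample sentence are hypothetical and nothing here reflects the tool's actual implementation.

```python
import re


def hash_query_to_regex(query):
    """Compile a query such as 'credit#crunch' into a regex.

    Each '#' is read as 'a space, a hyphen, or nothing', so the pattern
    covers 'credit crunch', 'credit-crunch' and 'creditcrunch'.
    Matching is case insensitive, as in the search described in note 14.
    """
    parts = [re.escape(part) for part in query.split("#")]
    pattern = r"\b" + r"[ -]?".join(parts) + r"\b"
    return re.compile(pattern, re.IGNORECASE)


rx = hash_query_to_regex("credit#crunch")

# Hypothetical sample text, written for this example.
sample = ("The Credit Crunch deepened; a credit-crunch budget; "
          "'creditcrunch' is unattested; credit card crunch.")
matches = rx.findall(sample)
```

The word boundaries keep the pattern from matching inside longer strings, and an intervening word (as in "credit card crunch") blocks the match because at most one space or hyphen is allowed between the two parts.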
References
Hundt, M. and C. Mair (1999), ‘Agile and uptight genres: The corpus-based approach to language change in progress’, International Journal of Corpus Linguistics 4, 221-242.
Kehoe, A. & M. Gee (2007), ‘New corpora from the web: making web text more “text-like”’, in: P. Pahta, I. Taavitsainen, T. Nevalainen & J. Tyrkkö (eds.) Towards Multimedia in Corpus Studies, University of Helsinki: http://www.helsinki.fi/varieng/journal/volumes/02/kehoe_gee
Kehoe, A. (2006), ‘Diachronic Linguistic Analysis on the Web using WebCorp’, in: A. Renouf & A. Kehoe (eds.) The Changing Face of Corpus Linguistics, Amsterdam: Rodopi, 297-307.
Kehoe, A. & A. Renouf (2002), ‘WebCorp: Applying the Web to Linguistics and Linguistics to the Web’, in: Proceedings of WWW 2002, Honolulu, Hawaii. Electronic publication: http://www2002.org/CDROM/poster/67
Leech, G. and N. Smith (this volume), ‘Change and constancy in linguistic change: How grammatical usage in written English evolved in the period 1931-1991’.
Leech, G. (2005), ‘Extending the possibilities of corpus-based research on English in the twentieth century: A prequel to LOB and FLOB’, in: ICAME Journal No. 29.
Mair, C. (2007), ‘Change and variation in present-day English: integrating the analysis of closed corpora and web-based monitoring’, in: M. Hundt, N. Nesselhauf & C. Biewer (eds.) Corpus Linguistics and the Web. Amsterdam/New York: Rodopi, 233-247.
Nesselhauf, N. (2007), ‘Diachronic analysis with the internet? Will and shall in ARCHER and in a corpus of e-texts from the web’, in: M. Hundt, N. Nesselhauf & C. Biewer (eds.) Corpus Linguistics and the Web. Amsterdam/New York: Rodopi, 287-305.
Renouf, A. (2003), ‘WebCorp: providing a renewable data source for corpus linguists’, in: S. Granger & S. Petch-Tyson (eds.) Extending the scope of corpus-based research: new applications, new challenges. Amsterdam: Rodopi, 38-53.
Renouf, A. & L. Bauer (2001), ‘Contextual Clues to Word-Meaning’, International Journal of Corpus Linguistics, Vol. 5 (2), Amsterdam/Philadelphia: John Benjamins, 231-258.
Renouf, A. (1987), ‘Lexical Resolution’, in: W. Meijs (ed.) Corpus Linguistics and Beyond: Proceedings of the Seventh International Conference on English Language Research on Computerized Corpora. Amsterdam: Rodopi, 121-131.
“To each reader his, their or her pronoun”. Prescribed, proscribed and disregarded uses of generic pronouns in English
Elisabetta Adami
University of Verona, Italy
Abstract
After a brief review of the existing literature, this paper investigates the use of generic pronouns in the academic written sections of several corpora of English, namely, (a) the so-called ‘Brown Family’ of the ICAME collection, (b) six components of the International Corpus of English, (c) the British National Corpus and (d) the current extent of the American National Corpus. The analysis shows that the 1970s and 80s debate about sexism in language has apparently influenced academic writing, to the extent that the frequency of generic he is lower in the post-debate texts, while other alternatives have been introduced, some of which, such as ‘he or she’ are now widely used in academic writing. Furthermore, in a genre which is most concerned with ‘correctness’, some so far proscribed pronouns, like singular they, show a slight increase, while the usually disregarded generic she attests a quite significant use. The data testify to variations in use between BrE and AmE and, less conclusively, between other geographical varieties of English. In addition, the analysis makes some observations on the contexts of use, both in terms of domains and of type of antecedents, of s/he, singular they and of the rare, yet attested, generic she, generally disregarded by the literature on the subject.
1. Introduction
For a number of years grammarians, linguists and teachers have debated which English pronoun should be used to refer individually to gender-indefinite or sex-mixed human categories and roles, in cases like ‘anyone can put aside his, their or her own interests to review a situation dispassionately’.1 When sexism in language became a major topic of debate, both the long-lasting prescription of ‘generic he’ (e.g. ‘anyone can put aside his own interests’) and the proscription of the so-called ‘singular they’ (‘anyone can put aside their own interests’) were questioned and various gender-fair alternatives, such as ‘he or she’ (‘anyone can put aside his or her own interests’), were suggested. Nowadays, the ‘Great He/She Battle’ seems to have exhausted its ink-munitions and, in the absence of an agreed solution, ‘recast the sentence into the plural’ (~ ‘(all) people can put aside their own interests’) and ‘avoid pronouns whenever possible’ (~ ‘(personal) interests can be put aside to review a situation dispassionately’) remain the most frequently suggested strategies. So far, few studies have been carried out to ascertain the current use of generic pronouns, none of which has examined academic writing extensively. In order to fill this gap, after a review of the long-standing debate (section 2), this paper investigates the use of generic pronouns in the academic written sections of
several corpora of English (section 3), namely, (a) the so-called ‘Brown Family’ of the ICAME collection, (b) six components of the International Corpus of English, (c) the British National Corpus, and (d) the current extent of the American National Corpus. The analysis aims to (a) verify the extent of influence on academic writing of what has been termed the ‘Great He/She Battle’2, (b) uncover differences in the use of generic pronouns in different regional varieties of English, and (c) investigate some contexts of use of the newly introduced gender-fair alternatives and, in particular, of the attested, but so far disregarded, generic she.
2. The background
Unlike other Indo-European languages, English has no inflectional category for gender and no gender agreement is needed within and above the noun phrase. In English, ‘[g]ender classes can be differentiated only on the basis of relations with pronouns’ (Huddleston and Pullum 2002: 485) and ‘the choice of pronoun is determined by denotation or reference, not by purely syntactic properties of the antecedent’. Indeed, as is well known, the English pronoun system signals, for the third person singular only, the natural gender of the referent, so that he, his, him, himself stand for antecedents denoting males, she, hers, her, herself stand for antecedents denoting females, and it, its, itself refer to non-human entities.3 Given that the choice of the pronoun follows the sex of the referent, a problem arises when a pronoun is to be used with antecedents referring individually to a mixed-sex human group, role, or category, or to a human entity whose sex is unknown (e.g. the student, the child, someone). Following Latin rules and grammatical tradition, for more than two hundred years, grammarians have retained the masculine as the unmarked case in English (cf. Corbett 1991), hence he has been the prescribed choice to be used in cases like every passenger must show his ID. According to this prescription, he can be both gender-specific (to refer to a male) and gender-inclusive, or generic (to refer to a male + female category). The prescription of generic he has been paired with the proscription of the so-called ‘singular they’, although its use in sentences like everybody raised their hand/s is widespread and well evidenced throughout the history of English in authoritative examples, from William Shakespeare to Jane Austen and George Bernard Shaw; see for example the examples cited in the entry for they in the Merriam-Webster Online Dictionary (2005) (cf.
Bodine 1975 for a detailed history of both generic he prescription and singular they proscription since the 18th Century; Stanley 1978; Sklar 1983; Baron 1986):
[…]
know-148 > live-87 > think-81 > visit-74 > realize-71 > believe-60 > pass-48 > expect-47 > stay-46 > understand-45 > call-40 > rest-42 > take-31 > recognise-24 > accept-22
Table 4: Lexeme types and tokens in the infinitival complements
Verb forms            Types   Tokens   Type-token ratio
come + to V           298     1,353    5
comes + to V          62      141      2
coming + to V         82      234      3
came + to V           270     971      4
COME + to be V-ing    12      14       1
COME + to be V-ed     61      208      3
Total / Average       785     2,913    3
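The ratio column of Table 4 can be recomputed from the type and token counts. The sketch below assumes, as an inference from the printed figures rather than anything stated in the text, that the column reports tokens divided by types, rounded to the nearest whole number, and that the final figure is the mean of the six per-form ratios. The variable and function names are illustrative.

```python
from statistics import mean

# (types, tokens) per verb form, copied from Table 4.
TABLE_4 = {
    "come + to V": (298, 1353),
    "comes + to V": (62, 141),
    "coming + to V": (82, 234),
    "came + to V": (270, 971),
    "COME + to be V-ing": (12, 14),
    "COME + to be V-ed": (61, 208),
}


def tokens_per_type(types, tokens):
    """Average number of tokens per lexeme type (assumed reading of the column)."""
    return tokens / types


# Per-form ratios rounded to whole numbers, as they appear in the table.
ratios = {
    form: round(tokens_per_type(types, tokens))
    for form, (types, tokens) in TABLE_4.items()
}

# Mean of the unrounded per-form ratios, rounded for comparison with the last row.
average_ratio = round(
    mean(tokens_per_type(types, tokens) for types, tokens in TABLE_4.values())
)
```

On this reading, a higher value means that each lexeme recurs more often with that form of COME, so the come and came variants show the most repetition of individual infinitives.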
The lexemes instantiated denote either deliberate actions (visit, take) or involuntary experiences (accept, believe, expect, know, realize, recognise, rest, understand). But they may also be ambiguous between the two interpretations; this depends, for instance, on whether a given verb occurs in the active or passive voice (e.g. call - be called), on whether it is polysemous (pass ‘give’ vs ‘spend time’; see ‘perceive visually’ vs ‘pay a visit’; think ‘reflect’ vs ‘have an opinion’) and on whether it is compatible with both an agentive and an experiential interpretation (‘choosing/happening to’ live/stay). In addition, as is the case with how or why embedding, the literal or aspectual interpretation of the construction relies on cues from the larger co-text. For instance, verbs of involuntary experience may be used to encode goals. In such cases, COME is used literally, and the infinitival complement expresses an outcome that the subject hopes or tries to achieve, even if this is not totally under her control; e.g.:
(50) “We wanted to win, we came to win” (N9119980615)
(51) “[…] when 180,000 fans came to witness the annihilation of the opposition by Nigel Mansell” (N0000000794).
Finally, certain cases may remain ambiguous even when the immediate lexicosyntactic environment is taken into consideration. This applies especially, but not only, to subordinate or embedded clauses; e.g.:
(52) “[…] always point it away from you and anybody else when you come to open it” (E0000002013)
(53) “Although your main rows will be empty when you come to plant out your winter crops […]” (B0000001178)
(54) “‘[…] the nature of the species that he has come to redeem’” (B9000001369)
(55) “When I came to write about the city, it was very challenging […]” (N6000920227)
(56) “Then Saddlers’ Hall joined in the aggravation as he came to challenge the leaders” (N6000920605)
(57) “But it was important not to lose sight of it when the Legal Aid Board came to decide whether to cooperate with a scheme […]” (N2000960405)
(58) “[…] who came to exert a mutually transforming influence upon Africans of his time […]” (B0000001159)
(59) “And you have to put that into the scales when I came to face the British Aerospace decision […]” (N6000940421)
(60) “We come to say that the evil and inhumanity represented by Sandakan […]” (N5000950712).
3.4 Distribution of meanings
Manual coding of the data reveals an uneven distribution of the literal and aspectual meanings of the construction across its syntactic variants (see Table 5). On average, the aspectual meaning is favoured over the literal one (59% vs 39%), but a strong preference for the resultative interpretation applies only to the COME + progressive infinitive and COME + passive infinitive constructions (100% and 98%, respectively). A less marked preference occurs with come + active infinitive. The coming + active infinitive variant displays a strong preference for the literal interpretation (82%), followed by came + active infinitive (60%) and come + active infinitive (39%). Ambiguous cases account for only 2% of the data. The different frequency values for the literal and aspectual meanings are statistically significant (p-value 0.01). Most of the lexemes associated with the encoding of resultative aspect are exclusively reserved for this function; they encode involuntary experiences. A smaller group, however, are also employed in sentences with a literal interpretation (see Table 6).
Table 5: Distribution of literal and aspectual meanings across variant forms of the construction, in percentage values
COME forms         Literal   Aspectual   Other
come               39%       58%         4%
comes              51%       45%         4%
coming             82%       18%         0%
came               60%       37%         3%
COME + be V-ing    0%        100%        0%
COME + be V-ed     2%        98%         0%
Average            39%       59%         2%
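The "Average" row of Table 5 is reproduced by a simple unweighted mean of the six per-form percentages, that is, each variant counts equally regardless of how many tokens it contributes. The short check below illustrates this arithmetic; it is a reading inferred from the printed figures, and the variable names are illustrative.

```python
from statistics import mean

# (literal %, aspectual %, other %) per form of COME, copied from Table 5.
TABLE_5 = {
    "come": (39, 58, 4),
    "comes": (51, 45, 4),
    "coming": (82, 18, 0),
    "came": (60, 37, 3),
    "COME + be V-ing": (0, 100, 0),
    "COME + be V-ed": (2, 98, 0),
}

# Transpose the rows into columns and take the unweighted mean of each column.
literal_avg, aspectual_avg, other_avg = (
    round(mean(column)) for column in zip(*TABLE_5.values())
)
```

Because the mean is unweighted, the two low-frequency variants (COME + be V-ing and COME + be V-ed) pull the aspectual average up as strongly as the far more frequent come and came forms; a token-weighted average would need the raw counts behind each percentage.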
Table 6: Colligation of variants of the construction with lexemes encoding only aspect vs lexemes encoding both motion and aspect
COME forms         Lexemes encoding only aspect   Lexemes encoding both motion and aspect
come               40%                            9%
comes              53%                            5%
coming             29%                            4%
came               37%                            6%
COME + be V-ing    100%                           0%
COME + be V-ed     93%                            0%
Average            59%                            4%
The infinitival complements may encode durative processes – whether stative (e.g. seem), dynamic (e.g. use) or envisaging a natural endpoint (e.g. build a hut) – single instantaneous events (e.g. leave), and repeated events (e.g. make each piece of work; suggest every now and then). Table 7 shows their frequency and distribution in the data with regard to those occurrences in which COME + infinitive unequivocally encodes a resultative meaning. In the various corpus subsets there is a consistent preference for durative events, which on average account for about 60% of the data. Punctual events are represented, making up 30% of the data. Habitual events, instead, are rarely instantiated, i.e. about 3% of the time.
To sum up, the concordances reveal that COME + infinitive is a fairly frequent construction, used mostly in writing, and preferably realized in a few tenses marked for perfective aspect, which may express goal-oriented motion or, more frequently, resultative aspect, especially in combination with the encoding of durative events.
Table 7: Temporal characteristics of events in resultative instances of COME + infinitive
COME forms         Durative     Single instantaneous   Repeated   Other
come               498 (63%)    248 (32%)              6 (1%)     33 (4%)
comes              42 (67%)     17 (27%)               0 (0%)     4 (6%)
coming             25 (58%)     16 (37%)               0 (0%)     2 (5%)
came               213 (59%)    109 (30%)              0 (0%)     39 (11%)
COME + be V-ing    8 (58%)      2 (14%)                2 (14%)    2 (14%)
COME + be V-ed     109 (53%)    85 (42%)               4 (2%)     6 (3%)
Average %          60%          30%                    3%         7%
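The bracketed percentages in Table 7 are row percentages: each count is divided by the total for that form of COME. The sketch below recomputes them from the raw counts, with an illustrative function name. Straight rounding reproduces the printed values in every cell except the durative cell of the COME + be V-ing row, where it yields 57% against the printed 58%; the source does not say how that figure was reached.

```python
# (durative, single instantaneous, repeated, other) counts, copied from Table 7.
TABLE_7_COUNTS = {
    "come": (498, 248, 6, 33),
    "comes": (42, 17, 0, 4),
    "coming": (25, 16, 0, 2),
    "came": (213, 109, 0, 39),
    "COME + be V-ing": (8, 2, 2, 2),
    "COME + be V-ed": (109, 85, 4, 6),
}


def row_percentages(counts):
    """Express each count as a whole-number percentage of the row total."""
    total = sum(counts)
    return tuple(round(100 * count / total) for count in counts)


percentages = {
    form: row_percentages(counts) for form, counts in TABLE_7_COUNTS.items()
}
```

Working from the counts rather than the percentages also makes the very different row sizes visible: the come row rests on 785 resultative instances, the COME + be V-ing row on only 14.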
4. Discussion and conclusion
The COME + infinitive sequence is attested as a frequent syntactic form in a general corpus of English. In its literal usage, it encodes goal-directed motion. In its more frequent aspectual usage, instead, it encodes resultative aspect, that is, the completion of a process or achievement of a goal. In the latter interpretation, it counts as a manifestation of the localist theory of aspect (Brinton 1988: 112-114), according to which, there is “conformity between the spatial meanings of aspect categories and the semantics of the verbs involved” (e.g. ingressive aspect is marked by verbs expressing movement into a situation; p. 95). Resultative COME + infinitive exemplifies the metonymic shift in focus of a motion verb from a spatial meaning to an aspectual meaning, which takes place when it collocates with another verb expressing an action or state (Brinton 1988: 112-114). In general, resultative COME + infinitive manifests the incremental transition of an event to a culmination, or the reaching of a target state, which stands for a metaphorical result-location. It therefore expresses two notions: the development of a process (i.e. a change of state) and the reaching of its endpoint (i.e. the realization of an event). As a result, it can be likened to other structures technically expressing motion but actually denoting change of state, such as going to sleep, falling asleep, putting someone to sleep (Talmy 1975: 234). More specifically, it encodes varying aspectual nuances, depending on the types of verbs it combines with: attainment of a result, with stative verbs like know; inception of a process, with dynamic durative verbs like develop; and realization of a process, with dynamic punctual verbs like arrive.
The interpretation of the construction is strongly influenced by the type of events encoded in its infinitival complement: if this denotes a deliberate act, a literal interpretation is favoured; if it denotes an involuntary experience, an aspectual interpretation is likely to be activated. However, despite this correlation, occasional exceptions are attested: certain instantiations are interpretable either in the sense of ‘get closer so as to’ or in that of ‘decide/happen to’ independently of the verb used (e.g. come to buy/rest/win), and only the surrounding co-text (e.g. time adverbs, temporal clauses, how- or why-embedded clauses) may help disambiguate them. The construction displays clear semantic preferences. Although it is used with a great variety of lexemes, most of these encode involuntary experiences or events interpretable as being determined or influenced by external circumstances. More specifically, the lexemes include verbs of physical experience (e.g. develop, die, exist, fall, find, form, get, happen, listen, live, look, notice, perceive, receive, pass, rest, see, wear), verbs of emotional experience (e.g. adore, cherish, deserve, despise, dread, face, fear, feel, hate, loathe, love, prefer, regret, relish, resent, worship); of cognitive experience (e.g. believe, consider, decide, doubt, expect, figure, find out, know, learn, realize, reflect, regard, rely, respect, think, trust, understand, view, value); verbs of relation, often with inanimate subjects (e.g. become, challenge, characterize, comprise, define, denote, depend, epitomize, focus, make up, mean, personify, possess, represent, resemble, seem, sound); and
verbs denoting the impact caused by the subject, whether animate or inanimate (e.g. challenge, exert, force, outnumber, overshadow, preserve, reign, share). The re-interpretation of an original expression of motion as a lexicosyntactic marker of resultative aspect is fostered by two co-textual features. In its resultative instantiations, the construction tends to encode durative events, as is typical of ingressive aspectualizers, although it is also instantiated with punctual ones. Also, the matrix clause is mostly realized in non-progressive forms, while its complement tends to be rendered as an active infinitive. This is in line with the semantics of the construction: the use of perfective forms is particularly suitable for encoding the completion of a process.12 COME + infinitive can be said to illustrate the partial grammaticalization of a spatial expression into a marker of resultativity. On the one hand, its grammatical re-interpretation is not complete: the construction takes on an aspectual, modal-like meaning in a favourable co-text, although it can still retain the literal meaning of goal-oriented motion, and is at times ambiguous between a literal and an aspectual interpretation. On the other, its specific aspectual meaning is resultative because, through a combination of lexical and syntactic means, the construction encodes the accomplishment of a process, whose resultant state can be inferred, even if it is not overtly expressed.13 The link between the literal and the aspectual meaning of the construction is provided by those examples in which COME is used literally, but is followed by a verb denoting a non-deliberate event; e.g.: (61) “[…] the rain came to bless me with all its clumsy fingers” (S2000910319) (62) “[…] the nose twisted and came to touch the knees” (B9000001254) (63) “Air sacs are where blood vessels come to deposit ‘used’ air (carbon dioxide)” (N0000000740). 
However, only diachronic data can provide definite insights into the origin of the resultative variant of the construction. The exploration of the diachrony of the phenomenon goes beyond the scope of this study, but it is certainly a worthwhile research goal: by consulting other corpora and/or concordances from texts by 18th and 19th century authors, for instance, it should be possible to understand whether resultative COME + infinitive is a recent innovation or a structure that was available to speakers/writers also in the past, but whose frequency of occurrence may have increased in recent times. Additionally, one could trace and compare developmental trends across registers (spoken and written), geographical varieties (e.g. American and British) over time, which could give insights into the overall grammaticalization process (cf. Mair 2008 on infinitival complements in specificational clefts). More generally, the consultation of additional corpora may shed light on the actual spread and degree of prominence of the construction examined. On the one hand, the higher occurrence of COME + infinitive in written sources (see Table 1) may be due to a bias in the design of the BoE, most of whose components are representative of the written register. On the other hand, the
relative scarcity of narrative texts – with their focus on the past – in the BoE may have downplayed the magnitude of the resultative structure (see Table 3 about the preference of the construction for perfect and past tenses). Either way, it is only by comparing the findings reported here with more data, from varied sources, that the issue can begin to be settled. A step in this direction has already been taken. I have looked at the occurrence of resultative COME + infinitive in various components of the International Corpus of English (ICE; Gesuato 2008a, 2008b). Although fewer instances of the construction have been retrieved, the same kinds of co-textual preferences and phraseological associations have been identified there as in the BoE, but with one exception. The Great Britain component more frequently instantiates the literal than the aspectual meaning, while the Hong Kong component instantiates both to the same degree. The ICE data, therefore, seems to suggest that the native variety of British English is not at the forefront of the aspectual development of the construction, which runs counter to what one would expect in general and also to the BoE findings, where the aspectual meaning is more firmly established than the literal one. Even this limited comparison, therefore, reveals that, while the use of corpora is extremely useful in finding out what grammatical and textual patterns characterize a given expression, no single corpus will actually reveal the whole picture of a given linguistic phenomenon. Only by comparing findings from different corpora is it possible to explore how the performance of single individuals can modify the competence of groups of individuals over time. In addition, it may be advisable to compare corpus data with elicited data: the most frequent sense of a given form is not necessarily its most prototypical meaning, as tested against native speakers’ judgments (Leech 2008). 
An interesting finding from the study is that resultative COME is more common than literal COME (see Table 5), and that the difference is statistically significant. This suggests that the grammaticalization process affecting COME + infinitive is well under way. Indeed, indirect support for this interpretation is provided by the patterns of comparable grammaticalizing constructions based on motion verbs. For instance, non-progressive forms of GO followed by an active infinitive in the BoE have been found to encode the literal meaning of ‘moving away so as to’ 88% of the time, and to instantiate related, metaphorical meanings outside the domain of tense (‘be transferred and used’, ‘contribute to’, ‘succeed in’ and ‘proceed to’) only 12% of the time (Gesuato forthcoming). Similarly, the BoE has been found to instantiate the have/has/had been to V construction meaning ‘being back from V-ing’ only marginally (i.e. with 41 unambiguous examples; Gesuato 2008c). According to Heine and Kuteva (2002: 2), there are four mechanisms involved in grammaticalization: “(a) desemanticisation (or “semantic bleaching”) – loss in meaning content, (b) extension (or context generalization) – use in new contexts, (c) decategorialisation – loss in morphosyntactic properties characteristic of
lexical or other less grammaticalised forms, and (d) erosion (or “phonetic reduction”) – loss in phonetic substance.” The BoE data suggests that resultative COME + infinitive has reached the second stage. However, only a comparison of instances of motional and aspectual COME collected from a speech corpus could reveal whether the resultative examples are also characterized by phonetic reduction with respect to the literal ones. The same authors (pp. 318-319) also show how cross-linguistically the verb COME can be grammaticalized into a resultative marker to denote a change of state, like other aspectual markers (e.g. go, go to, finish, leave). Their survey, therefore, lends support to an interpretation of the non-literal COME + infinitive as a marker of resultative aspect.
In conclusion, the role of resultative COME + infinitive in the system of the English language is similar to that of other resultative constructions and lexical aspectualizers: it contributes to the encoding of aspect, which is not fully grammaticalized (i.e. not systematically realized through morpho-syntax; cf. Hopper 1979: 239-40; Horrocks, Stavrou 2003: 299). More specifically, COME + infinitive signals the completed development of a process, although this completion is presented not as already achieved, but as an outcome to be achieved, projected into a later stage. Therefore, while resembling ingressive aspectualizers denoting the beginning of durative processes (Brinton 1988), resultative COME actually functions as a forward-oriented or prospective marker of perfective aspect, which expresses the realization of an event as dependent on the conclusion of an introductory phase.
Notes
1 Thanks go to Alberto Mioni and an anonymous reviewer for helpful comments and suggestions on an earlier draft of this paper.
2 Here and elsewhere, made-up examples appear only in double quotes, while examples from the corpus consulted are followed by the specific text reference.
3 There are different views on what syntactic forms count as complex predicates. According to Butt’s (1997: 108) and Mohanan’s (1997: 432) definitions, complex predicate constructions combine two or more semantically predicative elements, which contribute arguments into the flat grammatical function of a single, simple predicate.
4 The semantics of COME, however, has been examined (Goddard 1997).
5 For other types of resultatives, see Horrocks, Stavrou (2003) and Nedjalkov (1988).
6 Otherwise, if the question is made relevant to the larger event encoded in the sentence, the meaning conveyed will actually be resultative (e.g. “How did it happen that she bought a house?”).
7 Cf. Bertinetto and Squartini’s (1995) description of gradual completion verbs.
8 The resultative meaning of COME is not necessarily dependent on the occurrence of an infinitival complement. It may also be instantiated when followed by an indirect object that encodes a state, event or activity, rather than a physical destination; e.g.: COME + to a decision/conclusion/view; + to an end/stop/halt/standstill; + to power/prominence; + into being/existence/operation/effect; + into view/sight. In addition, it can be activated when used with a predicative adjective denoting a resultant state; e.g. COME + apart/unstuck/undone/untied, + true. Finally, it is also encoded in COME-based phrasal verbs, albeit with specific nuances; e.g. COME IN + first/second; + useful/handy; COME OFF + well/badly/worst. It thus parallels other English motion verbs, in that it can be used both literally and non-literally in similar syntactic environments (see section 1).
9 Cf. Klein’s (1994) characterization of aspect in terms of the interaction of source and target states and their relevant pre- and post-time (ch. 6), as well as his description of the meaning of COME along the same lines (note 4 on p. 227).
10 In these and following examples, underlining signals added emphasis.
11 Cf. Brinton’s (1988: 43) description of the meaning nuances conveyed by the perfective depending on the verb type it is applied to.
12 Brinton (1988: 16), for instance, explicitly states that the simple present and the simple past are markers of perfective aspect.
13 Cf. Bertinetto’s (1986: 98, 274 et passim) definition of resultative as ‘+durative’ and ‘+telic’.
References
Alsina A., Bresnan J., Sells P. (eds.) (1997), Complex Predicates. Stanford: CSLI.
Baicchi A. (2007), ‘‘He Smiled me into Love’. The Subsumption Process of the Intransitive-Transitive Migration’, paper presented at the 23rd AIA (Associazione Italiana di Anglistica) Conference ‘Forms of Migration, Migration of Forms’. University of Bari, 20-22 September 2007.
Bertinetto P.M. (1986), Tempo, Aspetto e Azione nel Verbo Italiano. Firenze: Accademia della Crusca.
Bertinetto P.M., Squartini M. (1995), ‘An Attempt at Defining the Class of ‘Gradual Completion Verbs’’, in: P.M. Bertinetto, V. Bianchi, J. Higginbotham and M. Squartini (eds.) Temporal Reference, Aspect and Actionality, 1: Semantic and Syntactic Perspectives. Torino: Rosenberg and Sellier. 11-27.
Brinton L.J. (1988), The Development of English Aspectual Systems. Cambridge: CUP.
Bussmann H. (ed.) (1996), Routledge Dictionary of Language and Linguistics, vol. II. London: Routledge.
Butt M. (1997), ‘Complex Predicates in Urdu’, in: A. Alsina, J. Bresnan, P. Sells (eds.) Complex Predicates. Stanford: CSLI. 107-150.
Carrier J., Randall J.H. (1992), ‘The Argument Structure and Syntactic Structure of Resultatives’, Linguistic Inquiry, 23(2): 173-234.
Claudé P. (1990), ‘La Biprédication Résultative en Anglais’, Sigma: Linguistique Anglaise – Linguistique Générale, 14: 143-56.
Eastlack C.L. (1967), ‘Catenative Verbs in Portuguese and English: A Contrastive Study’, Estudos Lingüísticos, 2(1-2): 43-56.
Fang A.C. (1995), ‘Distribution of Infinitives in Contemporary British English: A Study Based on the British ICE Corpus’, Literary & Linguistic Computing, 10(4): 247-57.
Gesuato S. (2008a), ‘The Resultative Aspectualizer COME + to_Infinitive in Five Varieties of English’, paper presented at the 4th IVACS (Inter-Varietal Applied Corpus Studies International) Conference. University of Limerick, Ireland, 13-14 June 2008.
Gesuato S. (2008b), ‘Motional and Aspectual Usage of COME + To-infinitive in Native and Non-native English Varieties’, in: Associação de Estudos de Investigação Científica do ISLA-Lisboa (ed.) TaLC8 Lisbon, Proceedings of the 8th Teaching and Language Corpora Conference, 3-6 July 2008, Offsetmais Artes Gráficas S.A., 379-385.
Gesuato S. (2008c), ‘Corpus Data and Elicited Data: The Case of HAVE BEEN + to_infinitive’, paper presented at the 9th ESSE (European Society for the Study of English) Conference. University of Aarhus, Denmark, 22-26 August 2008.
Gesuato S. (forthcoming), ‘GO to V: Literal Meaning and Metaphorical Extensions’, in: M. Hundt, D. Schreier, A. Jucker (eds.) Proceedings of the 29th ICAME (International Computer Archive of Medieval and Modern English) Conference ‘Corpora: Pragmatics and Discourse’. University of Zurich, Ascona, Switzerland, 14-18 May 2008.
Goddard C. (1997), ‘The Semantics of Coming and Going’, Pragmatics, 7(2): 147-62.
Goldberg A.E., Jackendoff R. (2004), ‘The English Resultative as a Family of Constructions’, Language, 80(3): 532-68.
Heine B., Kuteva T. (2002), World Lexicon of Grammaticalization. Cambridge: CUP.
Hinrichs E., Kathol A., Nakazawa T. (eds.) (1998), Complex Predicates in Nonderivational Syntax, vol. 30 of Syntax and Semantics. San Diego: Academic Press.
Hoekstra T. (1988), ‘Small Clause Results’, Lingua, 74: 101-39.
Hopper P.J. (1979), ‘Aspect and Foregrounding in Discourse’, in: T. Givón (ed.) Discourse and Syntax, vol. 12 of Syntax and Semantics. New York: Academic Press. 213-41.
Horrocks G., Stavrou M. (2003), ‘Actions and their Results in Greek and English: The Complementarity of Morphologically Encoded (Viewpoint) Aspect and Syntactic Resultative Predication’, Journal of Semantics, 20: 297-327.
Huddleston R., Pullum G.K., Bauer L. (eds.) (2002), The Cambridge Grammar of the English Language. Cambridge: CUP.
Ike-Uchi M. (1994), ‘English Resultative Constructions and Wh-movement’, in: S. Chiba et al. (eds.) Synchronic and Diachronic Approaches to Language. A Festschrift for Toshio Nakao on the Occasion of his Sixtieth Birthday. Tokyo: Lieber Press. 361-78.
Ionescu D. (1994), ‘Resultative Small Clauses’, Revue Roumaine de Linguistique, 39(3-4): 353-69.
Klein W. (1994), Time in Language. London/New York: Routledge.
Kudrnáčová N. (2005), ‘On One Type of Resultative Minimal Pair with Agentive Verbs of Locomotion’, in: J. Čermák, A. Klégr, M. Malá, P. Šaldová (eds.) Patterns: A Festschrift for Libuše Dušková. Prague: Charles University. 107-14.
Leech G. (2008), ‘Frequency is Important – and Challenging: A Present-day Corpus Perspective’, paper presented at the 8th TALC (Teaching and Language Corpora) Conference. University of Lisbon, 3-6 July 2008.
Mair C. (1990), Infinitival Complement Clauses in English. A Study of Syntax in Discourse. Cambridge: CUP.
Mair C. (2008), ‘Right in the Middle of the S-shaped Curve: On the Spread of Specificational Clefts in 20th Century English’, paper presented at the 9th ESSE (European Society for the Study of English) Conference. University of Aarhus, 22-26 August 2008.
McIntyre A. (2001), ‘Argument Blockages Induced by Verb Particles in English and German: Event Modification and Secondary Predication’, in: N. Dehé, A. Wannen (eds.) Structural Aspects of Semantically Complex Verbs. Berlin/Frankfurt/New York: Peter Lang. 131-64.
Mohanan T. (1997), ‘Multidimensionality of Representation: NV Complex Predicates in Hindi’, in: A. Alsina, J. Bresnan, P. Sells (eds.) Complex Predicates. Stanford: CSLI. 431-72.
Müller S. (2002), Complex Predicates, Verbal Complexes, Resultative Constructions, and Particle Verbs in German. Stanford: CSLI.
Müller S. (2005), ‘Resultative Constructions – Syntax, World Knowledge, and Collocational Restrictions’, Studies in Language, 29(3): 651-81.
Nedjalkov V.P. (ed.) (1988), Typology of Resultative Constructions. Amsterdam/Philadelphia: John Benjamins.
Quirk R., Biber D. (eds.) (1999), Longman Grammar of Spoken and Written English. London: Longman.
Rosen S.T. (1990), Argument Structure and Complex Predicates. New York/London: Garland Publishing.
Rowling J.K. (2007), Harry Potter and the Deathly Hallows. London: Bloomsbury.
Shirai Y. (1998), ‘Where the Progressive and the Resultative Meet. Imperfective Aspect in Japanese, Chinese, Korean and English’, Studies in Language, 22(3): 661-92.
Stevens W.J. (1972), ‘The Catenative Auxiliaries in English’, Language Sciences, 23: 21-5.
Stewart O.T. (1998), ‘Evidence for the Distinction between Resultative and Consequential Serial Verbs’, in: B. Bergen, M. Plauché, A. Bailey (eds.) Proceedings of the Twenty-fourth Annual Meeting of the Berkeley Linguistics Society, February 14-18, 1998, General Session and Parasession on Phonetics and Phonological Universals. Berkeley, CA: Berkeley Linguistics Society. 232-43.
Talmy L. (1975), ‘Semantics and Syntax of Motion’, in: J.P. Kimball (ed.) Syntax and Semantics, vol. 4. London: Academic Press. 181-238.
Tortora C.M. (1998), ‘Verbs of Inherently Directed Motion are Compatible with Resultative Phrases’, Linguistic Inquiry, 29(2): 338-45.
Whelpton M. (2001), ‘Elucidation of a Telic Infinitive’, Journal of Linguistics, 37(2): 313-37.
Whelpton M. (2002), ‘Locality and Control with Infinitives of Result’, Natural Language Semantics, 10: 167-210.
Yamada Y. (1987), ‘Two Types of Resultative Constructions’, English Linguistics: Journal of the English Society of Japan, 4: 73-90.
A corpus-based analysis of invariant tags in five varieties of English
Georgie Columbus
Department of Linguistics, University of Alberta
Abstract
Discourse markers are a feature of everyday conversation – they signal attitudes and beliefs to their interlocutors beyond the base utterance. One particular type of discourse marker is the invariant tag (InT), for example New Zealand and Canadian eh. Previous studies of InTs have clearly described InT uses in individual language varieties. Such studies have focused on sociolinguistic features and on sociolinguistic functions of single markers. However, InTs as a class have not yet been fully described, and the variety of approaches taken (corpus- as well as survey-based) means that cross-varietal or cross-linguistic comparison cannot be conducted with the results thus far. This study investigates InTs in five varieties of English from a corpus-based approach. It lists the utterance-final InTs available in NZ, British, Indian, Singapore and Hong Kong English through their occurrences in their respective International Corpus of English (ICE) corpora, and compares frequency of usage across the varieties. The quantitative analysis offers a clearer overview of the InT class for descriptive grammars, and clarifies some usage aspects for ESL/EFL pedagogy. Finally, the results offer an insight into the global status of InTs in English.¹
1. Introduction
Question tags have long been the subject of sociolinguistic and variationist studies. Canonical question tags, such as aren’t you? and isn’t it?, have received much attention in linguistics, perhaps due to their curious syntactic and semantic properties, including inversion and polarity. In the last few decades especially, invariant tags (InTs) such as huh and innit have been equally researched and documented. InTs provide similar attitudinal and evidential meanings above the level of the proposition as canonical tags, but do not undergo changes in structure or polarity. Yet while canonical question tags are the focus of much ESL/EFL clarification in syllabi and texts, their invariant counterparts are rarely formally taught. This imbalance, and the prevalence of one particular tag (eh) in both my home and adopted countries, formed the impetus to investigate the meanings and usage patterns of InTs in different English varieties.
1.1 Previous InT studies
Much research has been undertaken on InTs in particular varieties and/or dialects of English. Most of this research has been within the realms of Conversation
Analysis, focusing on sociolinguistic patterns of use and/or pragmatic contributions of tags in a speech community. Sociolinguistic factors such as distribution of the markers within a speaker population have been investigated by Stubbe and Holmes (1995), Andersen (1997, 1998), Algeo (1998), Stubbe (1999), and Starks, Thompson and Christie (2008). Other studies have focused on InT meaning and functions, for example Holmes’ (1982) description of both canonical and non-canonical (i.e. invariant) tags in New Zealand English. Holmes divided the items into hearer- and speaker-oriented categories, and offered a list of potential functions. Meanwhile, Norrick’s (1995) study of US English hunh looked more at the pragmatic features, such as use indicating sarcasm or irony, and use in (semi-)fixed expressions. Berland (1997), on the other hand, focused on teenagers’ use of a small set of InTs in the Bergen Corpus of London Teenage Language (COLT). Lastly, the semantics and pragmatics of Canadian eh have been characterised by several researchers, notably Avis (1972), Love (1973), Gibson (1977) and Gold (2005). Each of these studies clearly defines tag uses in its respective variety, but taken together they provide heterogeneous classifications of English InTs. Thus, despite this depth in tag description, it is not feasible to compare the tags across varieties using these results, as the studies have been carried out in single varieties and with varying methodologies and sociolinguistic/pragmatic aims. This study, then, aims to work toward such a comparison using five varieties of spoken English. It focuses on the frequency of InTs in the varieties to gain some indication of usage and preferences regarding tags, in order to shed light on global usage of InTs.
1.2 Research goals
This study aims to describe the relative frequencies of the utterance-final tags in BrE, IndE, NZE, Hong Kong English (HKE) and Singapore English (SinE). It investigates InT selection and usage compared across and within the five varieties.
1.2.1 Variety selection
BrE, NZE, IndE, HKE and SinE were chosen as the varieties for this study due to their diversity in geography, linguistic history and speaker populations. It seemed desirable to limit the comparisons to varieties within the same ‘type’. That is, in dictionaries and particularly in ESL/EFL, North American and British English are commonly the divisions used for items with varietal distinctions, before subdivision into (loosely) national varieties and their dialects (if at all). BrE was chosen as a globally-recognised ‘type’ of English. Also, BrE as a variety is noteworthy for its dialectal diversity, and many of its dialects are represented in the ICE corpora used for this study. Given the incomplete status of the relevant corpora, no American-type varieties were considered. NZE is considered to be of the British ‘type’, but has a much smaller speaker base and range of dialects. Additionally, where ESL/EFL materials are concerned, NZE is comparatively under-described. This is also true of IndE, and as a variety with diverse contact
languages and large migrant communities in English-speaking countries it can provide an insight into English as a lingua franca. Furthermore, the high business profile of India makes IndE a common language in business situations, and therefore worthwhile to define for EAP/Business English purposes as much as for purely linguistic reasons. For similar reasons, two other outer circle varieties were chosen. These were SinE and HKE, which share related native contact languages. HKE is used in a prominent global business centre, while SinE has been the subject of much scholarly research. A comparison of which tags are shared in two Englishes that have close L1 connections allows insight into the variation possible between such varieties. Most importantly for the comparative aspect of the study, each of these varieties was available as an ICE corpus, with similar time periods for collection and near-identical mark-up protocols.
1.3 Invariant tag definition
To determine which items are indeed tags, a variety of definitions from previous canonical and invariant tag studies, such as Holmes (1982), Meyerhoff (1992), Stubbe and Holmes (1995), Berland (1997), and Andersen (1997, 1998, 2001) were considered. The working definition employed for this study was extrapolated from Biber et al.’s (1999) definition of what they term ‘response elicitors’ (RE) in the Longman Grammar of Spoken and Written English (LGSWE). This stated that REs have a “speaker-centred role, seeking a signal that the message has been understood and accepted” (p. 1089).² Yet while this includes gestural responses, only one RE (right) is noted as requiring a verbal response. Indeed, the response-eliciting function of these items is not universally accepted (cf. Holmes 1982, Berland 1997, Andersen 1998). The identification of InTs in this study utilised a slightly broader definition, in that the ‘message’ being signalled was considered to include attitudinal information as well as proposition-checking information. Furthermore, I assumed no response was required, having no visual data to check this. The classification also excluded non-discourse markers and non-InT homonyms such as yeah where it expressed surprise or affirmation, and right where it was confirmation or part of a direction. The key criteria were that the propositional meaning did not change when the item was left out (as for all discourse markers, cf. Schiffrin 1987), and that the item could function with similar (though not identical) uses as a canonical tag. The tags in each variety were analysed individually, and only those items which fulfilled the criteria above were included for this study (hence the exclusion of isn’t it?/is it? when in canonical use with grammatical agreement, and of no in varieties that did not have the InT function). The definitions employed were corroborated by each stage of analysis. Finally, the frequency analysis deals with only utterance-final InTs.
Non-clausal, utterance-initial and utterance-medial InTs in BrE, IndE and NZE are described along with the utterance-final tags in terms of frequency and meanings/functions in Columbus (in revision) and with respect to the most common meanings in Columbus (forthcoming).
2. Methods
The study was conducted using the International Corpus of English corpora for British English (ICE-GB, Survey of English Usage, University College London, 1998), Indian English (ICE-IND, Shivaji University, Kolhapur, and the Freie Universität Berlin, 2002), New Zealand English (ICE-NZ, School of Linguistics and Applied Language Studies, Victoria University of Wellington, 1999), Hong Kong English (ICE-HK, Hong Kong Polytechnic University, The University of Hong Kong and The Chinese University of Hong Kong, 2006) and Singapore English (ICE-SIN, The Department of English, The National University of Singapore, 2002). Each corpus was restricted to the Private Conversation texts (S1A-001 to S1A-100), amounting to 200,000 words per variety. They were analysed based on the transcriptions only, due to the lack of sound files at the time the study was undertaken. The advantage of using these corpora was that they had the same text categories and almost identical mark-up conventions. They were also collected during the same time-period, making their content highly comparable. However, there were some differences in the level of notation seen, as the ICE-GB text was imported into Wordsmith 4 from its custom-made mining program ICE-CUP, which altered the visible mark-up.³
The search itself involved narrowing down a set of discourse markers to a set of potential InTs using the (‘discourse marker’) tag in ICE-GB’s corpus tool, ICE-CUP (Survey of English Usage, University College London, 1998). Additionally, potential discourse markers in ten randomly selected files were analysed manually from ICE-HK, ICE-SIN, ICE-NZ and ICE-IND; these corpora were available in marked-up text but without the discourse marker tagging. From the original search set of approximately fifty potential InTs, seventeen items which all appeared with InT functions in the utterance-final position were selected for this study.
These were accha, ah, ahn, eh, is it, isn’t it, lah/la, na, no, OK/okay, right, see, wah, yeah, yes, you know, and you see.
2.1 A note on the inclusion of lah/la as an InT
Before describing the search methodology, a brief comment on tag selection is necessary. It may not be obvious why the particle la/lah has been classified as a non-canonical tag in this study. Certainly, much has been written on la/lah over the past thirty years, beginning with Richards and Tay (1977) and Kwan-Terry (1978). To understand why la/lah can be a tag we need to return to our basic definitions of invariant tags and tag questions. Invariant question tags are considered to be like the fixed forms of canonical question tags, with similar functions. However, research on both of these discourse marker types has not shown the question function to be the primary use and/or meaning of the tag. Uses such as emphasis, softening and irony/sarcasm are prevalent in the literature (cf. Holmes 1982, Berland 1997, Norrick 1995 inter alia). Similarly, a not insignificant number of the items considered to be invariant tags or response elicitors in the varying descriptions are interjection-based discourse markers of
some form, such as hey and eh, a descriptor which is also used for la/lah (Lee 2004). Bell and Peng Quee Ser (1983) describe la/lah as a marker of emphasis or contrast, “drawing attention to the literal meaning – the semantic sense, overt and explicit – of an utterance or part of one” (p. 13). Likewise, Kwan-Terry (1978) discusses the marker’s use for persuasion, approval, as a softener or for authority, and for positive and negative humour, as well as for uncertainty and suggestions. If we take these definitions as guides, then the classification of la/lah as an InT is not unjustified – it fits with the prior classifications of other InTs and follows from traditional descriptions of the marker(s). It should be noted again, however, that the classifications of an item as an InT in this study took general descriptions and definitions from previous studies into account as background information only. The only criterion used in the analysis and classification process was the definition provided in 1.3.
2.2 Search technique
The initial search was conducted in Wordsmith 4, utilising the tag symbols for a start of utterance in the ICE mark-up. This was the reason for the elimination of non-final InTs in this study. Some of the tagging was not in the imported version of the ICE-GB concordances, which meant that the results for the BrE data may have been under-reported. With the concordances in the Wordsmith tool, the search items were then entered followed by the start of utterance mark-up symbol. This query returned the start of each utterance, allowing easy visual inspection of the utterance-final instances of the seventeen InTs named above. Table 1 gives examples of InT concordances in each of the five varieties. To ensure that other factors such as marked-up pauses and anthropophonics, which share the initial tag symbol, were not falsely included as ‘utterance-final’, each concordance line was manually checked to eliminate the non-final occurrences. A tally for the InTs in each variety was performed using simple Excel functions.
Table 1: Examples of concordances for BrE, IndE, NZE, HKE and SinE InTs
BrE:
C: She looks she looks Puerto Rican or something is it
B: There was this bloke in the in the cafe in Cambridge called the Steps really weird OK
C: I wrote I turned up the first night right
C: But then I ‘ve had this about twenty years with the same thing on see
HKE:
<S1A-006#268:1> A: Oh it ‘s exhausting you know
Z: Who is twenty-five when he got married
A: Twenty-one la
B: Pronunciation you know not in English I think isn’t it
A: It’s tape recording conversations okay
A: Yeah yeah uh may be you will say uh you two always have argue why still can last for four years right
SinE:
<S1A-099#40:1> A: Ya but other than uhm workwise I guess like I manage to buckle down lah
B: You read for yourself is it
A: So when the second application came out I applied again and then notwithstanding the fact that they told us that those who have been rejected or have been offered a place don’t have to apply again they will not consider us you see
IndE:
B: ...everytime the team keeps losing I mean something should be done isn’t it
C: But again that caste certificate problem has arrived na
B: She is very she is very bold no
NZE:
<S1A-072#400 > P: He’s pretty intelligent eh
A: It’s like you see things
T: True eh
A: Yeah and all from your sitting room window yeah
N: They just go though you know
N: They’re only going through a process eh
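The search logic described in section 2.2 can be sketched in a few lines of Python. This is only an illustration, not the Wordsmith workflow used in the study: it assumes a simplified mark-up in which `<#>` opens each utterance and `<,>` marks a pause, and the sample string and the function name `tally_final_tags` are hypothetical. The naive pattern accepts a candidate tag followed by any mark-up symbol, which shows why pauses had to be weeded out by hand; the strict pattern accepts only a tag that stands directly before the next utterance opener (or the end of the text).

```python
import re
from collections import Counter

# The seventeen items from the study, with spelling variants listed separately.
TAGS = ["accha", "ah", "ahn", "eh", "is it", "isn't it", "lah", "la", "na", "no",
        "ok", "okay", "right", "see", "wah", "yeah", "yes", "you know", "you see"]

# Spelling variants are folded into the labels used in the paper.
LABELS = {"la": "lah/la", "lah": "lah/la", "ok": "OK/okay", "okay": "OK/okay"}

# Longest alternatives first, so that longer variants such as "okay" are tried before "ok".
_ALTERNATIVES = "|".join(re.escape(t) for t in sorted(TAGS, key=len, reverse=True))

# Assumed, simplified mark-up: "<#>" opens an utterance, "<,>" is a pause.
STRICT = re.compile(rf"\b({_ALTERNATIVES})\b\s*(?=<#>|$)", re.IGNORECASE)
NAIVE = re.compile(rf"\b({_ALTERNATIVES})\b\s*(?=<|$)", re.IGNORECASE)


def tally_final_tags(text: str, strict: bool = True) -> Counter:
    """Count candidate tags that stand immediately before the next mark-up symbol."""
    pattern = STRICT if strict else NAIVE
    counts = Counter()
    for match in pattern.finditer(text):
        tag = match.group(1).lower()
        counts[LABELS.get(tag, tag)] += 1
    return counts


# Utterances 1, 2 and 4 are NZE lines from Table 1; utterance 3 adds an
# invented pause and continuation to "True eh" to produce a false hit.
SAMPLE = ("<#> He's pretty intelligent eh "
          "<#> It's like you see things "
          "<#> True eh <,> I suppose "
          "<#> They just go though you know")

if __name__ == "__main__":
    print(tally_final_tags(SAMPLE))                # utterance-final hits only
    print(tally_final_tags(SAMPLE, strict=False))  # also counts "eh" before the pause
```

Even the strict pattern only narrows the candidates: deciding whether a hit is functioning as an InT rather than, say, an affirmative yeah still requires reading each line in context.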
For the classification of discourse markers as InTs, context was relied on for clarification. This was due to the written nature of the transcriptions (searchable sound files being unavailable) and the lack of intonational mark-up. To a limited extent, mark-up of punctuation was used where possible to determine the utterance position and classification of an item. A question mark offered a clear indicator of question intonation in the file, but was not used in the BrE, HKE, SinE and IndE data, and thus may have led to under-reporting of question uses. Again, full context was used to clarify the utterance’s intent. Interruption data, where the marker occurred at the end of an utterance which overlapped with another speaker’s, could not necessarily be counted as being utterance-final in intention, and thus was only included where this intention was clear. For
example, items were included when other mark-up indicated an utterance break, such as when pauses were indicated, or when new utterances began which were also included in the overlap. Finally, the rechecking processes involved during the analysis stage of comparison did offer a chance to confirm and/or adjust previous categorisations and identifications. In general, the classifications were highly reliable across the varieties and tags. It should be noted here that, as with many discourse marker studies, a certain amount of subjective analysis is necessary in determining which items to include for analysis as InTs. It is understood that this does not allow for complete confidence in the results given below, but is a fact reluctantly accepted as necessary for this type of study (cf. Berland’s lament for the same, 1997).
3. Results
The raw occurrences of the seventeen items in NZE, BrE, HKE, SinE and IndE are given in Figure 1 and Table 2. The only items to reach over fifty occurrences in utterance-final position in any of the 200,000-word corpora were eh, yeah, la, right, you see, no, na and you know. Of the seventeen items which occurred utterance-finally, only four occurred in all five varieties: okay/OK, right, you know and you see. Of these, right and, to a lesser extent, you know have the highest frequencies. Another major point to be drawn from Figure 1 is that the total number of utterance-final InTs is not comparable across these varieties: BrE has 268 and HKE has 288, while NZE has almost fifty percent more at 386. IndE, at 696, has almost twice as many as NZE, and more than NZE combined with either BrE or HKE. SinE, however, has the highest usage of InTs, with 776.
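The comparisons between totals in the paragraph above can be checked with a few lines of arithmetic. The Python sketch below uses only the five totals reported in the text; the per-10,000-word rate is an added illustration that assumes each corpus contains exactly 200,000 words, and the helper names are hypothetical.

```python
# Totals of utterance-final InTs per variety, as reported in the text.
totals = {"BrE": 268, "HKE": 288, "NZE": 386, "IndE": 696, "SinE": 776}

CORPUS_SIZE = 200_000  # assumption: each corpus has exactly this many words


def ratio(a, b):
    """How many times larger the total for variety a is than for variety b."""
    return totals[a] / totals[b]


def per_10k(variety):
    """Occurrences per 10,000 words, under the corpus-size assumption."""
    return totals[variety] / CORPUS_SIZE * 10_000


print(f"NZE / BrE:  {ratio('NZE', 'BrE'):.2f}")
print(f"IndE / NZE: {ratio('IndE', 'NZE'):.2f}")
print("IndE > NZE + BrE:", totals["IndE"] > totals["NZE"] + totals["BrE"])
print("IndE > NZE + HKE:", totals["IndE"] > totals["NZE"] + totals["HKE"])

# List the varieties from lowest to highest total.
for variety in sorted(totals, key=totals.get):
    print(f"{variety:5s} {totals[variety]:4d}  {per_10k(variety):5.1f} per 10,000 words")
```

Because the corpora are taken to be of equal size, the normalised rates preserve the same ranking as the raw totals.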
Figure 1: Frequencies for InTs in the five varieties with a threshold of 20 raw occurrences
Table 2: Raw frequencies of 17 utterance-final InTs. Shaded cells indicate the InTs found in all varieties. Bold numbers indicate the most frequent tag
Tag accha ah ahn eh is it isn't it lah/la na no OK/okay right see wah yeah yes you know you see Total
BrE 6 1 1 0 7 8 2 34 7 171 31 268
IE 2 18 10 0 12 33 109 237 12 12 2 60 4 158 27 696
NZE 292 0 0 1 7 11 2 35 2 18 18 386
SinE 5 47 14 241 1 14 236 0 7 0 0 110 101 776
HKE 1 25 4 14 5 24 110 0 5 24 5 70 6 288
We now turn to the more detailed comparisons given in Figures 2 and 3. Figure 2 shows only the frequencies for the seven InTs which are shared by BrE, NZE and IndE. We see here that IndE has a high raw frequency of the InT no, and also makes high use of you know. BrE also has frequent use of you know, while in NZE none of these shared InTs is preferred. Instead, NZE has a high use of eh, as seen in Figure 1, and a use of yeah comparable to that of BrE.

In Figure 3, we see the full range of InTs in HKE and SinE. The InT frequencies for these varieties show more dissimilarities than likenesses. The two varieties do not pattern alike, as might have been expected given their similar language contact situations. Nor does either show a pattern similar to that of the seemingly BrE-influenced IndE, the other outer circle variety. Instead, HKE and SinE have raw frequency patterns which are distinct from each other and from the three other varieties investigated. As Figure 3 and Table 2 show, the two varieties do not share frequencies of utterance-final usage, despite having similar contact languages. Most obvious is the wide gap between HKE and SinE in the total number of utterance-final InTs, with SinE having almost two and a half times as many InTs as HKE. Indeed, the clearest points of similarity between the two varieties are the relatively comparable numbers for wah (7 for SinE and 5 for HKE) and the absence of see in utterance-final position. The two varieties also share use of is it, okay/OK, and you know (with approximately 50-60% fewer uses of is it and you know in HKE than in SinE, but more HKE occurrences of okay/OK). Most notable, however, are the three clear preferences for InTs in SinE: you see, right, and la.4
Figure 2: Raw frequencies of the seven shared InTs between BrE, NZE and IndE
Figure 3: Raw tag frequencies in SinE and HKE
3.1 Discussion
Several points emerge from the results given above. Firstly, the low number of tags in ICE-GB and ICE-HK may suggest that BrE and HKE do not use tags to a high degree. However, BrE’s tallies for is it? and isn’t it? as canonical (that is, variant) tags were 215 and 156 respectively, with only one example each in noncanonical, invariant use. This suggests that canonical tags are regularly in use in BrE, perhaps more so than the invariant type. Similarly, it is not possible to search for known InTs such as innit (e.g. Berland 1997) in BrE, as innit is normalised to isn’t it in the ICECUP (and thus the exported ICE-GB) corpus content. This suggests that a higher number of InTs ought to exist in BrE, but that they have been obscured in the corpus by the normalisation process. The implications of such normalisations are discussed further in Columbus (in revision). Another complicating factor may have been the difference in visible mark-up in the ICE-GB transcriptions via WordSmith. The comparative lack of InTs in HKE, however, is less easily explained. While this may also be due to higher canonical question tag use, there is a relatively high number of invariant uses of is it? and isn’t it? More research into canonical question tag use in HKE may clarify the matter.

Secondly, there is a strong resemblance between the raw frequencies for IndE and BrE. The seven items shared by BrE, NZE and IndE in Figure 2 (OK/okay, right, see, yeah, yes, you know and you see) have very similar rates of occurrence. In particular, the pattern of frequent usage is the same, but IndE has extended the pattern to include indigenous InTs, such as ah, ahn, and accha. This extension from the (likely) BrE base contrasts with the NZE pattern. NZE appears instead to have taken the base set from BrE and changed both the relative frequencies and the preferred items. Where IndE uses no and shares high use of you know with BrE, NZE makes relatively little use of the InTs in the set but for eh. A search for potential indigenous NZE InTs (such as Maori kao and ae) also revealed no non-English-based tags in use.

With respect to the two English varieties with related contact languages, SinE and HKE, the results above show that there is no apparent similarity between HKE and SinE in InT usage, with the possible exception of the non-use of see. While HKE has right as a weakly preferred marker, SinE has a strong preference for la as a tag. Right and you know are more often used in HKE, but given the low occurrence of utterance-final tags in HKE overall, these form a high proportion of the InTs used.

4. General discussion and conclusions
As the results in Figure 1 and Table 2 show, the InT patterns for BrE, IndE, NZE, HKE and SinE are unique to each variety, with the exception of IndE’s extension of the BrE pattern. NZE shares little frequency of InT usage with the other varieties, save the use of yeah in BrE, IndE and HKE, and perhaps you see with BrE and IndE. For the most part, NZE speakers prefer eh over other InTs. SinE also has one preferred marker, la. The variety has a second preferred item, right, which is used to a lesser degree in HKE (though as HKE speakers’ preferred marker), but rarely in BrE, IndE and NZE. Perhaps surprisingly, only one English variety of the five investigated here shows a clear relationship to another. IndE appears to have taken the base set of InTs from BrE and built upon it. Even the number of new, indigenous-based items (four) is higher than in the other varieties in the private spoken ICE corpora (none in NZE or BrE, two in both HKE and SinE). There are otherwise few InTs which are common across the varieties in terms of rates of usage (at least for the corpus time period of circa 1990-1999). Such a distribution pattern is not without implications; it is to these we now turn.

4.1 Further implications
The relative dissimilarity in the selection and use of InTs in these varieties in utterance-final position implies that the use of InTs is not comparable across British-type Englishes. This is clear in the relative frequencies of the items and in the preferred tags, or lack thereof, in each variety. Such a lack of similarity in attitudinal nuance could be problematic for global English use; varietal differences at the level above propositional understanding could cause problems for intercultural and global communication. This in turn has implications for pedagogy and materials for ESL/EFL and English for Specific/Business Purposes (ESP): Global English as a lingua franca for both interpersonal and international business needs relies on mutual intelligibility. An awareness of these subtle differences in attitudinal and evidential meaning seems necessary at the varietal level. From an ESL/EFL perspective, these differences are at least as unevenly distributed as accent and vocabulary, with differences in meaning across the English-speaking world. ESP syllabi thus need to go beyond the current focus on polarity and general meaning in canonical tags, and consider the role of invariant tags in conversation when designing curricula and materials.

Finally, this study set out to compare the use of InTs in five varieties of English. The variance in use and subtle meanings of a single discourse marker group such as InTs may suggest that a global language cannot in fact guarantee global communication. These differences in frequency may prove challenging for speakers unfamiliar with the variety; however, the results also show similarity in the set, as four items are still shared across the five Englishes. This augurs well for other Englishes, and suggests that with a raised level of awareness, the attitudinal level of tag usage will not be lost in international communication.

Notes

1 I would like to thank John Newman for his comments and suggestions on previous drafts of this paper, as well as the original study which this paper extends. Additionally, I would like to thank two anonymous reviewers for their helpful comments, as well as participants at ICAME 28, in particular Sebastian Hoffmann and Andrea Sand, for their insight and comments on this presentation. All errors, of course, remain my own. Some of the frequency results in this paper relating to the BrE, NZE and IndE study have been submitted for publication (Columbus, in revision).

2 I assume here that “message” means ‘proposition’.

3 While all ICE corpora have the same mark-up options, it is up to individual project teams to determine the completed format. Thus differences exist in the detail of mark-up tags used by each variety and the layout of the corpus and mark-up in its final form.

4 The spelling of la/lah in ICE-SIN is restricted to la; without the intonation information and pronunciation of the tag it is not possible to determine if this is one marker or a combination of the la and lah variants noted by Kwan-Terry (1978) and Bell and Peng Quee Ser (1983). Hence, they are treated together in this analysis.
References

Algeo, J. (1988), The tag question in British English: it’s different, i’n’it? English Worldwide, 9, (2), 171-191.
Andersen, G. (1997), “I goes you hang it up in your shower innit? He goes yeah.” The use and development of invariant tags and follow-ups in London teenage speech. Paper presented at the 1st UK Language Variation Workshop, Reading, United Kingdom.
Andersen, G. (1998), Are tag questions questions? Evidence from spoken data. Paper presented at the 19th ICAME Conference, Belfast, United Kingdom.
Andersen, G. (2001), Pragmatic markers and sociolinguistic variation. Amsterdam/Philadelphia: John Benjamins.
Avis, W. (1972), So eh? Is Canadian, eh?. Canadian Journal of Linguistics, 17, 89-105.
Bell, R. and L. Peng Quee Ser (1983), “‘Today la?’ ‘Tomorrow lah!’; the LA particle in Singapore English”. RELC Journal, 14, (2), 1-18.
Berland, U. (1997), “Invariant tags: pragmatic functions of innit, okay, right and yeah in London teenage conversations”. Unpublished master’s thesis, University of Bergen, Norway.
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman Grammar of Spoken and Written English. Harlow: Longman.
Columbus, G. (in revision), A comparative analysis of invariant tags in three varieties of English. English Worldwide.
Columbus, G. (forthcoming). “Ah lovely stuff, eh?” On invariant tag meanings and usage across three varieties of English, in: S. Gries, S. Wulff and M. Davies (eds.) Corpus linguistic applications: current studies, new directions. Amsterdam: Rodopi.
The Department of English, The National University of Singapore (2002), The ICE-SIN Corpus.
Gibson, D. (1977), Eight types of ‘eh’. Sociolinguistics Newsletter 8 (1), 30-31.
Gold, E. (2005), Canadian Eh?: A survey of contemporary use, in: M. Junker, M. McGinnis and Y. Roberge (eds.), Proceedings of the 2004 Canadian Linguistics Association Annual Conference. Retrieved November 19, 2006 from: http://http-server.carleton.ca/~mojunker/ACL-CLA.
Holmes, J. (1982), The functions of tag questions. English Language Research Journal, 3, 40-65.
Hong Kong Polytechnic University, The University of Hong Kong and The Chinese University of Hong Kong (2006), The ICE-HK Corpus.
Kwan-Terry, A. (1978), The meaning and source of the “la” and the “what” particles in Singapore English. RELC Journal, 9, (2), 22-36.
Lee, J. (2004), A Dictionary of Singlish and Singapore English. Retrieved September 7, 2007 from: http://home.pacific.net.sg/~willows5/singlish_L.htm
Love, T. (1973), “An examination of eh as question particle.” Honours thesis, University of Alberta.
Meyerhoff, M. (1992), ‘We’ve all got to go one day, eh?’: Powerlessness and solidarity in the functions of a New Zealand tag, in: K. Hall, M. Bucholtz and B. Moonwomon, (eds.) Locating power: Proceedings of the Second Annual Berkeley Women and Language Conference. Berkeley, California: Berkeley Women and Language Group, 409-419.
Norrick, N.R. (1995), Hunh-tags and evidentiality in conversation. Journal of Pragmatics, 23, 687-692.
Richards, J.C. and M.W.J. Tay (1977), The La particle in Singapore English, in: W. Crewe (ed.) The English language in Singapore, 141-155. Singapore: Eastern Universities Press.
Schiffrin, D. (1987), Discourse markers. Cambridge: Cambridge University Press.
School of Linguistics and Applied Language Studies, Victoria University of Wellington (1999), The ICE-NZ Corpus.
Shivaji University, Kolhapur, and the Freie Universität Berlin (2002), The ICE-IND Corpus.
Starks, D., L. Thompson and J. Christie (2008), Whose discourse particles? New Zealand eh in the Niuean migrant community. Journal of Pragmatics 40 (7), 1279-1295.
Stubbe, M. and J. Holmes. (1995), You know, eh and other exasperating ‘expressions’: an analysis of social and stylistic variation in the use of pragmatic devices in a sample of New Zealand English. Language and Communication, 15, 63-88.
Stubbe, M. (1999), Research report: Maori and Pakeha uses of selected devices. Te reo, 42, 39-53.
Survey of English Usage, University College London (1998), The ICE Corpus Utility Program (ICECUP 3.1).
Survey of English Usage, University College London (1998), The ICE-GB Corpus.
Discourse presentation in EFL textbooks: a BNC-based study

Christoph Rühlemann
Ludwig-Maximilians-Universität, Munich

Abstract

Following corpus-linguistic research which has shown the representation of certain lexico-grammatical features in EFL textbooks to be at variance with their use in native English, this paper aims to explore the match or mismatch of discourse presentation (often referred to as ‘speech reporting’) in conversation and its representation in EFL textbooks. The analysis of selected textbooks shows that textbook representation is overwhelmingly concerned with indirect and, to a much lesser extent, narratised mode but not direct mode, the free categories and representation of voice. Further, textbooks promote quotatives typical of written registers but not informal everyday speech. Specifically, I show that discourse presentation in EFL textbooks features essential parallels with a written register, namely journalistic writing. The concluding section considers implications for EFL teaching.
1. Introduction
In recent years, an impressive body of applied corpus linguistic research has been accumulated, pointing out gaps between school English and native spoken English as recorded in corpora. The comparative analyses so far have focused on features of lexico-grammar. The features whose treatment in textbooks has been found to be at variance with their use in actual discourse include (i) modal verbs such as can, will, must, may, shall, (ii) conditional clauses and (iii) future time orientation through will and going to (Mindt 1996); (i) any, (ii) will and would, and (iii) irregular verbs (Mindt 1997); the linking adverbial though (Conrad 2004); and progressives (Römer 2005). This paper attempts to demonstrate that one crucial discourse area in which the gap is particularly wide is discourse presentation, often also referred to as ‘speech reporting’.1

Given that, for reasons of applied linguistic grading and simplification, school English will, to some extent, always be at variance with naturally-occurring English, a crucial question is whether discourse presentation is some remote or otherwise negligible aspect of conversational behaviour that school English need not be modelled on in great detail, or whether it constitutes something more important in the conversational arena which school English should take great care to represent to the best of its abilities.

There is evidence to suggest that discourse presentation is indeed central to conversation. An initial indication is the fact that the verb SAY is among the most frequent words in various spoken corpora. According to Kilgarriff’s (1998)
frequency list, said – by far the most frequent form of the lemma SAY – is ranked 42nd in the conversational subcorpus of the British National Corpus (BNC), representing the second most frequent content word (only the content word know is more frequent). Said is ranked similarly highly in the Cambridge and Nottingham Corpus of Discourse in English (McCarthy 1998: 122 f.). Considering that the form know is overwhelmingly used as part of discourse markers such as you know and I don’t know, said might well be ranked first in the list of lexical words in the conversational subcorpus of the BNC – indeed, in the Longman Spoken and Written English Corpus, SAY turned out to be “the single most common lexical verb” (Biber et al. 1999: 373). Thus, the prominent frequency of SAY suggests that sharing with others what was said in anterior situations is fundamental to conversation. Why is this so? The answer becomes obvious when we consider what discourse presentation is used for in conversation: it is an essential ingredient of narrative (cf. Schiffrin 1981: 58). Narrative is seen in Tannen (1986; 1988) as ‘drama’, creating interpersonal involvement and rapport. In her view, discourse presentation (her term being ‘constructed dialogue’) “is a means by which experience surpasses story to become drama” (Tannen 1986: 312). Thus, discourse presentation, as a building block of ‘narrative as drama’, is frequent in, and central to, conversation because it makes a decisive contribution to a fundamental function of language use – what Malinowski identified as ‘phatic communion’: discourse presentation is a means “to establish bonds of personal union” (1923: 480). In sum, discourse presentation is an important component of conversation both in terms of frequency and in terms of its interpersonal function. It is therefore consistent to expect that discourse presentation be covered in very good detail in EFL teaching. 
This paper will demonstrate that, in actual fact, very much ‘good detail’ is still missing from the discourse presentation as represented in most EFL textbooks. The paper is divided into three main parts. The first part summarises research on two major aspects of how native speakers go about presenting discourse, viz. reporting mode(s) and quotative verbs. The second, major, part looks at how discourse presentation is represented in seven internationally marketed EFL textbooks; here, too, the focus will be on reporting modes and quotatives. The analyses of how EFL quotatives distribute across major English registers will be based on the British National Corpus (BNC) (XML Edition). The concluding part briefly juxtaposes the results of the two analyses and outlines what seems to me the main implication of the stark contrast between ‘real’ and ‘school’ discourse presentation: the need to rethink the role of Standard English in the EFL classroom.
2. Discourse presentation in conversation
In this section I briefly summarise sociolinguistic and corpus-linguistic findings related to two central aspects of conversational discourse presentation: reporting mode and quotative verbs. The section starts with a look into the reporting mode which is typically used in conversation.

2.1 Reporting mode in conversation
Broadly, four types of reporting mode can be distinguished. With reference to the examples listed below, discourse presentation can be direct as in (1); indirect as in (3); narratised as in (4), to use McCarthy’s (1998) term, a more convenient label than the corresponding ‘narrator’s representation of speech act’ (NRSA) category of McIntyre et al. (2004); and what McIntyre et al. (2004) refer to as ‘representation of voice’ (RV) as in (5). This latter category “captures minimal references to speech with no indication of the illocutionary force, let alone the propositional content or form of the utterance (part)” (McIntyre et al. 2004: 62). Subtypes of direct and indirect mode are free direct (or ‘zero quotative’) and free indirect mode, that is, presentation without a reporting clause (cf. McIntyre et al. 2004: 64). (2) exemplifies free direct mode:

(1) direct: And then he said here’s the hymns, put those hymns up now. (BNC: KBO 3461)
(2) free direct: [Speaker is reporting how someone asked him/her for change for a fiver]. I said no! [ ... ] only. So ... well can you lend me a pound? I said no! (BNC: KD5 7945)
(3) indirect: Well I phoned Shirley ... and she said she’s fine. (BNC: KB8 3541)
(4) narratised: So we asked for twenty thousand pound upfront. (BNC: KB9 3284)
(5) voice: I was sitting there talking and they had a drop, drop of wine (BNC: KC2 1222)
Structurally, direct and indirect mode, on the one hand, and narratised mode and representation of voice, on the other, are neatly distinguished by the fact that the former typically have two clauses – a reporting clause containing the quotative verb and a reported clause containing the discourse reported – while the latter have only one clause (Semino and Short 2004: 11). Functionally, a fundamental difference between the direct modes and all other modes lies in the speaker perspective (Coulmas 1986: 2): while direct mode is characterized by the presenting speaker switching, as it were, into the non-present speaker’s deictic
system whose discourse is being presented, thus adopting his/her deictic perspective, indirect and narratised modes as well as RV mode presentations relate the (usually anterior) speech event from the presenter’s own deictic perspective. To further understand how the presentation modes are functionally distinguished it is helpful to bear in mind that discourse presentation involves an intertwining of two discourse situations – the current situation where the presentation is being made and the anterior situation where the language presented was originally produced (Short et al. 1996: 114). That is, discourse presentation is a type of mediation between a here-and-now speech situation and a there-and-then speech situation. In mediating between the two, speakers can make the anterior speech situation more or less immediate in the present speech situation depending on their choice of presentation mode: the degrees of immediacy continuously decrease from (free) indirect to narratised mode to RV (cf. Leech and Short 1981; Semino and Short 2004), whereas (free) direct mode serves to re-construct the anterior speech situation with the highest degree of immediacy because, due to the presenter’s switch into the presentee’s deictic perspective (and, additionally, due to imitation of voice-related characteristics such as prosody or voice quality), the presented discourse is uttered as if the speaker whose discourse is being presented were present in the current speech situation. Which is the preferred mode in conversation? There is agreement that discourse presentation in conversation is overwhelmingly in direct mode (e.g., McCarthy 1998: 161; McIntyre et al. 2004: 69). In Halliday and Matthiessen (2004: 444), direct mode presentation (their term being ‘paratactic projection’) accounted for roughly 75 per cent (indirect presentation, or ‘hypotactic projection’, accounted for 25 per cent). 
In a close analysis of a sample of 300 occurrences of said, the most frequent form of the lemma SAY (see section 2.2), which is, in turn, commonly seen as the most frequent quotative verb, said turned out to introduce direct mode presentation in 215 occurrences, representing 72 per cent (Rühlemann 2007: 124). GO and BE like even invariably launch direct mode presentations (e.g., Butters 1980: 305; Schourup 1982: 148), and even THINK, although less clearly, seems to display a preference for direct mode (Rühlemann 2007: Chapter 6; but see McIntyre et al. 2004 who found that THINK introduced mainly indirect mode presentations). The preferred choice of mode in conversation is, thus, the direct mode. In terms of discourse presentation as a building block of ‘narrative as drama’, it will be obvious that direct mode is the most ‘dramatic’. While in indirect and narratised mode “speakers use themselves as the spatiotemporal point of reference” (Romaine and Lange 1991: 229), speakers using direct mode slip out of their deictic system and into that of a displaced speaker’s. In so doing, they effectively lend their voice to somebody not present and, thus, act like an actor on a stage, uttering words which are not their own. Direct mode is also more dramatic than indirect (and narratised mode) because one problem posed by indirect mode is “how to capture the emotive affective aspects of speech. Insofar as these are expressed not in the content, but in the form of the message, they are
not preserved in indirect reporting” (Romaine and Lange 1991: 240). That is, it is only in direct mode presentation that the expressive potential of the human voice can be exploited. Again, it makes sense to interpret this association of conversational discourse presentation with direct mode as a dramatic device which helps the narrator achieve his/her basic aim: to bring the narration as close to the interlocutors as possible and, thus, engage them affectively.

2.2 Quotative verbs in conversation
Which quotatives are most frequent in conversation? It appears that, in conversation, a small set of verbs dominate the quotative system. According to Tagliamonte and Hudson, “[t]he complete inventory of quotatives used to introduce constructed dialogue in British and Canadian English comprise four major verbs, say, go, think, be like and zero” (1999: 155). For an identical set of quotative verbs used in American English see Buchstaller (2002; cf. also Tannen 1986); a similar top five list was observed in Macaulay (2001) for Scottish English. In conversational language use, then, the most frequent quotatives are to a large extent shared across regional varieties of English. The four quotative verbs are briefly characterised in the following. SAY: It is uncontroversial to view SAY as the ‘default verb’ in conversational discourse presentation both in North-American and British English (e.g., Romaine and Lange 1991: 242; Ferrara and Bell 1995; Buchstaller 2002: 14). In Tagliamonte and Hudson’s (1999: 158) corpus of tape-recorded narratives of personal experience, SAY was the most frequent quotative – 31 percent in British English and 36 percent in Canadian English. However, there is evidence that the dominance of SAY is being challenged, particularly because of the influence of the new quotatives BE like and GO. There is good evidence for such waning dominance in the usage of adolescent speakers: SAY was observed to trail far behind BE like in narratives told by Canadian youths (Tagliamonte and D’Arcy 2004), while GO was used more frequently than SAY by adolescents in Glasgow (Macaulay 2001: 10) and London (Stenström et al. 2002). THINK: Another traditional quotative is THINK. While the present tense form think, particularly when associated with the subject I, is mostly used as a discourse marker (cf. Carter et al. 2000), the past tense form thought seems to be used frequently to introduce discourse presentation. 
In a sample of 300 randomly selected occurrences, thought acted as a reporting verb in more than half of all occurrences (Rühlemann 2007: 138). In the sample, quotative thought mostly introduced direct presentations. Note, however, that use of quotative THINK dramatically decreases in adolescent speech: in Macaulay (2001) and Tagliamonte and D’Arcy (2004), for example, this quotative accounted for a mere two per cent.
BE like: There is evidence that BE like has gained a notable frequency in U.S. American English, the variety in which it is commonly assumed to have originated (e.g., Fairon and Singler 2006), and that, as noted above, BE like has made major inroads into Canadian English. Tagliamonte and D’Arcy (2004) note a dramatic increase in the use of BE like compared to an earlier study (Tagliamonte and Hudson 1999): in Tagliamonte and D’Arcy’s corpus, BE like turned out to be by far the most frequent quotative overall (accounting for 58 per cent of all quotatives), while SAY, GO, and THINK were observed to decline in frequency (Tagliamonte and D’Arcy 2004: 501) (for other varieties in which BE like has been attested see Buchstaller 2008 and references therein). The status of BE like in British English, by contrast, is as yet relatively uncertain. In research carried out on British speech data from the early 1990s, BE like’s frequency was low (cf. Miller and Weinert 1998; Andersen 2001; Rühlemann 2007). However, BE like in British English may well be spreading (e.g., Romaine and Lange 1991; Ferrara and Bell 1995; Andersen 2001; Buchstaller 2002). Strong evidence of this comes from Tagliamonte and Hudson (1999): in their corpus of narratives told by university students in England in 1996, quotative BE like, THINK, and quotative GO were equally represented (18 per cent each).

GO: In contrast to BE like, whose frequency in current British English is as yet somewhat unclear, quotative GO is, on the available evidence, very frequent. Biber et al. (1999: 1119) found that quotative use of the third person singular present tense form goes is particularly frequent (for supportive evidence see Stenström et al. 2002). Observations made on non-computerized collections of personal experience narratives also suggest that quotative GO is recurrent in British English: in Macaulay (2001: 10) and Stenström et al.
(2002), GO had a higher frequency than SAY in Scottish and London youth respectively, and in Tagliamonte and Hudson (1999: 158) GO was equally frequent as THINK and BE like in British youth. High frequencies of quotative GO were also reported for Canadian English (Tagliamonte and Hudson 1999) and U.S. American English (Tannen 1986; Blyth et al. 1990; Ferrara and Bell 1995: 274). Finally, it should also briefly be noted that the four quotatives fulfil different functions in discourse. While SAY and THINK are relatively straightforward, introducing mainly speech and, respectively, thought presentations, BE like and GO act as “‘anything-goes’-items” (Buchstaller 2002: 10). That is, they are able to introduce a broad range of different types of content of the quote: both BE like and GO have been observed to introduce not only speech and thought, but also gesture (Butters 1980: 305; Ferrara and Bell 1995: 281), and emotion (Romaine and Lange 1991: 238; Ferrara and Bell 1995: 282 ff.; Adolphs and Carter 2003: 54; Buchstaller 2002: 15; Rühlemann 2007: 149 ff.). Consider (5): aargh vocalizes the pain the speaker felt after a skiing accident: (6)
I was just like aargh. (BNC: KPV 2371)
Discourse presentation in EFL textbooks
Additionally, GO has the capacity to introduce presentations of non-human sound (e.g., Butters 1980: 306 f.; Macaulay 2001: 15). In (7), for example, the speaker is presenting sounds made by a cat:
(7) She sits there she goes [sucking then purring noises] and she stops and you’re just about to go to sleep and she goes [purring noises] so loud! (BNC: KPG 3613)
To summarize this section, discourse presentation in conversation is a richly diversified dramatic activity: presenters do not merely ‘report’ isolated speech but stage and enact whole scenes from the past, animating voice qualities, utterances, thoughts, emotions, and the sounds of people, animals, and things in action. How is this everyday drama reflected in textbooks?
3. Analysis of discourse presentation in selected EFL textbooks
Reporting is generally introduced in EFL teaching at intermediate level. Accordingly, the seven textbooks selected for analysis all cater for that level. They are given in Table 1 in alphabetical order. The textbooks will be referred to in the following sections by their acronyms listed in Table 1:

Table 1: EFL textbooks under examination

Textbook                                                Acronym
Cutting Edge (Intermediate) (2005)                      CUT
Innovations (Intermediate) Workbook (2004)              INN
Inside Out (Intermediate) (2000)                        INS
New Headway (Intermediate) (2003)                       NEW
Reward (Intermediate) (1995)                            REW
Straightforward (Intermediate Student’s Book) (2006)    STR
Touchstone 4 (2006)                                     TOU

The series from which TOU is taken stands out from the others because it draws on the Cambridge International Corpus; it is thus one of the very few textbook series for learners of English which consistently draw on corpus data and insights from corpus research; see also the Collins COBUILD English course (e.g., Willis and Willis 1989) which is based on the Birmingham Corpus – now the Bank of English. Additionally, the textbook puts an extra emphasis on highlighting ‘conversational strategies’ and ‘conversational grammar’. Given the corpus-based approach and the focus on conversation, TOU is a milestone in the history of English textbooks. We saw in the analysis of conversational discourse presentation (see section 2.1) that a crucial choice concerns mode. Which modes are promoted in EFL textbooks?
Christoph Rühlemann
3.1 Reporting mode in EFL textbooks
To address the above question the relevant units and sections in all seven textbooks were carefully studied.

Table 2: Mentions of different types of reporting mode in textbooks (D: direct; FD: free direct; I: indirect; FI: free indirect; N: narratised; V: representation of voice)

        D      FD     I      FI     N        V
CUT     no     no     yes    no     (yes)    no
INN     no     no     yes    no     (yes)    no
INS     no     no     yes    no     no       no
NEW     no     no     yes    no     (yes)    no
REW     no     no     yes    no     no       no
STR     no     no     yes    no     no       no
TOU     no     no     yes    no     no       no

Table 2 shows that none of the textbooks, including corpus-based TOU, mention either the free categories FD and FI or ‘representation of voice’ or, most importantly, direct mode as ways of presenting discourse in their own right. Instead, the focus is exclusively on indirect and, to a much smaller degree, narratised mode. Narratised mode is not taught explicitly in any of the textbooks. In CUT, INN and NEW, narratised mode only occurs implicitly in a few example sentences and fill-in-the-gap exercises, as in this one from CUT: “(…) would you ______ her the truth?” (p. 107) where the learner is expected to fill in tell and where truth gives a mere summary of the utterance(s) presented.
The complete absence of explicit mention of direct mode shown in Table 2 is not to say that the notion of ‘direct speech’ does not figure prominently in the textbooks. In fact, both instances of direct mode presentation and the term ‘direct speech’ recur quite frequently across all relevant textbook units and grammar reference sections. However, instances of direct mode presentation only occur in narrative texts (here, interestingly, the preponderance of direct mode typical of non-textbook fiction is often faithfully reproduced but never pointed out to the learners). Further, where explicit attention is drawn to ‘direct speech’ as such, this is invariably done in the context of transformational exercises; that is, in exercises in which direct speech merely serves as raw material for transformations into indirect speech, which involve applying the rules of ‘backshift’ and making the necessary changes in deictic usage. None of the textbooks inform the learner that direct speech presentation is a reporting mode in its own right. Indeed, indirect mode is presented as if it were the norm in any context of use. Consider, for example, this statement introducing the relevant grammar reference section in NEW: “It is usual for the verb in the reported clause to move ‘one tense back’ if
the reporting verb is in the past tense (e.g., said, told)” (p. 150). Learners consulting the language summary section in the back of CUT are informed: “When we report someone’s words afterwards, the verb forms often move into the past” (p. 152). Similar generalized descriptions could be quoted from the other textbooks as well. Mode representation in textbooks, hence, suggests that ‘reported speech’ is synonymous with ‘indirect speech’. Learners are likely to form the impression that the reporting system is a one-way system, admitting only the choice of indirect mode (or, to a far lesser extent, narratised mode); nowhere is it mentioned that direct mode is not merely an alternative choice but the preferred choice in conversation.
Moreover, direct mode is the major mode not only in conversation but also in fictional writing, as research by Leech and Short (1981: 334) and Semino and Short (2004: 89) has shown. Thus, by failing to include treatment of direct mode, textbooks fail to represent how discourse is presented not only in conversation but also in fiction. Interestingly, however, indirect mode and narratised mode, while being of secondary importance in conversation and fiction, seem to be primary in journalistic writing. Comparing discourse presentation across three written corpora – fiction, newspaper news reports, and (auto)biographies – Semino and Short (2004: 225) found these two modes to be predominant in their press corpus. The following analysis of quotatives in EFL textbooks suggests that more parallels can be found between discourse presentation in textbooks and discourse presentation in newspaper reportage.
3.2 Quotatives in EFL textbooks
Table 3 lists all reporting verbs mentioned in the seven textbooks:

Table 3: Quotatives in EFL textbooks

7x      ASK, SAY, TELL
2x +    ADVISE, COMPLAIN, EXPLAIN, INVITE, PERSUADE, REMIND, REFUSE, SUGGEST, THINK, WARN
1x      ACCEPT, ADD, AGREE, APOLOGISE, BEG, CLAIM, CONCLUDE, DECIDE, DENY, ENCOURAGE, ENQUIRE, HEAR, HOPE, INFORM, INTRODUCE, INSIST, ORDER, PROMISE, RECALL, WANT to know

As shown in Table 3, ASK, SAY, and TELL are included in all seven textbooks; ten verbs are found in two or more of the seven textbooks while 20 verbs are mentioned in one textbook only. Table 3 allows for three initial observations. First, it will not be surprising that SAY is included in all textbooks; as mentioned above, there is broad agreement that SAY is the ‘default reporting verb’. By contrast, as far as ASK and TELL are concerned, which are also included in all seven textbooks, there is some evidence that these two verbs may be far from frequent, at least in speech. For instance, in Tagliamonte and Hudson’s (1999) corpus of British and Canadian quotatives, ASK and TELL were found to be very infrequent, accounting for very small percentage values. I suspect that the three verbs SAY, TELL, and ASK enjoy such popularity with textbook writers because they are generally considered to be associated with a particular type of mood: SAY is seen as the verbum dicendi for statements, ASK for questions, and ASK and TELL for directives and requests. Indeed, in most of the textbooks, treatment of ‘reporting speech’ is divided into three sections: reporting statements, reporting questions and reporting directives and requests. In STR, for example, the relevant headings read: ‘reported speech & thought’, ‘reported questions’, and ‘tell & ask with infinitive’. Second, given their non-standard nature, it may not be surprising that BE like and GO are not included in any of the textbooks.2 But it may come as a surprise that THINK is not consistently included: it is mentioned only in INS and STR. That is, in five out of seven textbooks no thought is given to the presentation of thought, no doubt an important factor in conversational narrative. 
Third, there seems to be little agreement as to which reporting verbs should be covered: the textbooks differ noticeably from each other as to which and how many quotatives are mentioned (see Appendix 2 for a list of quotatives by textbook). This lack of agreement may be due to the fact that decisions in textbooks regarding the inclusion or exclusion of lexical items are generally made on bases other than frequency analyses in representative corpora (see, for example, Mindt’s (1997) alternative proposal of a list of irregular verbs based on their relative frequency). In light of the above mentioned parallels regarding mode between textbooks and journalistic writing, I thought it interesting to investigate whether the quotatives covered in the textbooks show a general preference for writing and/or a specific preference for journalistic writing. Using the set of pre-defined subcorpora in the BNC XML Edition, a comparative analysis was conducted investigating the frequencies of those reporting verbs which are mentioned in at least two of the seven textbooks across what Biber et al. (1999) refer to as the major English registers, viz. Academic Writing, Fiction and Verse, Newspapers, and Conversation. This distributional analysis did not include the verbs SAY and THINK simply because of the broad agreement that these two quotatives are crucial both in writing and speech. It needs to be admitted, however, that a register-distributional analysis of the lemmas of verbs used as quotative verbs in EFL textbooks is not without problems because we cannot be sure that all the forms are being used as
quotatives in all the four registers considered. Even seemingly straightforward quotatives such as SAY and TELL can be used as non-quotative verb forms. To name only two examples: the formula I say predominantly acts as a discourse marker rather than a quotative (Rühlemann 2007: 172 ff.), and the verb TELL can be used as a mental verb, as in Yeah but you can’t tell screws from security can you? Further, we cannot rule out completely that the quotative proportions of any given verb may exhibit significant variation across the registers. To ensure with sufficient confidence that only quotative uses are being compared it would be necessary to download all instances of each form of each lemma in each of the four registers and to inspect concordance lines; and to ensure that register variation in quotative use is taken into account it would be necessary to compare quotative proportions. Obviously, going to these lengths is not feasible in the present context. Instead, I will assume that the eleven verbs examined, all of which are clearly verba dicendi, predominantly perform a quotative function in any register, while fully acknowledging the inherent dangers in so doing. Bearing these reservations in mind, the results of the following analyses need to be taken as approximate rather than definitive.
Table 4 lists eleven quotatives shared by at least two of the textbooks under examination, together with their raw frequencies and normed frequencies per million words in the four registers; further, it shows the ratios obtained between, on the one hand, the three written registers taken together and, on the other, conversation:

Table 4: Distributional analysis of eleven verbs across Academic Writing (ACW), Fiction and Verse (FIC), Newspapers (NEW), and Conversation (CON) in the BNC XML Edition (RF: raw frequency; NFpm: normed frequency per million words; Ratio W/C: ratio of writing to conversation; * marks the highest normed frequency in each row, shaded in the original)

Verb                 ACW 18m  FIC 19m  NEW 11m   CON 5m   Ratio W/C
ASK           RF       4,327   21,576    4,357    2,530
              NFpm       240    1,136*     396      506        1.17
ADVISE        RF         703      498      562       27
              NFpm        39       26       51*       5        7.73
COMPLAIN      RF         583      783      613      148
              NFpm        32       41       56*      30        1.43
EXPLAIN       RF       4,421    3,620    1,340      206
              NFpm       246*     191      122       41        4.55
INVITE        RF         634    1,197      685      166
              NFpm        35       63*      62       33        1.37
PERSUADE      RF         652      896      579       37
              NFpm        36       47       53*       7        6.48
REMIND        RF         469    1,896      335      201
              NFpm        26      100*      31       40        1.31
REFUSE        RF       1,622    1,844    1,833       55
              NFpm        90       97      167*      11       10.73
SUGGEST       RF       9,537    2,644    1,811      128
              NFpm       530*     139      165       26       10.69
TELL          RF       3,031   29,025    8,353    6,731
              NFpm       168    1,528*     759    1,346        0.61
WARN          RF         350    1,265    1,692       46
              NFpm        19       67      154*       9        8.89

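The normed frequencies and ratios in Table 4 can be recomputed from the raw counts. The following Python sketch is illustrative only and is not part of the original analysis. It assumes that the subcorpus sizes are exactly the rounded figures given in the column heads (18, 19, 11, and 5 million words), that normed frequencies are rounded to whole numbers, and that the W/C ratio is the mean of the three written normed frequencies divided by the conversational one. This formula is an inference from the published figures (it reproduces, for instance, the values for ADVISE and TELL); the function names are hypothetical.

```python
# Subcorpus sizes in millions of words, taken from the column heads of Table 4.
# Assumption: the rounded sizes are treated as exact.
SIZES_M = {"ACW": 18, "FIC": 19, "NEW": 11, "CON": 5}
WRITTEN = ("ACW", "FIC", "NEW")


def normed_per_million(raw, size_m):
    """Raw frequency divided by subcorpus size (in millions), rounded to an integer."""
    return round(raw / size_m)


def writing_conversation_ratio(raw_by_register):
    """Assumed formula: mean of the written NFpm values divided by the CON NFpm."""
    nf = {reg: normed_per_million(raw, SIZES_M[reg])
          for reg, raw in raw_by_register.items()}
    written_mean = sum(nf[reg] for reg in WRITTEN) / len(WRITTEN)
    return round(written_mean / nf["CON"], 2)


# Raw frequencies for ADVISE and TELL as listed in Table 4.
advise = {"ACW": 703, "FIC": 498, "NEW": 562, "CON": 27}
tell = {"ACW": 3031, "FIC": 29025, "NEW": 8353, "CON": 6731}

print(writing_conversation_ratio(advise))  # 7.73
print(writing_conversation_ratio(tell))    # 0.61
```

A ratio above 1 indicates that a verb is, on average, more frequent per million words in the written registers than in conversation; a ratio below 1, as for TELL, indicates the reverse.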
The results displayed in Table 4 suggest two major conclusions: (i) that the verbs are typically used in writing rather than informal speech and (ii) that they are mostly used in journalistic writing. The evidence for (i) is twofold. First, as can be seen from the shaded figures highlighting the highest frequency in each row of normed frequencies,
none of the eleven verbs reach the highest frequency in Conversation. On the contrary, eight of them are least frequent in Conversation; only ASK, REMIND, and TELL do not follow this pattern (all three are least frequent in Academic Writing). Second, the ratios between the three written registers, on the one hand, and conversation, on the other, show that TELL, for which a ratio of 0.61 was obtained, is the only verb which is more frequent in conversation than in the three written context types taken together. The remaining ten verbs are invariably more frequent in writing than in conversation, with four verbs displaying slight preferences for writing – ASK (1.17), COMPLAIN (1.43), INVITE (1.37), and REMIND (1.31) – and, conversely, six verbs displaying strong and very strong preferences for the written mode – ADVISE (7.73), EXPLAIN (4.55), PERSUADE (6.48), REFUSE (10.73), SUGGEST (10.69), and WARN (8.89).
Initial evidence for (ii), that the verbs in question are mostly used in journalistic writing, is the fact that, as can be seen in Table 4, five out of eleven quotatives obtain the highest normed frequency in Newspapers compared to Conversation and the two other written subcorpora. Further, this tendency becomes stronger as soon as we take the group of verbs mentioned in only one textbook (see Table 3) into account. Table 5 shows the results of a distributional analysis of these 20 quotatives. Again, the highest normed frequencies per quotative are shaded:
Table 5: Distributional analysis of quotatives mentioned in only one out of seven textbooks across Academic Writing (ACW), Fiction and Verse (FIC), Newspapers (NEW), and Conversation (CON) in the BNC XML Edition

(RF: raw frequency; NFpm: normed frequency per million words; * marks the highest normed frequency in each row, shaded in the original)

Verb                 ACW 18m  FIC 19m  NEW 11m   CON 5m
ACCEPT        RF       4,381    2,186    1,703      153
              NFpm       243*     115      155       31
ADD           RF       3,006    4,001    5,062      278
              NFpm       167      211      460*      56
AGREE         RF       3,065    3,483    2,304      264
              NFpm       170      183      210*      53
APOLOGISE     RF          33      377      241       16
              NFpm         2       20       22*       3
BEG           RF         130      794      142       84
              NFpm         7       44*      13       17
CLAIM         RF       3,359      838    3,764      110
              NFpm       187       44      342*      22
CONCLUDE      RF       1,968      384      364        4
              NFpm       109*      20       33        1
DECIDE        RF       3,386    4,548    2,661      450
              NFpm       188      239      242*      90
DENY          RF       1,354      985    1,642       27
              NFpm        75       52      149*       5
ENCOURAGE     RF       2,307      544      913       60
              NFpm       128*      29       83       12
ENQUIRE       RF          99      778       19       13
              NFpm         6       41*       2        3
HEAR          RF       1,778   13,980    3,233    2,407
              NFpm        99      736*     180      481
HOPE          RF       1,188    4,898    3,356      949
              NFpm        66      258      305*     190
INFORM        RF       1,069      820      379       30
              NFpm        59*      43       35        6
INSIST        RF         857    1,386    1,259       37
              NFpm        48       73      115*       7
INTRODUCE     RF       3,017      786    1,242       32
              NFpm       168*      41      113        6
ORDER         RF       1,105    1,461    1,010      187
              NFpm        61       77       92*      37
PROMISE       RF         442    1,834      923       87
              NFpm        25       97*      84       17
RECALL        RF         686    1,091      683       21
              NFpm        38       57       62*       4
WANT to know  RF         108    1,188      152      204
              NFpm         8       63*      14       41

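The counts of top-scoring registers discussed for Tables 4 and 5 can be reproduced mechanically. The sketch below is an illustration rather than the procedure used in the original study: it copies the normed frequencies per million words from the two tables into a dictionary, determines for each verb the register with the highest value, and tallies the results. The names `NFPM` and `top_register` are hypothetical.

```python
from collections import Counter

REGISTERS = ("ACW", "FIC", "NEW", "CON")

# Normed frequencies per million words (ACW, FIC, NEW, CON), from Tables 4 and 5.
NFPM = {
    # Table 4: verbs mentioned in at least two textbooks
    "ASK": (240, 1136, 396, 506),      "ADVISE": (39, 26, 51, 5),
    "COMPLAIN": (32, 41, 56, 30),      "EXPLAIN": (246, 191, 122, 41),
    "INVITE": (35, 63, 62, 33),        "PERSUADE": (36, 47, 53, 7),
    "REMIND": (26, 100, 31, 40),       "REFUSE": (90, 97, 167, 11),
    "SUGGEST": (530, 139, 165, 26),    "TELL": (168, 1528, 759, 1346),
    "WARN": (19, 67, 154, 9),
    # Table 5: verbs mentioned in one textbook only
    "ACCEPT": (243, 115, 155, 31),     "ADD": (167, 211, 460, 56),
    "AGREE": (170, 183, 210, 53),      "APOLOGISE": (2, 20, 22, 3),
    "BEG": (7, 44, 13, 17),            "CLAIM": (187, 44, 342, 22),
    "CONCLUDE": (109, 20, 33, 1),      "DECIDE": (188, 239, 242, 90),
    "DENY": (75, 52, 149, 5),          "ENCOURAGE": (128, 29, 83, 12),
    "ENQUIRE": (6, 41, 2, 3),          "HEAR": (99, 736, 180, 481),
    "HOPE": (66, 258, 305, 190),       "INFORM": (59, 43, 35, 6),
    "INSIST": (48, 73, 115, 7),        "INTRODUCE": (168, 41, 113, 6),
    "ORDER": (61, 77, 92, 37),         "PROMISE": (25, 97, 84, 17),
    "RECALL": (38, 57, 62, 4),         "WANT to know": (8, 63, 14, 41),
}


def top_register(values):
    """Return the register label with the highest normed frequency."""
    return REGISTERS[values.index(max(values))]


# Tally how often each register has the highest normed frequency.
top_counts = Counter(top_register(values) for values in NFPM.values())

for register in REGISTERS:
    print(register, top_counts[register])
```

Run over all 31 verbs, the tally gives 15 for Newspapers, 9 for Fiction and Verse, 7 for Academic Writing, and none for Conversation.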
Ten out of the 20 quotatives listed in Table 5 are most frequent in Newspapers compared to Conversation, Fiction and Verse, and Academic Writing. Conversely, once again, Conversation remains without a top-scoring quotative, while Academic Writing and Fiction and Verse obtain highest frequencies with five quotatives each. If we combine the results from Table 4 and Table 5, we see that of all 31 EFL quotatives which were compared across registers (remember that SAY and THINK were excluded from this analysis), 15 verbs are most frequent in Newspapers, whereas Academic Writing has seven and Fiction and Verse nine top-scoring quotatives. That is, almost half of all EFL quotatives are used mostly in journalistic writing.
4. Conclusions and implications for EFL teaching
We can now compare the results of the analyses on discourse presentation in natural conversation and EFL textbooks and draw conclusions. As regards reporting mode, I have shown that discourse presentation in conversation is overwhelmingly in direct mode, whereas the modes promoted in textbooks are indirect mode and, to a much lesser extent, narratised mode. The focus on indirect mode in EFL textbooks is such that this mode is presented as if it were the default mode; narratised mode is mentioned only marginally, and neither representation of voice, free direct, free indirect nor, most importantly, direct mode is mentioned at all as a reporting mode in its own right. The analyses of the sets of quotatives used in conversation and EFL textbooks suggested that the two sets overlap to some degree – e.g., the default
quotative for speech presentation SAY is included in both sets – but, more importantly, diverge: while neither of the ‘new quotatives’ BE like and GO, which play increasingly important roles in conversation, is included in the textbooks, the EFL quotatives exhibit a skew not only towards the written mode but, specifically, towards journalistic writing.
Moreover, in the analysis of how mode and quotatives are realised in EFL textbooks I found evidence that discourse presentation in EFL textbooks resembles in essential ways discourse presentation in journalistic writing. This resemblance was observed on two levels. First, the heavy emphasis EFL discourse presentation puts on indirect and, to a smaller degree, narratised mode is reminiscent of the emphasis on indirect and narratised mode which Semino and Short (2004) found in their press corpus. Second, the distributional analyses carried out on the quotatives used in EFL textbooks suggest a clear tendency: almost half of all quotatives examined were used most frequently in the Newspapers subcorpus and less frequently in Academic Writing, Fiction and Verse, and Conversation. This double evidence raises the question whether the model underlying EFL discourse presentation is found in discourse presentation in journalistic writing – hence, maybe, the preference in EFL for the term ‘reporting’. Again, however, a cautionary note is in order not only because of the methodological reservations acknowledged above but also, as one reviewer commented, because no large claim can be made on the basis of a number of verbs that occur in just two to maximally seven texts, all of which have been published in one place (the UK). Bearing these reservations in mind, the overlap between EFL discourse presentation and journalistic writing found in this paper is merely sufficient to hypothesize that the former is modelled on the latter, and to leave this hypothesis to be tested in future research.
In conclusion, the comparison of discourse presentation in conversation and EFL textbooks shows that this is an area in which the gap between school English and real English is particularly wide: EFL textbooks disregard a primary reporting mode – direct mode – which is the norm not only in conversation, no doubt a ‘core register’ (cf. Rühlemann 2007), but also in fiction, a similarly important context type, thus creating the impression for EFL learners that indirect mode is the only choice they have for ‘reporting’. Further, EFL textbooks promote quotatives which will help EFL learners as readers of British newspapers but not as conversationalists in informal L2 encounters. Moreover, EFL textbooks fail to equip EFL learners with what is most central to discourse presentation in conversation: an awareness of what end discourse presentation serves in conversation, namely the establishment of ‘bonds of communion’ through the creation of narrative as drama, and the corresponding linguistic means to achieve that end. Thus, a yawning gap divides discourse presentation in natural conversation and EFL textbooks.
Can the gap be closed? Unlike, for example, modals or progressives, whose representation in textbooks, it seems, can easily be re-aligned to naturally-occurring English, attempts to re-align the representation of discourse presentation in EFL to conversational discourse presentation will face a major
problem. This problem arises from the fact that conversational discourse presentation is fraught with non-standard English. Take, for example, the reporting verbs GO and BE like: these are generally “considered by many people to be non-standard and grammatically unacceptable” (Carter and McCarthy 2006: 823), an observation supported by an attitudinal survey conducted by Blyth et al. (1990: 223) whose respondents judged the two verbs as “stigmatized, ungrammatical, and indicative of casual speech” (for a more differentiated picture of attitudes towards the two quotatives in the UK see Buchstaller 2006). To complicate matters, the two verbs are by no means the only non-standard features of discourse presentation. The long list of conversational discourse presentation features which are at odds with Standard English includes: I says, a seemingly clear case of ‘subject-verb discord’ (cf. Rühlemann 2007), use of SAY not only for presented statements but also questions, seemingly careless switches between historic past (HP) and narrative past (NP) (but see Schiffrin 1981 who saw HP in association with the Complicating events section in narratives), seemingly careless shifts between reporting verbs, seemingly unmotivated repetitions of reporting clauses, use of past –ing with reporting verbs (cf. McCarthy 1998), and, finally, use of ‘utterance openers’ such as oh and well (cf. Biber et al. 1999).
As regards its non-standardness, conversational discourse presentation is far from being exceptional; it is in the good company of a great many non-standard features distinctive of conversational English. Indeed, conversational language generally is ‘vernacular’ language to such an extent that Biber et al. (1999: 1121) declare the notion of Standard English “problematic in talking of the spoken language.” Hence, bringing the representation of conversational discourse presentation into closer correspondence with discourse presentation in natural conversation will be difficult because conversational discourse presentation, just as conversational language generally, conflicts with Standard English, the model which has long been predominant in EFL both for teaching writing and speech (Quirk 1985: 7). Therefore, we need to be aware that teaching conversational discourse presentation, just as teaching most other conversational features, presupposes a readiness to sacrifice, at least partly, Standard English as the ‘one-and-only’ model.3 Instead, it has been suggested, Standard English needs to be reduced to a ‘core variety’ (Bex 1993: 261), underlying the teaching of the written language, while the spoken language should be taught on the basis of the model of ‘conversational grammar’, a more appropriate model that major corpus linguistic studies have elaborated in great detail (e.g., Biber et al. 1999; Carter and McCarthy 2006).4 Such a ‘register approach’ (Rühlemann forthcoming) would tie in well with recent attempts to argue for a shift in emphasis from monolithic descriptions to register-specific descriptions of the grammar of English (Conrad 2000). The obvious advantage of this approach would be that it enables EFL teaching to reflect what is seen as a fundamental property of language: its functional diversity (cf. Stubbs 1993; Bex 1993).
Notes
1 For many researchers though, the term ‘speech report’ is a misnomer (cf. Tannen 1986) because neither do conversationalists ‘report’ faithfully what was being said nor is it always speech that is rendered but rather a broad spectrum of types of discourse, including not only actual speech but also habitual and potential speech, thought, emotion, gesture, and sound (e.g., Buchstaller 2008; Rühlemann 2007: 121 ff.).
2 This is not to imply that no findings from corpus research on conversational discourse presentation have found their way into the representation of discourse presentation in TOU. An example of what has been taken up is past –ing with reporting verbs, as in she was saying how nice it was – a form which serves “to focus on the content rather than the actual words” (McCarthy et al. 2006: 90; cf. also McCarthy 1998: Chapter 8; Biber et al. 1999: 1120; Rühlemann 2007: 133 f.). This form is explicitly taught in TOU. Also, TOU mentions BE like as a quotative; however, it does so not in the section on speech reporting but elsewhere (viz. in the context of summarising the various functions of discourse marker like) and without any illustrative examples.
3 For a discussion of the problems surrounding this partial sacrifice see Rühlemann (2008).
4 Noteworthy in this discussion are also attempts to argue for the acknowledgment of a ‘spoken standard’ complementing (and thus relativising) the ‘written standard’ (e.g., Carter 1999). It is questionable though whether all or most forms of conversational discourse presentation will be accepted as spoken standard. Particularly doubtful cases are, for example, I says, quotative GO (including I goes) or BE like, which generally seem to attract rather negative attitudes.
References
Adolphs, S. and R.A. Carter (2003), ‘And she’s like it’s terrible, like: Spoken Discourse, Grammar, and Corpus Analysis’, International Journal of English Studies 3 (1): 45-56.
Andersen, G. (2001), Pragmatic Markers and Sociolinguistic Variation. A relevance-theoretic approach to the language of adolescents. Amsterdam/Philadelphia: John Benjamins.
Bex, T. (1993), ‘Standards of English in Europe’, Multilingua 12 (3): 249-264.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman Grammar of Spoken and Written English. Harlow: Pearson Education Limited.
Blyth, C. Jr., S. Reckenwald and J. Wang (1990), ‘I’m like “Say what?!”: A new quotative in American oral narrative’, American Speech 65 (3): 215-227.
Buchstaller, I. (2002), ‘He goes and I’m like: The new Quotatives re-visited’, Internet Proceedings of the University of Edinburgh Postgraduate Conference 1-20.
Buchstaller, I. (2006), ‘Social stereotypes, personality traits and regional perceptions displaced: Attitudes towards the ‘new’ quotatives in the UK’, Journal of Sociolinguistics 10 (3): 362-381.
Buchstaller, I. (2008), ‘The localization of global linguistic variants’, English World-Wide 29 (1): 15-44.
Butters, R.R. (1980), ‘Narrative Go ‘Say’’, American Speech 55 (4): 304-307.
Carter, R.A. (1999), ‘Standard grammars, spoken grammars: Some educational implications’, in: T. Bex and R.J. Watts (eds.) Standard English. The widening debate. London: Routledge, pp. 149-166.
Carter, R.A., R. Hughes and M.J. McCarthy (2000), Exploring Grammar in Context. Cambridge: Cambridge University Press.
Carter, R.A. and M.J. McCarthy (2006), Cambridge Grammar of English. Cambridge: Cambridge University Press.
Conrad, S. (2000), ‘Will corpus linguistics revolutionize grammar teaching in the 21st century?’, TESOL Quarterly 34 (3): 548-560.
Conrad, S. (2004), ‘Corpus linguistics, language variation, and language teaching’, in: J. McH. Sinclair (ed.) How to Use Corpora in Language Teaching. Amsterdam/Philadelphia: John Benjamins, pp. 67-85.
Coulmas, F. (1986), ‘Reported speech: Some general issues’, in: F. Coulmas (ed.) Direct and Indirect Speech. Berlin/New York/Amsterdam: Mouton de Gruyter, pp. 1-28.
Fairon, C. and J.V. Singler (2006), ‘I’m like, “Hey, it works!”: Using GlossaNet to find attestations of the quotative (be) like in English-language newspapers’, in: A. Renouf and A. Kehoe (eds.) The Changing Face of Corpus Linguistics. Amsterdam/New York: Rodopi, pp. 325-336.
Ferrara, K. and B. Bell (1995), ‘Sociolinguistic variation and discourse function of constructed dialogue introducers: The case of be + like’, American Speech 70 (3): 265-290.
Halliday, M.A.K. and M.I.M. Matthiessen (2004), An Introduction to Functional Grammar (3rd edition). London: Edward Arnold.
Kilgarriff, A. (1998), ‘BNC database and word frequency lists’, http://www.kilgarriff.co.uk/bnc-readme.html.
Leech, G. and M. Short (1981), Style in Fiction. London/New York: Longman.
Macaulay, R. (2001), ‘You’re like ‘why not?’ The quotative expressions of Glasgow adolescents’, Journal of Sociolinguistics 5 (1): 3-21.
Malinowski, B. (1923), ‘The problem of meaning in primitive languages’, in: C.K. Ogden and I.A. Richards (eds.) The Meaning of Meaning. London: Routledge, pp. 296-336.
McCarthy, M.J. (1998), Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press.
McIntyre, D., C. Bellard-Thomson, J. Heywood, T. McEnery, E. Semino and M. Short (2004), ‘Investigating the presentation of speech, writing and thought in spoken British English: A corpus-based approach’, ICAME 28: 49-76.
Miller, J. and R. Weinert (1998), Spontaneous Spoken Language: Syntax and Discourse. Oxford: Clarendon Press.
Mindt, D. (1996), ‘English corpus linguistics and the foreign language teaching syllabus’, in: J. Thomas and M. Short (eds.) Using Corpora for Language Research. London: Longman, pp. 232-247.
Mindt, D. (1997), ‘Corpora and the Teaching of English in Germany’, in: A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds.) Teaching and Language Corpora. Harlow: Longman, pp. 41-50.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A Comprehensive Grammar of the English Language. London: Longman.
Romaine, S. and D. Lange (1991), ‘The use of like as a marker of reported speech and thought: a case of ongoing grammaticalization in progress’, American Speech 66 (3): 227-279.
Römer, U. (2005), Progressives, Patterns, Pedagogy. A Corpus-driven Approach to English Progressive Forms, Functions, Contexts and Didactics. Amsterdam/Philadelphia: John Benjamins.
Rühlemann, C. (2007), Conversation in Context. A Corpus-driven Approach. London: Continuum.
Rühlemann, C. (2008), ‘A register approach to teaching conversation: Farewell to Standard English?’, Applied Linguistics 29 (4): 672-693.
Schourup, L. (1982), ‘Quoting with Go ‘Say’’, American Speech 57 (2): 148-9.
Semino, E. and M. Short (2004), Corpus stylistics: Speech, writing and thought presentation in a corpus of English writing. London: Routledge.
Short, M., E. Semino and J. Culpeper (1996), ‘Using a corpus for stylistics research: speech and thought presentation’, in: J. Thomas and M. Short (eds.) Using Corpora for Language Research. Studies in Honour of Geoffrey Leech. London/New York: Longman, pp. 110-131.
Sinclair, J. McH. (ed.) (1989), Collins COBUILD Dictionary of Phrasal Verbs. London: HarperCollins.
Stenström, A., G. Andersen and I.K. Hasund (2002), Trends in Teenage Talk. Amsterdam/Philadelphia: John Benjamins.
Stubbs, M. (1993), ‘British Traditions in Text Analysis – From Firth to Sinclair’, in: M. Baker, G. Francis and E. Tognini-Bonelli (eds.) Text and Technology. In Honour of John Sinclair. Amsterdam/Philadelphia: John Benjamins, pp. 1-33.
Tagliamonte, S. and A. D’Arcy (2004), ‘He’s like, she’s like: The quotative system in Canadian youth’, Journal of Sociolinguistics 8 (4): 493-514.
Tagliamonte, S. and R. Hudson (1999), ‘Be like et al. beyond America: The quotative system in British and Canadian youth’, Journal of Sociolinguistics 3 (2): 147-172.
Tannen, D. (1986), ‘Introducing constructed dialogue in Greek and American conversational and literary narrative’, in: F. Coulmas (ed.) Direct and indirect speech. Berlin: Mouton de Gruyter, pp. 311-332.
Tannen, D. (1988), ‘Hearing voices in conversation, fiction and mixed genres’, in: Tannen, D. (ed.). Linguistics in Context: Connecting Observation and Understanding. Norwood, NJ: Ablex, 89-113.
Willis, J. and D. Willis (1989), Collins COBUILD English course. London: Collins COBUILD.

Appendix 1: Sources for textbook representation of discourse presentation

Cunningham, S. and P. Moor (2004), Cutting Edge (Intermediate). London: Heinle.
Deller, H. and A. Walker (2004), Innovations (Intermediate) Workbook. Harlow: Longman.
Greenall, S. (1995), Reward (Intermediate). Oxford: Macmillan Education.
Kay, S. and V. Jones (2000), Inside Out (Intermediate). Oxford: Macmillan Education.
Kerr, P. and C. Jones (2006), Straightforward (Intermediate Student’s Book). Oxford: Macmillan Education.
McCarthy, M.J., J. McCarten and H. Sandiford (2006), Touchstone. Student’s book 4. Cambridge: Cambridge University Press.
Soars, L. and J. Soars (2003), New Headway (Intermediate Student’s Book). Oxford: Oxford University Press.

Appendix 2: Quotative verbs by textbook

CUT: ask, say, tell
INN: apologise, ask, complain, enquire, invite, persuade, say, suggest, tell
INS: ask, say, tell, think
NEW: advise, ask, beg, invite, order, refuse, remind, say, tell
REW: accept, advise, agree, ask, decide, encourage, explain, hope, introduce, persuade, promise, refuse, remind, say, suggest, tell, warn
STR: ask, claim, complain, deny, inform, insist, know, say, tell, think, want to know, warn
TOU: add, ask, conclude, explain, recall, remember, say, tell
Awful adjectives: a type of semantic change in present-day corpora

Göran Kjellmer
University of Gothenburg

Abstract

Semantic change observable in isolated linguistic items is both frequent and interesting in itself. More interesting, perhaps, are cases of structural change, i.e. cases where one and the same tendency can be discerned in a related group of words. This paper uses modern corpus material in order to sketch the development of one such group, words meaning ‘frightening’, and suggests that they all follow the same trend in the direction of ‘impressive, overwhelming’ although they differ with respect to how far they have advanced along that route. The semantic changes of some 25 words in the chosen area are studied in detail, and their development is illustrated with corpus material. One of the conclusions of the study is that their rate of semantic progress is partly dependent on the time when they entered the semantic field. The paper deals with the adjectives in the group and leaves the adverbs, although equally interesting, out of account for a later investigation.
1. Introduction
Saussure’s division of language study into a synchronic and a diachronic section is not always possible or indeed fruitful to uphold. Many of the perplexing phenomena in the language of today can be naturally explained with reference to historical facts and, perhaps more importantly, there are changes taking place in the modern language before our eyes. To insist on a rigorous synchronic OR diachronic approach in such matters would therefore be counterproductive. The present paper will study a case of ongoing semantic change in modern English, a transition in meaning from negative to positive, a change that is often referred to as amelioration. Amelioration can be found in many, perhaps all, languages, and a few examples may be illuminating. Terribilis is used in a positive sense in the Vulgate translation of the Song of Songs (Snaith 1993: 88), and negative-to-positive changes are found in several Semitic languages (Goitein 1965: 220). Swedish grym ‘cruel’ is used informally in the sense of ‘very good, skilful, “cool”’ (NEO). In English, shrewd has passed from meaning ‘wicked’ to meaning ‘clever, astute’, and nice has passed from meaning ‘foolish, stupid’ to meaning ‘agreeable, delightful’ (OED). And, finally, a recent English parallel is the use of wicked to mean ‘excellent, splendid; remarkable’ (OED, s.v. Wicked 3.b).
All those are thus cases of amelioration, a well-known type of semantic change, although most of the time the change is not as extreme as in the examples just given. The examples of amelioration we have just seen are isolated instances of semantic change. What would be much more interesting would be if more general trends were to be found, in line with Stern’s finding (1964 [1931]: 190) that “English adverbs which have acquired the sense ‘rapidly’ before 1300 always develop the sense ‘immediately’”. This paper will try to find such regularity in the semantic change of words meaning ‘frightening’.

2. Adjectives in the field
A terrible film is a very bad film, but a terrific film is a very good film. Both terrible and terrific originally denote ‘causing terror’1; they have thus developed differently, though not necessarily in different directions. Similarly, an awful place is a bad place, but an awesome place is a positively impressive one.2 Again both awful and awesome originally meant practically the same thing,3 and again they have developed differently. What is common to the two pairs is thus that a negative element that has remained in one member has developed into a positive one in the other member. We might hypothesise that adjectives having (had) the meaning ‘causing fear’ in common will show a degree of similarity in their developmental tendencies. It may be that they will coincide to such an extent that a tendency for the whole group will appear. A study of the group from this perspective could therefore be of interest. The adjectives to be looked into are listed in Table 1.

Table 1: Adjectives meaning ‘causing fear’

alarming        dreadful      hairy        redoubtable   terrifying
appalling       fearful       horrendous   scary         tremendous
awe-inspiring   fearsome      horrible     shocking
awesome         formidable    horrific     startling
awful           frightening   horrifying   terrible
creepy          frightful     ominous      terrific
The words can be said to belong to the same semantic field, where the common factor is ‘causing fear’, or causing closely related sensations such as awe, dread, fright, horror, shock, terror, trembling. There is no suggestion that the words are absolute synonyms at any stage of their development, only that they share or have shared an important element in their semantic make-up. As we shall see, that element is present to varying degrees in the words as used today. The words, particularly in their early uses, overlapped in meaning to a considerable extent, the common element being ‘frightening’. Some of the words also have a prior stage in common, namely that of ‘feeling fear’, ‘frightened’.
OED’s first recorded occurrence of awesome means ‘full of awe’ (1598); similarly that of dreadful means ‘full of dread’ (a1225) and that of frightful means ‘full of terror’ (c1250). In some cases such a sense seems to have developed later than the ‘frightening’ one: the ‘frightening’ sense is found earlier, but awful is recorded as ‘terror-stricken’ (c1590), fearful as ‘frightened’ (c1374), fearsome (“?erron.”) as ‘timid’ (1863) and scary as ‘frightened’ (1800). However, in time they all come together in the meaning of ‘frightening’, a common starting-point for their subsequent development. Note that this change implies a widening of application: only a person or an animal can be full of awe, but living creatures as well as lifeless things can be frightening. There is a certain semantic fluctuation in the present-day use of the words with occasional uncertainty as to the exact meaning of the items in individual cases; the present-day semantics in the field are sometimes indistinct and worth looking into. I will address the subject in two different ways, one synchronic and one that could be called synchronic-diachronic.

3. Synchronic approach. Semantic polarity of head-words
If we go back to the pair terrible: terrific, which had quite different meanings, one positive and one negative, with the head-word film, the question arises, how do we know whether a “frightening” word is positive or negative when the head does not suggest either interpretation, i.e. when it is neutral in the positive-negative dimension? The head, that is, does not seem to be of much help here. Nevertheless, it seems clear that we can say a terrible disaster or a terrific achievement but hardly *a terrible achievement or *a terrific disaster. We may then assume, rather trivially, that the nature of the heads of the NPs in which the adjectives occur will give some indication of the semantics of the adjectives. This is where corpus evidence will be most useful.

The terrible: terrific examples suggest that one contrast likely to play a distinguishing role in the nominal heads is that between semantically positive and semantically negative. An achievement can be seen as a positive thing, as something good, and a disaster as a negative thing, as something bad. However, it is obvious that for many, probably most, nouns there is no such semantic charge: a thought, an experience, a feeling, a job are neither good nor bad in themselves. Determining the semantic prosody of the nominal heads with a tripartite classification of the heads as positive, negative and neutral was therefore seen as important.

In order to find relevant material for statistical calculations the CobuildDirect Corpus was used. A list was produced of all the relevant adjectives immediately followed by a noun found in the Corpus. For each adjective its most frequent nominal collocates were selected; they were limited to those occurring at least twice, up to a maximum of 20 nouns. The nouns were then classified into the three classes Positive, Negative and Neutral. The following criteria for the
classification were used. If the noun (as used by itself) fits into either of the slots in the formula

    It was / This was / These were / She was / He was (quite a(n)) ---, so obviously I liked it/them/him/her
    This was evidence of ---, so obviously I liked it/them/him/her

it was considered a positive noun (e.g. achievement), but if it fits into the formula

    It was / This was / These were / She was / He was (quite a(n)) ---, so obviously I DIDN’T like it (etc.)
    This was evidence of ---, so obviously I DIDN’T like it (etc.)

it was considered a negative noun (e.g. shock). If, finally, it fits into neither formula, or both, it was classed as a neutral noun (e.g. contrast). A subjective element is as inevitable here as it is in the classification of our adjectives. “VALUE adjectives are thus subjective in the same sense as deictic terms: their referential meaning is largely dependent on their speaker’s identity.” (Adamson 2000:45). The nominal heads are listed in Appendix 1. It appears that the great majority of the nominal heads fall in the category Neutral and that the rest is unequally divided between Positive and Negative, so that Negative is roughly three times as frequent as Positive, as might have been expected given the original meaning of the adjectives. The distribution of the adjectives over the positive, negative and neutral heads is shown in Appendix 2. Without going into too much detail here, the clear difference in meaning that could be observed with terrible and terrific is reflected in the distribution of their nominal collocates: terrible occurs with 8 neutral and 12 negative heads but never with a positive head, whereas terrific occurs with 7 positive heads and 13 neutral ones but never with a negative head. When adjectives that take positive heads occur with neutral heads they normally still convey that positive meaning, as in formidable reputation or tremendous amount. On the other hand, adjectives with frequently occurring negative heads can be seen to convey the same negative meaning with neutral heads, as in horrendous consequences, horrific incident or terrifying experience. The proportion of the adjective’s occurrences with positive or negative heads thus seems to be indicative of the meaning it carries with neutral heads. But although that may be so, the division of collocates into positive, negative and neutral is not enough to explain how the words have moved across
their semantic terrain. We will then have to apply a synchronic-diachronic approach.

4. Synchronic-diachronic approach
Several of the ‘frightening’ words develop semantically from ‘frightening’ to ‘(very) impressive’ or ‘overwhelming’. In order to understand how this is possible one could posit several intermediate steps, from ‘frightening’ to ‘very bad’, from ‘very bad’ to ‘great/big/large’, and from ‘great/big/large’ to ‘impressive, overwhelming’. It is characteristic of the succession of steps that for any two adjoining steps the speaker can intend one and at the same time imply the other one. Traugott (1990: 498f.; Traugott and Dasher 2002: passim) uses the term “invited inference” and shows that invited inferences can become lexicalised. The second step will then take over the main import of the word, without necessarily letting go of the first meaning.4 This is a situation that Stern (1964 [1931]: 380) describes as “adequation”. Referring to the semantic change of horn from ‘animal’s horn’ to ‘musical instrument of a certain kind’, he says,

    The principal element of its meaning - of the subjective apprehension of the referent - changes; the notion ‘animal’s horn’ recedes to a subsidiary position, and the notion ‘instrument of a certain kind’ takes its place as predominant.

It is only when the hearer accepts this added element as being part of the word’s meaning that a semantic change takes place. Semantic change is thus a result of a collaborative (but mostly unconscious) effort:

    [Meanings have] a starting point in the conventional given, but in the course of ongoing interaction meaning is negotiated, i.e. jointly and collaboratively constructed ... This is the setting of semantic variability and change. (Lewandowska-Tomaszczyk, quoted from Traugott & Dasher 2002: 25)

It should be stressed that the originally dominant element need not disappear but can “recede to a subsidiary position”. This will serve as a characterisation of each one of the steps leading from ‘frightening’ to ‘impressive’.
The original semantic component of fear may even remain as a background component all through the later development of the word, cf. sentence (5). The progression could be viewed as a change from less subjective to more subjective, in which case it would be in line with the principle of semantic change put forward by Traugott (1990: 500). Let us now take a look at the steps separately. First there is the semantic transition from ‘frightening’ to ‘very bad’. An evaluative element is introduced, which will be part of the word all through its
later development. Awful carnage is presumably frightening, but it is also very bad, as in

(1) He is creating racial hatred against ethnic minorities, as he would approve the awful carnage of the Muslims by ethnic cleansing in the former Yugoslavia. Corpus: ukmags/03. Text: N0000000887.

A slight shift in meaning then makes it possible for awful to refer to things that are very bad but may not be frightening, as in

(2) What a vile place, what a bloody awful place to spend a bloody awful afternoon. Corpus: ukbooks/08. Text: B0000000100.

Fear is hardly involved here. The next step, from ‘very bad’ to ‘great/big/large’, follows logically. If something stands out as being very bad, it may be because of its scale or size, as in:

(3) I think political stands will account for an awful lot. Corpus: npr/07. Text: S2000910312.

where there may be no suggestion of ‘badness’, but where the evaluative element is clearly present. A final step will then be that from ‘very great/big/large’ to ‘impressive’ or ‘fascinating’, again a natural and logical development. What is very big is often also impressive, fascinating or even overwhelming. Cf.

(4) There is an awful suspense in watching this self-absorbed creature being taken over, ... Corpus: times/10. Text: N2000960217.

The coupling of big size with impressiveness and fascination leads on to a situation where the words in the field can denote impressiveness without at the same time necessarily denoting magnitude:

(5) Can we ordain to ourselves the awful majesty of God - to decide what cities and villages are to be destroyed, who will live and who will die...? Corpus: usbooks/09. Text: B9000001351.
In (5) the original component of ‘fear, dread’, in this case of the deity, can be seen to remain alongside the new one of fascination. The full scale then stretches from ‘frightening’ over ‘very bad’ to ‘very great’ and culminates in ‘very impressive’, ‘fascinating’, ‘overwhelming’. In a more general way, the change can be seen as going from negative to positive, the first two steps representing the negative side and the last two steps representing the positive side. Ullmann (1962: 137) sees this change as resulting from a tendency to overstatement.5 There is a great deal of continuity in the development and no sudden jumps from one meaning to the next. One developmental stage is always foreshadowed
(“invited”) in the previous stage. A very similar development is described by Gustaf Stern (1921: 261), who discusses the historical change of the Old English adverb fæste from the sense ‘strongly, immovably’ to ‘closely, securely, well’, and says,

    The whole process consists of a series of small changes, each representing an imperceptible advance in one direction, and capable of being explained as an association of the simplest kind. It is not necessary, at any point, to assume complicated psychic processes in order to explain the development.

His description is equally relevant to the semantic development of the “fear-inspiring” words.

A graphic illustration of the semantic field is presented in Appendix 3, where examples of head-words appear under the relevant senses. The adjectives in the left column have (roughly) the meaning given at the top as their predominant element when they modify nouns of the type given under the meanings. The adjectives differ in the extent to which they have covered the way towards fascination and impressiveness; hairy and creepy have just begun their progress in that direction whereas the adjectives at the bottom of the table have gone all the way. Cf. Traugott (1990: 514): “[S]emantic change very rarely applies to items of the same lexical field at the same time, and thus is rarely capturable in a rule.” Even if hairy and creepy are beginning to move in the same direction as the others, it should be stressed that that is not necessarily the case. A word with the semantic characteristics of the group dealt with here is likely to change in the direction suggested, but it does not have to do so.6 What is very clear is that the words in the field all move in the same direction. It seems less likely that a word should develop in the opposite direction, from ‘fascinating’ to ‘frightening’. This raises the question of directionality in semantic change.
Particularly within the theory of grammaticalisation claims have been made that changes always move in the same direction, from lexical to grammatical and not the other way round. (Cf. the title of Traugott’s 1990 paper.) Lass (2000) contests these claims in a spirited paper, where he shows that the strong version of the unidirectionality position is untenable. Even if most of the evidence supports the hypothesis that all grammaticalisation is unidirectional, the hypothesis must remain a hypothesis for several reasons. (The number of counterexamples is theoretically infinite, the difference between lexical and grammatical is insufficiently well defined, etc.) Similarly, Olga Fischer uses the story of the to-infinitive to show that “grammaticalisation processes do not always run the same course, that there may be differences between similar languages, that the process may indeed be reverted, and that this relates to the specific grammatical circumstances that a language finds itself in” (2000: 163). The weaker position, that some kind of unidirectionality can normally be observed in semantic change, thus a tendency but not a law, seems in any case defensible, if less revolutionary. This then
applies not only to grammaticalisation but to semantic change more generally. “The crucial point is that if SP/Ws [speakers/ writers] begin to exploit a lexeme in new ways, and the new meanings are adopted by others, the reverse order of change is not expected.” (Traugott and Dasher 2002: 281) And as we saw, the words in our lexical field, although deriving from widely different sources, all follow the same semantic path.7 As there are no given borderlines between the stages of progression, the proportions between the constitutive semantic elements of the words change as time goes by. Figure 1 is a schematic and greatly simplified representation that shows their development in the positive-negative dimension. The adjectives are seen to move from left to right in the semantic spectrum. At the beginning of their career, the negative semantic elements prevail, but with time positive elements grow in relative importance until they are totally predominant. Different adjectives represent different stages of this development.
[Figure: schematic diagram with two regions labelled Positive (‘great’, ‘impressive’) and Negative (‘frightening’, ‘bad’)]

Figure 1: Development of “frightening” adjectives in a positive-negative dimension

5. Conclusion
There is a great deal of dynamism and regularity in the group of ‘frightening’ adjectives. Many words have come together in the common sense of ‘causing fear’. The ‘frightening’ sense is a common starting-point, and a necessary one for the subsequent development. The early stages of this change, the ‘frightening’ element, may remain as part of the word’s semantic set-up throughout its development, as in awe-inspiring, or they may fade away through the process of semantic bleaching, as in terrific. It seems probable that adjectives that have only covered part of the stretch will eventually acquire the sense of ‘impressive’. How long that will take will vary with the individual words, as new words meaning ‘frightening’ are likely to come into use, like the comparatively recent hairy or creepy. But that words meaning ‘frightening’ will develop in the direction of
‘impressive, overwhelming’ is as probable as that Stern’s words meaning ‘rapidly’ developed into words meaning ‘immediately’.

Notes

1 OED, s.v. Terrible 1: “Exciting or fitted to excite terror; such as to inspire great fear or dread; frightful, dreadful.” OED, s.v. Terrific 1: “Causing terror, terrifying; fitted to terrify; dreadful, terrible, frightful.”

2 As in the quotes from the CobuildDirect Corpus: “my problem is that er I make it sound as though place name’s an absolutely awful place and er place name is not an absolutely awful place.” “This is real Lawrence of Arabia country, an awesome place of shimmering sands described in his Revolt in the Desert.”

3 OED, s.v. Awful 1: Awe-inspiring. OED, s.v. Awesome 2: Inspiring awe.

4 “[O]ld and new meanings typically coexist in the same text [...] original meanings tend to persist so that no pure synonyms develop” (Traugott and Dasher 2002: 280).

5 “In a less extreme form, the same tendency to overstatement is responsible for countless hyperbolical expressions in everyday life: awful, dreadful, frightful, terrific, tremendous, abysmal, bottomless, deadly, and many more. The meaning of some of these words has been completely cancelled out by their emotional tone: to speak of a ‘terrific success’, a ‘tremendous welcome’, or of something ‘awfully funny’, is really a contradiction in terms.”

6 Cf. “No lexeme is required to undergo the type of change schematized here ... The hypothesis is that if a lexeme with the appropriate semantics undergoes change, it is probable that the change will be of the type specified.” (Traugott and Dasher 2002: 281)

7 It may be of some interest that Fred Householder argued, even in 1992, against any kind of directionality in semantic change. (Householder 1992).
References

Adamson, S. (2000), “A lovely little example. Word order options and category shift in the premodifying string.”, in: Fischer, Rosenbach and Stein, 39-66.
Allén, S. (ed.) (1995-96), Nationalencyklopedins ordbok. Göteborg and Höganäs: Språkdata and Bra Böcker. (NEO)
American heritage dictionary, see Soukhanov, 1992.
Bright, W. (ed.) (1992), International Encyclopedia of Linguistics. New York & Oxford: Oxford University Press.
CobuildDirect corpus, an on-line service: http://titania.cobuild.collins.co.uk.
Fischer, O. (2000), “Grammaticalisation: Unidirectional, non-reversable? The case of to before the infinitive in English.”, in: Fischer, Rosenbach and Stein, 149-169.
Fischer, O., A. Rosenbach and D. Stein (eds.) (2000), Pathways of Change: Grammaticalization in English. Amsterdam/Philadelphia: Benjamins.
Goitein, S.D. (1965), “Splendid like the brilliant stars”, Journal of Semitic studies, 10: 220-221.
Householder, F.W. (1992), “Semantic and lexical change.”, in: Bright: 3: 387-389.
Lass, R. (2000), “Remarks on (uni)directionality.”, in: Fischer, Rosenbach and Stein, 207-227.
Maxidico = Domas, J. (ed.) 1996. Le Maxidico. Dictionnaire encyclopédique de la langue française. Éditions de la Connaissance.
NEO, see Allén (1995-96).
OALDCE = Wehmeier, S. (ed.) (2000), Oxford advanced learner’s dictionary of current English. 6th ed. Oxford: Oxford University Press.
OED = Simpson, J.A., and E.S.C. Weiner (eds.) (1989), The Oxford English dictionary, 2nd ed., online version. Oxford: Clarendon.
Saussure, F. de (1922 [1915]), Cours de linguistique générale. 2nd ed. Lausanne & Paris: Bally & Sechehaye.
Snaith, J.G. (1993), The Song of Songs: based on the revised standard version. London: Marshall Pickering.
Soukhanov, A.H. (ed.) (1992), The American heritage dictionary. 3rd ed. Boston and New York: Houghton Mifflin.
Stern, G. (1921), Swift, swiftly, and their synonyms. Göteborg: Wettergren & Kerber.
Stern, G. (1964 [1931]), Meaning and change of meaning. Bloomington: Indiana University Press. Originally published as Göteborgs högskolas årsskrift.
Traugott, E. Closs (1990), “From less to more situated in language: the unidirectionality of semantic change.”, in: Adamson, S., V. Law, N. Vincent and S. Wright (eds.). Papers from the 5th International Conference on English Historical Linguistics. Amsterdam/Philadelphia: Benjamins, 496-517.
Traugott, E. Closs, and R.B. Dasher (2002), Regularity in Semantic Change. Cambridge: Cambridge University Press.
Ullmann, S. (1962), Semantics. An introduction to the science of meaning. Oxford: Blackwell.
Appendix 1: Positive, negative and neutral nominal heads

Positive: achievement, advantage, boost, clarity, effort, energy, force, friends, fun, originality, performance, potential, power, quality, relief, responsibility, stability, strength, success, support, talent, value

Neutral: act, actions, admission, adventure, agenda, amount, announcement, anticipation, array, arsenal, aspect, attitude, barrier, behaviour, canyon, catalogue, challenge, change, character, claim, climax, colour, comeback, conclusion, condition, consequences, contrast, day, decline, defence, degree, development, difference, display, dream, drop, effect, events, evidence, example, exercise, experience, faces, fact, feeling, figure, film, foursome, frequency, game, group, guy, headlines, horse, idea ---

Negative: accident, allegation, assault, attack, blow, bore, bully, burden, burns, car crash, cloud, conflicts, cost, crash, crawlies, crime, cruelty, death, disease, error, fall, foe, indictment, injury, loss, mess, mistake, monster, murder, nightmare, obstacle, opponent, opposition, ordeal, pain, problem, rival, sex attack, shame, shock, shortage, slump, strain, threat, tragedy, trouble, warning, waste, violence
Appendix 2: Distribution of adjectives over positive, negative and neutral nominal heads

                       TYPES                  TOKENS                 % TOKENS
                 Pos. Neutral  Neg.     Pos. Neutral  Neg.     Pos. Neutral  Neg.
Alarming            -      18     2        -     103     7      0.0    93.6   6.4
Appalling           -      15     5        -     103    27      0.0    79.2  20.8
Awe-inspiring       -       1     -        -       2     -      0.0   100.0   0.0
Awesome             5      14     1       33      53     2     37.5    60.2   2.3
Awful               -      14     6        -     860    34      0.0    96.2   3.8
Creepy              -       3     1        -       8    14      0.0    36.4  63.6
Dreadful            -      12     8        -     107    45      0.0    70.4  29.6
Fearful             -       8     -        -      16     -      0.0   100.0   0.0
Fearsome            -       4     1        -      13     2      0.0    86.7  13.3
Formidable          2      12     6       14      75    41     10.8    57.7  31.5
Frightening         -      19     1        -     110     3      0.0    97.3   2.7
Frightful           -       3     3        -       6     7      0.0    46.2  53.8
Hairy               -       3     -        -       7     -      0.0   100.0   0.0
Horrendous          -       9    11        -      28    37      0.0    43.1  56.9
Horrible            1      16     3        4     147    18      2.4    87.0  10.7
Horrific            -       9    11        -      50   102      0.0    32.9  67.1
Horrifying          -       7     4        -      18     8      0.0    69.2  30.8
Ominous             -      11     3        -      37    11      0.0    77.1  22.9
Redoubtable         -       -     -        -       -     -      0.0     0.0   0.0
Scary               -      15     1        -      48     4      0.0    92.3   7.7
Shocking            -      15     5        -      75    14      0.0    84.3  15.7
Startling           2      18     -        6      63     -      8.7    91.3   0.0
Terrible            -       8    12        -     311   276      0.0    53.0  47.0
Terrific            7      13     -       31      85     -     26.7    73.3   0.0
Terrifying          -      15     5        -      70    24      0.0    74.5  25.5
Tremendous         12       6     2      166     194    30     42.6    49.7   7.7
TOTAL              28     269    91      254    2589   706      7.2    73.0  19.9
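The figures in this appendix follow from the procedure described in Section 3: take the most frequent nominal collocates of each adjective (at least two occurrences, at most 20 nouns), assign each noun to a class, and count types and tokens per class. The Python sketch below is not part of the original study; it merely illustrates how such a tabulation might be carried out. The input format (a list of adjective-noun pairs), the small polarity lookup and the function names top_collocates, tabulate and percentages are all assumptions made for the illustration, and the study's own classification was done by hand with the substitution formulas given in Section 3.

```python
from collections import Counter

# Hypothetical polarity lookup in the spirit of Appendix 1. Nouns that are
# not listed are treated as neutral (an assumption made for this sketch).
POLARITY = {"achievement": "pos", "success": "pos",
            "murder": "neg", "injury": "neg", "crash": "neg"}

CLASSES = ("pos", "neutral", "neg")


def top_collocates(pairs, adjective, min_freq=2, max_nouns=20):
    """Return (noun, frequency) for the most frequent nouns that follow
    `adjective`, keeping nouns seen at least `min_freq` times and at most
    `max_nouns` of them."""
    counts = Counter(noun for adj, noun in pairs if adj == adjective)
    frequent = [(n, f) for n, f in counts.most_common() if f >= min_freq]
    return frequent[:max_nouns]


def percentages(tokens):
    """Convert token counts per class into percentages, one decimal place."""
    total = sum(tokens.values())
    if total == 0:
        return {c: 0.0 for c in CLASSES}
    return {c: round(100 * tokens[c] / total, 1) for c in CLASSES}


def tabulate(collocates):
    """Count types and tokens per polarity class and derive % of tokens."""
    types = dict.fromkeys(CLASSES, 0)
    tokens = dict.fromkeys(CLASSES, 0)
    for noun, freq in collocates:
        cls = POLARITY.get(noun, "neutral")
        types[cls] += 1
        tokens[cls] += freq
    return types, tokens, percentages(tokens)


# Token counts reported for "horrific" in the table above.
print(percentages({"pos": 0, "neutral": 50, "neg": 102}))
# {'pos': 0.0, 'neutral': 32.9, 'neg': 67.1}
```

Applied to the token counts reported for horrific (0, 50 and 102), the percentage step gives 0.0, 32.9 and 67.1, which matches the corresponding row of the table.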
Appendix 3: Semantic progression of adjectives

                ‘frightening’   ‘very bad’      ‘very great/big/large’   ‘impressive’, ‘overwhelming’
                Negative        Negative        Neut./pos.               Positive
creepy          tale            dope            -                        -
hairy           moments         old boats       -                        -
ominous         clouds          news            -                        -
scary           film            prospect        -                        -
alarming        experience      effect          frequency                -
appalling       crime           behaviour       increase                 -
dreadful        disease         situation       noise                    -
fearful         wrath           racket          energy                   -
fearsome        attack          reputation      pace                     -
frightening     story           football        speed                    -
frightful       catastrophes    mess            lot                      -
horrendous      injury          mistake         number                   -
horrible        crime           embarrassment   road toll                -
horrific        murder          fall            traffic problem          -
horrifying      violence        moment          kick                     -
shocking        picture         waste           speed                    -
terrible        accident        loss            cost                     -
terrifying      violence        addiction       proportions              -
awe-inspiring   Civil Guards    loss            wingspan                 beauty
awesome         task            effect          display
awful           tragedy         mistake         lot                      majesty
formidable      threat          problem         energy                   intellect
unnerving       movie           habit           concentration            performance
redoubtable     fighter         -               sceptic                  larynx
startling       -               awkwardnesses   contrast                 originality
tremendous      -               problem         delight                  achievement
terrific        -               -               pace                     performance
disgrace
Global English – Global Corpora: Report on a panel discussion at the 28th ICAME conference

Marianne Hundt
Heidelberg University

1. Introduction
At the 28th ICAME conference, a panel discussion was held on the role of corpus linguistics in the study of English as a global language. The panel members were: Pam Peters, Joybrato Mukherjee and Anna Mauranen. The panel was chaired by Marianne Hundt. The topics to be covered were (a) English as an international lingua franca (EIL), (b) the question of ‘ownership’ or who to count as a native speaker, and (c) norms for global English. Since both the title of the panel and the topic areas were rather broad, we decided to focus the discussion by introducing provocative statements on the topic areas. The chair passed the following statements on to the panel members:

1. Corpus linguistics will enable us to describe the international core of English, namely those features that are shared by all L2 varieties of English.
2. One of the core requirements for inclusion in the International Corpus of English (ICE) is that the authors and speakers of the texts were educated through the medium of English – thus ‘English-medium education’ and ‘long-term residence’ have replaced the criterion of ‘nativeness’.
3. With its focus on ‘standard English’ (especially varieties of English as L1), corpus linguistics has (often involuntarily) fed into the ‘standard ideology’.
The idea for the panel discussion was to combine theoretical issues concerning ‘Global English’ with the methodological angle of corpus linguistics. Questions for discussion included: How do our methodological decisions influence our results? How does linguistic theory guide us in our methodological decision making? Do we have the ‘right’ corpora for studying global English? The panel opened with short ‘position statements’ from the panel members. Each of them focussed on a different topic area. The discussion that followed centred mainly on one point: the variety status of English as a lingua franca (ELF) and the norms that might apply to it. Furthermore, and as Anna Mauranen had predicted in her position statement, it was at times a rather emotional discussion. In this report of the panel, the position statements of the panel members are presented first; they were written by the panel members themselves. The summary of the ensuing discussion is based on the notes that David Minugh took
at the time. The names of the participants in the discussion are not mentioned although some statements may come close to verbatim passages in the original discussion.
2. Position statements
2.1 Pam Peters (Macquarie University, Sydney): The ICE corpora and Global English
Q. Do we have the “right” corpora for studying global English? How far do the ICE corpora go in meeting our research needs?
A. In a nutshell, only part of the way. The ICE project is remarkable in many ways, providing a larger view of world English than any corpus project before it. It does nevertheless constrain or frame our view of world Englishes in at least two ways.
With their fixed size (1 million words, half spoken discourse/half written discourse, and multiple subcategories of each), the ICE corpora inevitably provide only limited coverage of each variety, and a somewhat arbitrary range of lexis, morphology and syntactic constructions. Even high frequency polysemous words may not present identical sets of uses, especially in L2 varieties of English. For example, some uses of until in Singapore English are slightly different from those of international written English, particularly in situation-dependent discourse such as:
(1) I waited until I (was) angry; luckily my turn came ten minutes later.
Here the wait of the main clause continues all through the until clause, whereas in standard English the until-clause marks the point at which the main clause action ceases. Yet among 200 examples of until in Singapore ICE, there is only one example of this usage, in a rather fractured conversation. Since this probably reflects the Chinese aspectual particle dao, it is of particular interest as an example of the way in which substrate languages may impinge on outer-circle varieties of English. The subtler semantic developments in new Englishes may not emerge from the smallish amounts of interactive discourse in ICE corpora, even if straightforward loans such as the discourse particle lah are represented well enough in the data.
The set of Englishes included in ICE is still limited. While it includes quite a few of those based on British English (e.g. Australian, New Zealand, Indian, Hong Kong English), there is only Philippine English to represent those based on American English. New ICE projects for the Bahamas, Fiji and Sri Lanka will extend the range, but the ICE network remains much more a coverage of Commonwealth Englishes than of “global English” per se. Without ICE-US and indeed ICE-Canada we still lack key reference points in world English, and the means of comparing the interplay of millennial British and American English on other inner and outer circle varieties of English. Their relative impacts on
expanding circle varieties such as Japanese, Chinese and Thai English could also be more effectively researched were there an ICE-US available alongside the other ICE-corpora.
2.2 Joybrato Mukherjee (University of Gießen): Corpus linguistics and linguistic ownership
In an often-quoted programmatic statement, Widdowson (1994) forcefully argues that in the light of the global spread of English, it is no longer native speakers alone who can claim ownership of the English language:
How English develops in the world is no business whatever of native speakers in England, the United States, or anywhere else. They have no say in the matter, no right to intervene or pass judgment. They are irrelevant. […] It is a matter of considerable pride and satisfaction for native speakers of English that their language is an international means of communication. But the point is that it is only international to the extent that it is not their language. It is not a possession which they lease out to others, while still retaining the freehold. Other people actually own it. (Widdowson 1994: 385)
Now, it is true that there are many more non-native than native speakers of English today – in this particular sense, it is obvious that English as a truly global language is no longer exclusively bound to native-speaker communities and their socio-cultural contexts. More specifically, it is generally accepted today that institutionalised second-language varieties around the world such as New Englishes in the Caribbean, in Africa, in Asia and in the Pacific region are norm-developing varieties in their own right that are – to some extent, at least – independent of exonormative standards set by native speakers. It should not go unmentioned, however, that even in well-established English as a Second Language (ESL) communities such as India, one typically observes what Kachru (passim) has repeatedly called ‘linguistic schizophrenia’. D’souza (1997) describes linguistic schizophrenia in the Indian context as follows:
We use English as if it belongs to us but the minute this is brought to our attention we get into a flap and say this is not our language. (D’souza 1997: 95)
Even in India, then, Widdowson’s (1994) position is not entirely reflected by local users’ attitude towards the English language: ownership does not seem to be an all-or-nothing attribute. What is more, the simple fact that one uses the English language regularly and competently does not automatically mean that one also feels one is the owner of the language. Using and owning a language are clearly two different things.
In my statement, I would like to concentrate on the increasing use of English as a lingua franca in intercultural communication by non-native speakers when communicating with other native and non-native speakers. Picking up on Widdowson’s (1994) stance, Seidlhofer (2001) has been in the vanguard of claiming linguistic ownership of English for everyone who uses English as a lingua franca. She writes that ELF speakers are usually not [...] concerned with emulating the way native speakers use their mother tongue within their own communities, nor with socio-psychological and ideological meta-level discussions. Instead, the central concerns for this domain are efficiency, relevance and economy in language learning and language use. (Seidlhofer 2001: 141)
It is certainly true that ELF is part of linguistic reality – Seidlhofer (2001) is right in criticising that ELF has been a ‘conceptual gap’ for too long. Once we accept the existence of ELF as an integral part of global English, it is self-evident that this very kind of English needs to be described on the basis of solid data. It is, thus, a very welcome development that various corpus projects – including Seidlhofer and Jenkins’s VOICE project and Anna Mauranen’s ELFA project – have been launched. They will provide us with a comprehensive picture of what ELF actually looks like and what happens in ELF communication.
What bothers me is not that ELF corpora are being compiled and analysed – quite the contrary. However, I have a niggling worry that by creating ELF corpora ELF is posited as a well-defined variety of English – which, in my view, it is not. ELF is an umbrella term for a multitude of variants, including all kinds of variants that we find in different learners with different L1 backgrounds and at various competence levels. ELF is a conglomerate of variants, but it is not a variety. What makes a variant – or a set of variants – a variety?
Nayar (1998) offers a list of ten linguistic, sociolinguistic, political and other features that are characteristic of a variety. At the risk of some gross over-simplification, I have noted down on the right-hand side whether or not the features can be found in ELF:
Linguistic features
1. Identifiably distinct formal features              ?
2. Internal consistency and systematicity             –
3. Lectal range to accommodate variation              –
Sociolinguistic features
4. Ethnolinguistic vitality                           ?
5. Distinctive cultural attributes and pragmatics     –
6. Standardisability and codifiability                –
Political features
7. International acceptance                           ?
8. Socio-political identity                           –
Other (desirable) features
9. Indigenous literature                              –
10. Distinct pragmatics                               ?
(List from Nayar 1998: 285)
As for linguistic features, it is possible to describe formal features of ELF in an ELF corpus – but whether they are sufficiently distinct is a different matter. The level of distinctness is presumably very low because ELF includes variants of speakers with all kinds of L1 backgrounds. There is no internal consistency and systematicity – apart from high-frequency deviances from native norms, which we would traditionally refer to as learner errors. There is not so much a lectal range but, more importantly, a range of different levels of competence. With regard to sociolinguistic features, we get a very similar picture: while we could argue that ELF is ethnolinguistically vital in that it provides a communicative vehicle for intercultural communication, ELF as such is, by definition, independent of any specific culture, distinctive cultural attributes and pragmatics. I cannot see how ELF could develop its own standard and how it could be codified as a well-defined variety. The international acceptance of ELF is a disputed issue, but clearly, ELF has no specific socio-political identity and no indigenous literature (can ELF be truly indigenous in the first place?). It is difficult to conceive of any distinct ELF-specific pragmatics; I would assume that ELF pragmatics is, at best, a convergence of the pragmatic systems of the cultures that are linked via ELF. The overall picture that emerges from this characterisation of ELF is that it is not a variety with which anyone actively and positively identifies himself or herself, that it is a makeshift code that is used to overcome language barriers in intercultural communication, that it is not bound to any specific culture and that, consequently, it is not ‘owned’, as it were, by anyone. ELF is a communicative epiphenomenon. The existence of ELF corpora should not lead us to believe that ELF is a variety of English – although it seems to be an attractive mainstream position at the moment. 
Note in this context that the same holds true for what has been labelled ‘Euro-English’. Mollin (2006) shows that Euro-English is, by and large, a fata morgana – true, English is used in Europe as a lingua franca, but there is no such thing as a Euro-English variety. What is more, Mollin’s (2006) results seem to drag the skeleton from the closet of many advocates of ELF-based models of English – the native speaker:
New standards need to be standards in the mind, too. Ideally, the speakers sampled in the [Euro-English] corpus should thus be asked whether they consider features which have emerged in the corpus to be potential markers of the new variety as correct, and whether they would use these themselves. [...] The results of both direct attitude elicitation parts and acceptability tests on supposedly Euro-English sentences, however, have demonstrated that the standard that European speakers follow and wish to follow is that of native speakers. (Mair and Mollin 2007: 347)
This is where it all comes full circle: it seems that native speakers have a say in the matter – because non-native speakers want them to. Non-native European users of English are a significant part of the lingua-franca users of English worldwide – Mollin’s (2006) study might thus have wider implications for ELF in general. As other studies show, many ELF speakers are oriented towards native-speaker norms and they do not want to learn and use a reduced variant of English that is still more or less intelligible or, as Jennifer Jenkins (passim) would put it, ‘communicatively successful’ despite its deviances from native-like usage. There is no point in ignoring the fact that the native speaker remains a relevant reference point for ELF speakers and learners of English. Thus, I cannot see why it should be useful to describe an international core of English across all ENL and ESL varieties and the myriad of variants of English that we subsume under ELF. The concept of a common core is a very useful one, but it should only be based on – and abstracted away from – full-fledged varieties of English.
2.3 Anna Mauranen (Helsinki University): English as a Lingua Franca
Corpus linguistics is an excellent means for discovering what L2 Englishes have in common – or, indeed, what all Englishes have in common, and where varieties differ. It is hard to think of serious alternatives to corpora for answering such questions. Even though corpora, for obvious reasons, have been heavily dominated by first language use and standard English, we can now move on and accept that L2 speakers constitute an important group of users who are different from ‘learners’. L2 speakers outnumber L1 speakers by about four to one these days, which means that we live in interesting times of potentially rapid changes in English. Large numbers of people use English for a wide range of purposes, many use it regularly in contexts which are important parts of their lives. Even though English is the medium of communication, the context is more often than not transcultural, and the location outside English-speaking countries. English is used as a global lingua franca – but English corpus linguistics is only beginning to take this development on board.
The brief for this panel was to discuss the international core of ELF, the ownership of English, i.e. the status of the native speaker, and the norms for global English. It seems to me that if there is a common core to lingua franca English it can most reliably be discovered by exploring relevant corpus data; but the existence of such a core is an empirical question. The ownership of English is a trickier issue, but at this point suffice it to say that the ownership cannot be limited to those who were born with a given language, because our relationships to the languages we encounter and acquire throughout our lives are prone to change: a new language can become more important than our first language. These changes can be radical and unexpected, especially in today’s globalised and unstable world. Even so, there is every reason to respect the special relationship people have with their first languages.
The question of norms and global English tends to unleash emotions. Some people seem to think that if any concessions are made to the legitimacy of global English, all standards will go down the drain, no norms will be respected and soon communication between different Englishes is going to be impossible. This is a sad picture, and a dire motive for holding on to a native speaker norm.
In the most basic sense, norms define what is normal. They are inherently evaluative, and they exert a powerful influence on people’s behaviour. We can roughly distinguish two kinds of linguistic norms: those which are prescribed and those which arise spontaneously. The first kind, norms which prescribe good usage, are institutionalized and sanctioned in many ways, largely through educational systems and normative reference works. We might call these imposed norms because they are sanctioned by authorities ‘higher’ than ordinary speakers. The second type, which can be called natural norms, originate in the self-regulation of speech communities or communities of practice. No institutional body controls them, and they can deviate considerably from standard language norms. Basically these are norms of use, emergent, uncodified, and a good deal more elusive than fixed standards. Natural norms tend to be receptive to innovations, and insofar as the innovations gain wide acceptance, they result in general language change and eventually find their way to the standard.
What the two kinds of norm have in common is an interest in ensuring efficient and effective communication; this is why any community regulates their language use. There is an inevitable tension between actual usage and the imposed standard. But this tension keeps within comfortable limits if the standard gets updated often enough and the updates are informed by changes in use – good corpora are invaluable for judgments of what to treat as a norm.
But how is the norm related to the native speaker? The native speaker is a problematic concept in that it is used to refer to both the ideal native speaker of certain linguistic theories and real-world native speakers, but the distinction is not always kept clear. Corpus linguistics is of course interested in the reality of language use. In the real world, not all native speakers are equally exemplary users of their language, certainly not equally good in all domains of use: while some may be good at giving public talks, and others at writing in an entertaining way, some excel in research writing, others again are fun to chat with. Some skills and genres are more highly valued in the linguistic market than others, and in compiling a norm-informing database we need to assess which genres and what uses we judge as worth including. Although for the non-native speaker it is ‘the native speaker’ that is held up wholesale as a desirable model, it is clear that this makes no sense at all for native speakers. What we need in a norm-informing corpus are instances of ‘good usage’, for example ‘educated English’ or some other limited section of the language, whether broadly or narrowly defined. If the native speaker is not an appropriate basis for an imposed norm in a native language community, is it really any more appropriate for non-natives? I would like to argue that it makes no more sense to define a standard for non-natives by simply pointing to a group of speakers who have the target language as a mother tongue than it would be for native speakers; a standard must be based on some model of good usage. But good usage need not be limited to native speakers; it ought to be independent of the speaker’s first language, as long as the usage of the target meets the criteria set for it. Non-native standards do not have to be any slacker than native standards, but they must be different because they apply to a different social and cultural context of use.
The natural norm is a less sensitive issue than the imposed norm. Natural norms arise in the self-regulating mechanisms that any speech community possesses: what features a speech community adopts, tolerates or rejects. Natural norms are of descriptive and theoretical interest to linguists, because they are manifest in language variation, in non-standard use, in New Englishes – and in ELF. This is where ELF really comes into its own; whether we want to speculate on a need for a world standard or a general ELF standard is not decisive for a scholarly interest in ELF. ELF speaking communities may not be regarded as speech communities in the ordinary sense, since for example they are not associated with a locality, but it is certainly true that many communities of practice have adopted ELF as their de facto language, and that the ensuing norms of use are regulated by the participants of those communities. ELF is also the language of wide and diffuse networks of uses and users. To find out how these specific social contexts of use develop and affect the shape of English, we need databases of their authentic language. We already have corpora of New Englishes and learner English, both of which are interesting and valuable in increasing our understanding of English; one exploring nativised varieties, the other tracking the developmental paths of individuals towards a target.
We need evidence from ELF to provide a missing link of using English in foreign language contexts outside settings where the speakers are positioned as learners. ELF provides an important basis for establishing what might be the necessary features of language – certainly of English – in situations of demanding and sophisticated use when there is no institutionalized basis for an imposed norm. By exploring these different kinds of databases, we can hope to come closer to answering questions on the similarities and differences in these hybrid Englishes, and trace their impact on English as a whole. ELF corpus data is capable of throwing light on mechanisms of language change, directions and patterns of the ways in which features travel in today’s globalised and multilingual world, and on social contexts of use not captured by other corpora. This has wider significance for language theory, as it reflects the unique situation in which virtually all languages in the world are in contact with one language. ELF research enables us to go beyond contact between two or very few languages, and beyond positing first language interference as the major, let alone only, explanatory factor behind deviations from native usage. It can help us understand the nature of emergent norms, and throw light on possible language universals or necessary features of language from a new perspective.
In sum, what I have suggested in this brief statement on what global English means to norms and corpus linguistics is:
(1) For imposed norms, we need to gather information on good usage independently of its origin.
(2) For natural norms we need to include ELF, for description and theoretical models.
3. Discussion
3.1 Accommodation
A question raised by the audience was whether ELF speakers accommodate more than native speakers. Anna Mauranen replied that we need to accommodate all the time when we speak to people with different language skills. She also pointed out that evidence from the ELF corpus compiled at Helsinki University indicates that speakers, in accommodating and in their use of repair sequences, appear to concentrate on content rather than form.
3.2 ELF – description and norms
Various members of the audience were ready to accept that corpus linguists could (and should) describe ELF, but wondered whether we needed norms for it. A widely held opinion was that we must be able to correct student errors, not merely accept them as part of their interlanguage or ELF. To this, Pam Peters replied that phenomena such as reduced morphology are tolerable in an ELF situation, but that classroom assessment cannot allow this – and that writing was “a different ball game altogether”. Similarly, Joybrato Mukherjee rejected the reduction of, for instance, the third person singular present tense –s as a permissible feature of ELF because native speakers would not accept it. To the question from the audience how we should deal with the assessment of student writing and conversation, Joybrato Mukherjee replied that it was necessary to distinguish between describing ELF and teaching it; he attacked the idea of using ELF as an international language between non-native speakers, pointing out that it was not a goal desired by learners. Anna Mauranen, on the other hand, wanted to separate ELF from teaching norms and was less convinced that native-speaker norms truly dominated. A related issue discussed was the enforcement of native-speaker norms in publishing. Members on the panel pointed out how even linguistics journals for English as a world language recommended that non-native speakers have a native speaker edit their texts prior to submission, and that generally, prescriptive norms are often applied in the editing of L1-area journals. To this, a member from the audience contributed his view as a former editor, stressing that he himself concentrated on content but that he’d had copy-editors to back him up who would
focus on the language side of editing. Anna Mauranen remarked that she had consistently avoided such language checking, maybe advocating indirectly that others do the same? A member from the audience pointed out the much more widespread use of English by immigrants, i.e. English as a second language (ESL) rather than ELF. The example given was the use of English by Eastern European immigrants moving into the UK. The suggestion was that, for those who did not want to fully master the language, an alternative would be to teach domain-specific forms like Business English, Agricultural English, etc. Another colleague pointed out how even native speakers have to ‘learn’ how to use domain-specific varieties, mentioning Eurocrat-speak for grant applications as an example. At this point in the discussion, Antoinette Renouf tried to elicit a North American response to a hitherto European series of descriptions. A colleague from Canada pointed out how accent and phonology were key features of language use in the global context. An American colleague mentioned the difference between migrating to an L1 context and assimilating in two generations (as in North America), and what is currently happening especially in parts of Asia where English is not the L1 and lacks the cultural roots, one example being the use of English in mainland China or English in Africa. On the question of norms and varieties, Marianne Hundt alluded to a statement by John Algeo that all varieties are fictions, but that they are useful fictions, wondering whether ELF, too, was a useful fiction. A member from the audience saw the creation of new norms (e.g. ELF norms) as possibly useful, but also mentioned them as potential channels for oppression. Marianne Hundt, taking up Anna Mauranen’s suggestion that ELF contexts constitute their own ‘communities of practice’, was wondering what the organizational frame would be that held them together.
Pam Peters suggested that web-based virtual communities might be one example. The challenge from the chair was that even if such communities of practice for ELF existed, ELF lacked an underlying system and therefore did not qualify as a variety of English. Anna Mauranen countered this argument by pointing out that systems change through usage. A member from the audience suggested that speakers can perhaps create joint sub-varieties. Anna Mauranen added that we collected corpora of learner English and did not find that surprising; collecting data of ELF use, she stressed, did not imply that a system called ‘ELF’ existed; an ELF corpus would merely reflect what existed in the world (pointing out the similarities with other dialects of English). A member from the audience pointed out that defending ELF was often seen as rejecting native norms, but that this was a false perception. From this, the discussion moved on to the political aspects in ascribing variety status to a phenomenon such as ELF. A member from the audience said that we were witnessing a shift to a true lingua franca, and the creation of ELF corpora would be a way of recording this shift. The chair pointed out that, if we research a phenomenon, people assume that the phenomenon has an underlying system and that this could have implications for language teaching which might eventually lead to the short-selling of learners.
3.3 Common core English – myth or reality?
To the question as to how real a common core for English was, Pam Peters replied that this was a highly abstract question. She pointed out that even high-frequency items found in corpora are often polysemous across national varieties, so that the notion of the common core may even be a rather elusive one empirically. A member of the audience added to this set of questions by asking whether something like ‘global English’ existed. A common answer to this question used to be that there were global Englishes, and the question now was whether we could expect norm convergence over time (a possible example mentioned was the world-wide-web as a locus where ‘global English’ might be observed as a result of global convergence). The member of the audience suggested that we should be looking at the divergences instead of converging trends and that the ICE corpora provided a good tool for this. Ending on a critical note, he pointed out that one of the problems was that ICE-GB was a corpus of educated London English, rather than “ICE-GB” for all of Great Britain.
4. Concluding remarks
The cautioning remark on ICE-GB (which is actually a sample of educated London English) brings us back to one of the questions raised in the introduction and that was also addressed by Pam Peters in her position statement, namely the question whether we have the right corpora for studying global English. Despite the wide scope of the ICE project, the corpora that we do have so far represent a tiny slice of the range of Englishes spoken and written within the Commonwealth. Obviously, to compile corpora with the coverage of something approximating the BNC is out of the question on a global scale, so one avenue for future research may be to exploit the world-wide-web for corpus building, both to complement some existing ICE corpora and to cover some of the ground that ICE has not covered so far (and is not likely to cover in the near future). The fact that the compilers of ICE-GB ended up compiling a corpus of educated London English rather than a corpus representative of all of Great Britain is closely connected to practical issues in corpus methodology – and we might have to be somewhat more cautious in our interpretation of results obtained from ICE data (not just with respect to the British component, but also – and especially – when working with the other ICE components).¹
Coming back to the initial statements, we may conclude that (a) we are still a far cry from being able to describe the international core of English and might never actually reach that goal; (b) the question of ‘ownership’ is still a controversial one and the panel discussion simply reflects that we are dealing with an unresolved issue; (c) the ‘standard ideology’ was not directly addressed by any of the participants but is an issue that surfaces in the discussion about the status of ELF and norms for teaching.
Notes
1. On a somewhat critical note: more detailed documentation than the existing manuals is needed. The ‘detail’ that we are missing so far is information on the compilation process and the decisions taken along that road – this kind of information would enable the corpus linguistic community to be more cautious in their interpretations of the results.
References
D’souza, J. (1997), “Indian English: some myths, some realities”, English World-Wide 18(1), 91-105.
Mair, C. & S. Mollin (2007), “Getting at the standards behind the standard ideology: what corpora can tell us about linguistic norms”, in: S. Volk-Birke and J. Lippert (eds.) Anglistentag 2006 Halle: Proceedings, Trier: WVT, 341-353.
Mollin, S. (2006), Euro-English: Assessing Variety Status. Tübingen: Gunter Narr.
Nayar, P.B. (1998), “Variants and varieties of English: dialectology or linguistic politics?”, in: H. Lindquist, S. Klintborg, M. Levin & M. Estling (eds.) The Major Varieties of English: Papers from MAVEN 97, Växjö 20-22 November 1997, Växjö: Växjö University, 283-289.
Seidlhofer, B. (2001), “Closing a conceptual gap: the case for a description of English as a lingua franca”, International Journal of Applied Linguistics 11, 133-158.
Widdowson, H.G. (1994), “The ownership of English”, TESOL Quarterly 28(2), 377-389.