Incorporating Corpora
TRANSLATING EUROPE Series Editors: Gunilla Anderman and Margaret Rogers University of Surrey
Incorporating Corpora The Linguist and the Translator Edited by
Gunilla Anderman and Margaret Rogers
MULTILINGUAL MATTERS LTD Clevedon • Buffalo • Toronto
Library of Congress Cataloging in Publication Data Incorporating Corpora: The Linguist and the Translator / Edited by Gunilla Anderman and Margaret Rogers. Includes bibliographical references and index. 1. Translating and interpreting. 2. Corpora (Linguistics) I. Anderman, Gunilla M. II. Rogers, Margaret P306.2.I525 2007 418'.02–dc22 2007000090 British Library Cataloguing in Publication Data A catalogue entry for this book is available from the British Library. ISBN-13: 978-1-85359-986-6 (hbk) ISBN-13: 978-1-85359-985-9 (pbk) Multilingual Matters Ltd UK: Frankfurt Lodge, Clevedon Hall, Victoria Road, Clevedon BS21 7HH. USA: UTP, 2250 Military Road, Tonawanda, NY 14150, USA. Canada: UTP, 5201 Dufferin Street, North York, Ontario M3H 5T8, Canada. Copyright © 2008 Gunilla Anderman, Margaret Rogers and the authors of individual chapters. All rights reserved. No part of this work may be reproduced in any form or by any means without permission in writing from the publisher. The policy of Multilingual Matters/Channel View Publications is to use papers that are natural, renewable and recyclable products, made from wood grown in sustainable forests. In the manufacturing process of our books, and to further support our policy, preference is given to printers that have FSC and PEFC Chain of Custody certification. The FSC and/or PEFC logos will appear on those books where full certification has been granted to the printer concerned. Typeset by Datapage International Ltd. Printed and bound in Great Britain by MPG Books Ltd.
Contents
Acknowledgements
Contributors
Introduction
1 The Linguist and the Translator
  Gunilla Anderman and Margaret Rogers
2 Parallel and Comparable Corpora: What is Happening?
  Tony McEnery and Richard Xiao
3 Universal Tendencies in Translation
  Anna Mauranen
4 Norms and Nature in Translation Studies
  Kirsten Malmkjær
5 Being in Text and Text in Being: Notes on Representative Texts
  Khurshid Ahmad
6 Translating Discourse Particles: A Case of Complex Translation
  Karin Aijmer
7 The Translator and Polish-English Corpora
  Tadeusz Piotrowski
8 The Existential There-construction in Czech Translation
  Jiří Rambousek and Jana Chamonikolasová
9 Corpora in Translator Training and Practice: A Slovene Perspective
  Špela Vintar
10 NP Modification Structures in Parallel Corpora
  Tamás Váradi
11 A Study of the Mandative Subjunctive in French and its Translations in English: A Corpus-based Contrastive Analysis
  Noëlle Serpollet
12 Perfect Mismatches: ‘Result’ in English and Portuguese
  Diana Santos
13 Corpora for Translators in Spain. The CDJ-GITRAD Corpus and the GENTT Project
  Anabel Borja
Acknowledgements
A number of people have helped us during the various stages of compiling this edited volume. A comprehensive overview of the use of electronic corpora in Translation Studies in a number of European areas required some diligence in tracking down language specialists in different countries. As a result, speed of production unfortunately had to be sacrificed to other considerations and we are most grateful to the contributors for the patience that they have shown. In addition, we owe a debt of gratitude to Multilingual Matters for allowing us the time to complete the volume in keeping with our original concept. We would also like to thank Akadémia Kiadó for allowing us to reproduce NP Modification Structures in Parallel Corpora in a revised version in Chapter 10. Last but certainly not least, our thanks, as always, go to Gillian James, this time more than ever; we are grateful not only for her attention to detail, persistence and patience, but also for her committed co-operation and initiative in bringing the volume together.
Gunilla Anderman
Margaret Rogers
Guildford
October 2006
Contributors: A Short Profile
Gunilla Anderman († April 2007) received her PhD in theoretical linguistics from University College London. She was for many years Professor of Translation Studies at the University of Surrey where she taught at post- as well as undergraduate level. Her most recent publications included ‘Linguistics and Translation’ in A Companion to Translation Studies, edited by P. Kuhiwczak and K. Littau (2007) and In So Many Words: Translating for the Screen (2007), co-edited with J. Díaz Cintas.
Khurshid Ahmad is Professor of Computer Science at Trinity College Dublin, Ireland and a visiting professor at the University of Surrey. He has published widely on a range of topics related to natural language, using methods and techniques of corpus linguistics and artificial intelligence to build multilingual terminology systems. His initial interest in corpora dates back to the late 1970s when he built corpora of Russian and German language texts for use in Computer-Assisted Language Learning. He has also published on the development of language in children using methods and techniques of neural computing. Currently, he is involved in building information-extraction systems for analysing Arabic and Chinese (news) corpora.
Karin Aijmer is Professor of English at Göteborg University, Sweden. Among her recent publications are Conversational Routines in English (1996), Convention and Creativity (1996) and English Discourse Particles. Evidence from a Corpus (2002). She is one of the editors of English Corpus Linguistics. Studies in Honour of Jan Svartvik (1991). More recently she has edited the volume Pragmatic Markers in Contrast (with Anne-Marie Simon-Vandenbergen, 2006). She is the leader of the project ‘Contrastive Studies in a Translation Perspective’. She is also responsible for the Swedish part in building a spoken learner corpus of advanced Swedish learners and has been involved in the collection of the Swedish component of the International Corpus of Learner English.
Anabel Borja is a sworn translator who has worked professionally for publishers, multinational companies, law firms and the courts. She is Senior Lecturer in legal translation at Universitat Jaume I, Castellón, Spain, where she co-directs the GITRAD research group. Her research is based on the comparative analysis and classification of legal texts
through the use of electronic corpora based on the concept of genre analysis. She is currently working on the creation of a Virtual Campus for Legal Translation and on an Expert Knowledge Management System Project for Legal Translation. Her publications include: El texto jurídico inglés y su traducción al español (2000) and the manual Estrategias, materiales y recursos para la traducción jurídica (2007).
Jana Chamonikolasová is Assistant Professor in the Department of English and American Studies in the Faculty of Arts, Masaryk University in Brno, Czech Republic. Her research fields include functional syntax (functional sentence perspective) and contrastive text analysis. She has published many scholarly papers and chapters in books on Czech-English contrasts, particularly in relation to prosody.
Kirsten Malmkjær is Professor of Translation Studies and Literary Translation at Middlesex University, UK, where she teaches on the BA in Translation Studies and the MA in the Theory and Practice of Translation. Her research specialisms, on which she has published and lectured widely, are translation theory, with a particular emphasis on matters of language and of philosophy, research methods, and the language of Hans Christian Andersen. Her most recent books include Linguistics and the Language of Translation (2006).
Anna Mauranen is Professor of English at the University of Helsinki, Finland. Her main research interests currently focus on English as a lingua franca and modelling spoken language. Her publications deal with corpus linguistics, speech corpora, translation studies, contrastive rhetoric and academic discourses. Her major publications include Linear Unit Grammar (with J. Sinclair, 2006), Translation Universals: Do They Exist? (co-edited with P. Kujamäki, 2004), Academic Writing, Intercultural and Textual Issues (co-edited with E. Ventola, 1996) and Cultural Differences in Academic Rhetoric (1993). She is currently running a corpus-based research project on spoken English as a lingua franca, the ELFA corpus.
Tony McEnery is Professor of English Language and Linguistics, Lancaster University and Director of Research at the Arts and Humanities Research Council in the UK. He has published widely in the area of corpus linguistics where his major interests are the contrastive study of aspect, epistemic modality and corpus-aided discourse analysis. Previous books include Corpus Linguistics (with A. Wilson, 1996), Corpus Annotation (with R. Garside and G. Leech, 1997), Aspect in Mandarin Chinese (with R. Xiao, 2004), Corpus-Based Language Studies (with R. Xiao and Y. Tono, 2005) and Swearing in English (2005).
Tadeusz Piotrowski graduated from Wroclaw University in 1980. Following a period working as a full-time translator, he began his research career working alternately at Opole University and Wroclaw University, focusing on lexicography, corpus linguistics and translation. He has compiled and edited a dozen or so bilingual dictionaries, published three books on lexicography, and about 100 papers and reviews. He has also served as an associate editor for the International Journal of Lexicography (OUP). He is currently Professor of Linguistics at Opole University.
Margaret Rogers has a PhD in Applied Linguistics and is Director of the Centre for Translation Studies at the University of Surrey, where she teaches terminology, translation and text analysis on the undergraduate and postgraduate programmes in Translation Studies. She initiated the Terminology Network in the Institute of Translation and Interpreting, UK, and is a founder member of the Association for Terminology and Lexicography. Her publications focus on terminology in text, particularly in LSP texts in translation.
Jiří Rambousek teaches in the Department of English and American Studies in the Faculty of Arts at Masaryk University in Brno, Czech Republic. His research fields, in which he has published a number of journal papers and book chapters, include the translation of fiction, from Václav Havel’s plays to children’s literature, and the use of parallel corpora. He is also a practising translator.
Diana Santos’s main interests are translation, corpora and evaluation. She has participated in the development of several corpus projects under the scope of Linguateca, such as AC/DC, Floresta and COMPARA. She holds a PhD from Instituto Superior Técnico, Lisbon, Portugal, 1996, on corpus-based semantic studies, leading to the publication of Translation-based Corpus Studies: Contrasting English and Portuguese Tense and Aspect Systems (2004). She is currently a research scientist at SINTEF, Oslo, leading the Linguateca project.
Noëlle Serpollet recently completed her PhD at Lancaster University with Professor Geoffrey Leech on the topic Should and the Subjunctive: A Corpus-based Approach of Mandative Constructions in English and in French. She is currently a lecturer in English linguistics and phonetics at the University of Orléans, France, and works on developing and analysing a large corpus of spoken French. Her research fields are corpus linguistics, contrastive analysis, translation theory and the history of the English language. Her theoretical framework is the French theory of enunciative and predicative operations (by A. Culioli).
Tamás Váradi is a graduate of Eötvös Loránd University, Hungary, in English, Spanish and General Linguistics. His early research interests focus on second language acquisition, error analysis and communication strategies. Following a period in the first half of the 1990s as a guest researcher at Lancaster University and as Hungarian Lector at SSEES, University of London, he returned to Hungary, where he is currently Head of the Department of Language Technology at the Research Institute of Linguistics, Hungarian Academy of Sciences.
Špela Vintar is an Assistant Professor in the Department of Translation at the University of Ljubljana, Slovenia. Her research interests include computational terminology, extraction of multiword expressions, multilingual text mining, translation technologies and semantic natural language processing. She has been involved in a wide range of projects covering text mining, corpus building and translation, and has published extensively in scholarly journals and edited collections on related topics.
Richard Xiao is a researcher in the Linguistics Department, Lancaster University, UK. His major research interests include corpus linguistics, contrastive and translation studies, and aspect theory. He is a co-author of Aspect in Mandarin Chinese (with Tony McEnery, 2004) and Corpus-Based Language Studies (with Tony McEnery and Y. Tono, 2005). The results of his research have been presented in refereed journals and at international conferences.
Introduction Broadly speaking, this volume falls into two parts. The first part discusses some aspects of Corpus Linguistics and its emerging role in Translation Studies. The second part includes a number of chapters dealing with corpora and translation in specific languages. The first chapter ‘The linguist and the translator’ by Gunilla Anderman and Margaret Rogers aims to trace some developmental links between aspects of Linguistics in the Firthian tradition and the subsequent use of corpora to study translation and translation-related phenomena. This chapter sets the scene for much of what is to follow in that it historically roots these developments in the British linguistic tradition of the study of actual texts, including semantic issues. This approach to the study of languages experienced a resurgence, once ever larger quantities of text could be stored and processed using computers and software tools. The contribution by Tony McEnery and Richard Xiao, ‘Parallel and comparable corpora: What is happening?’, provides information not only about the kind of corpus data that are available to the researcher and translator, but also about what is not yet available. In doing so, the authors review different types of multilingual corpora and explore the different purposes to which they may be put. A strategy for building multilingual corpora is outlined that meets both the needs of researchers and the needs of eventual consumers of corpus-based research. In presenting this programme of research, they draw on experience gained in a number of multilingual corpus-building projects in Lancaster over the past years, presenting the results of research past and present, while looking forward to what the future may hold. A close association has been established in the Translation Studies literature between corpus-based studies of translations and what have become known as ‘translation universals’. 
In her comprehensive survey of ‘Universal tendencies in translation’, Anna Mauranen starts out by charting the provenance of current studies on universals in the earlier Translation Studies literature. As Mauranen points out, however, the topic has not been uncontroversial. She reminds us of the useful distinction first established in typological studies of languages between absolute universals and general tendencies, the latter currently presenting the more promising path. Before moving on to a discussion of each of what she cautiously calls the hypothesised translation universals, Mauranen
reviews some of the methods used to study these empirically. The universals that she succinctly surveys are: explicitation, simplification, conventionalisation, unique items in translation (‘under-representation’), interference and untypical collocations. In her closely argued paper ‘Norms and nature in Translation Studies’, Kirsten Malmkjær sets out to clarify how universals relate to norms. She argues, basing her case on sociology and theoretical linguistics, that norms are sociocultural whereas universals are cognitive. In considering those processes that are usually given as examples of universal tendencies, Malmkjær argues that only one is a candidate for the status of ‘universal’ according to her interpretation, namely the under-representation of linguistic features that are characteristically found in the target language. In conclusion, Malmkjær argues that corpus studies are better suited to the search for evidence of norms rather than universals. In all corpus-based work, the issue of representativeness arises. It is this perspective that is considered in Khurshid Ahmad’s paper on ‘Being in text and text in being: Notes on representative texts’. Over the last 50 years, dictionary publishers and linguists have created a number of corpora, starting from 1-million-word corpora rising to billion-word corpora. The implicit claim of the corpus compilers is that the texts they have selected are in some sense representative: representative perhaps of a large number of language users or representative in the sense of a standard. In this contribution, the composition of four major corpora is discussed, indicating that there is a measure of objectivity in the selection of many text samples. Given that the pioneers of corpus linguistics were interested in the teaching of English as a second language, one can discern an emphasis on informative texts, that is, texts used in science and technology, at the expense of literary texts, in the compilation of the corpora.
However, Ahmad points out that there are instances where texts were published by a small group of publishers, based mainly in metropolitan areas, or where there is a gender imbalance between the authors of the texts. Ultimately, he argues, there is a degree of choice exercised by the compilers, but in this respect the behaviour of corpus linguists is not that different from that of much of the scientific community. Words and expressions known as discourse particles do not affect the truth condition of an utterance and tend to modify the speech act rather than what is actually talked about. Functionally they express attitudes or emotions and contribute to the coherence of the utterance. In her contribution to this volume, ‘Translating discourse particles: A case of complex translation’, Karin Aijmer examines the use of English ‘oh’ and its translation into Swedish. Her findings show that there is no single lexical equivalent of ‘oh’ in Swedish. Instead, the many meanings that may be read into ‘oh’ must be translated in a number of different ways: if simply
rendered by the standard Swedish equivalent ‘åh’, the resulting translation is not a natural-sounding construction in the target language. While electronic texts are now available on the web in many languages of the world, in the case of some languages, their highly inflectional morphological structure compounds the difficulties in compiling corpora. In ‘The translator and Polish-English corpora’, Tadeusz Piotrowski discusses the complexity of Polish morphology and the compilation of corpora, including the problem of ambiguity created by the similarity of inflectional endings. From his account of the situation with respect to corpus compilation in Poland, it is also shown that corpora providing Polish language data are not as readily available as in some other languages discussed in this volume. Although data are not altogether missing in Poland, the information provided appears to be of greater interest to corpus linguists, while it is still only of limited use to practising translators. In English, the so-called existential there has the status of a dummy subject, fulfilling a grammatical rather than semantic function. In contrast, Czech does not possess a lexical equivalent to the existential there in English; existence or occurrence is instead suggested by the intransitive character of the verb and the final position of the notional subject. For example, Bylo ticho (‘was silence’) corresponds to the English ‘there was silence’, while V domě bylo ticho (‘In the house was silence’) would also express ‘There was silence in the house’. In ‘The existential There-construction in Czech translation’, Jiří Rambousek and Jana Chamonikolasová examine English existential sentences from the perspective of translation practice. Using a corpus of parallel texts, they analyse how Czech translators deal with English there constructions and the syntactic and semantic means they use to achieve functional equivalence in Czech.
The contribution by Špela Vintar, ‘Corpora in translator training and practice: A Slovene perspective’, starts by presenting an overview of resources available in Slovenia, concluding that Slovene-English, in contrast to Slovene and other languages, is currently a language pair where it is possible to use parallel corpora effectively. This discussion is followed by an overview of mono- and multilingual corpora for Slovene; the role of corpora in translator training is described, as offered by the programme provided for the students in the Department of Translation at the University of Ljubljana. In view of the fact that the Slovene language community is one of the smallest in Europe, in the field of available bilingual language resources Slovene appears to be well catered for. In the contribution from Hungary ‘NP modification structures in parallel corpora’, Tamás Váradi discusses the use of corpora in investigating structures involving NP modification in English and Hungarian. In spite of their radically different internal structure in each language, it is shown that maximally extended NPs lend themselves more easily to a
comparison than parts of speech such as, for example, adjectives. Among cases that show a departure from the Adjective + Noun pattern, the author focuses on Hungarian constructions equivalent to English Noun + Prepositional Phrases. The chapter concludes with a discussion of the criteria that may be used to distinguish genuine cases of explicitation from constructions that require expansion in order to meet the requirements of grammar. In ‘A study of the mandative subjunctive in French and its translations in English: A corpus-based contrastive analysis’, Noëlle Serpollet investigates an aspect of grammar that has previously not been investigated bilingually. Indeed, the French-English language pair has so far received only limited contrastive attention using electronic corpora. Working first from French into English, the author analyses instances of the subjunctive in the Press category of the corpus and their translation into English. She then turns her attention to mandative constructions in English and examines how they are translated into French in the Learned Prose category. In conclusion, she compares her results with the findings obtained from an analysis of equivalent extracts from the 1-million-word corpus of British English LOB (1961) and its later counterpart, FLOB (1991). The study by Diana Santos, ‘Perfect mismatches: ‘Result’ in English and Portuguese’, compares one aspect of the English verbal system, namely the meaning of ‘result’ closely associated with the present perfect, with the situation in Portuguese. The comparison is made bidirectionally using the COMPARA corpus, consisting of parallel literary texts (English-Portuguese and Portuguese-English). The structure of the corpus allows comparisons to be made not only between source and target texts, but also between translations and original texts. Santos concludes that while little attention is paid to result in Portuguese, compared to English, other distinctions and patterns are apparent.
Her more general point is that a parallel corpus facilitates contrastive studies of a semantic kind, which help to identify complex differences even between related languages. The final chapter, contributed by Anabel Borja, aims to provide an overview of the corpora available in Spain that might prove useful to translators and translation researchers working with Spanish as a source or target language. To this end, the author presents corpora containing texts in Spanish, including all the corpus resources identified as useful tools. In conclusion, a description is given of the CDJ-GITRAD corpus, a Multilingual Corpus of Legal Documents and the GENTT project, in which a multilingual encyclopaedia of specialised texts for translators is currently being compiled, specifically intended for medical, technical and legal translators working with Spanish.
Chapter 1
The Linguist and the Translator GUNILLA ANDERMAN and MARGARET ROGERS
Introduction
In June 1956, J.R. Firth, holder of the first Chair of General Linguistics at the University of London, read a paper with the title ‘Linguistics and Translation’ to an audience at Birkbeck College, University of London. Firth concluded (1968a: 95):
The spread of world languages such as English, but not forgetting Russian, Chinese and Arabic, multiplies the need for translation from and into all these languages mutually and also into dozens of other languages which serve what has become more and more a common world civilization.
This observation presciently anticipated the need for translation that was to arise from the development of the European Economic Community (EEC), established in 1957, shortly after Firth’s lecture, with six members and four languages, to the European Union of 2007 with its 27 member states and 23 official languages, soon to be more. It also pointed to the rapid spread of English as a global language, although Spanish and Hindi are now joining Chinese and English to make up the top four most frequently spoken languages in the world.1
Linguistics and Translation: Early Pioneers
As a pioneer of the new discipline of linguistics in the UK, Firth’s insight into the nature of language not only led him to predict an increased need for translation, it also made him an early advocate of the study of meaning in linguistics. At a time when American structuralist linguists were attempting to exclude meaning from linguistic analysis along with all psychological, or as Bloomfield called it ‘mentalistic references’, Firth clearly realised the importance of the task of incorporating linguistic meaning into the science of language. And as his definition of meaning as ‘function in context’ suggests, he was well aware of the importance of running text of the kind that computers are now able to process. In looking at words in their context, he was not, however, the first linguist to understand that in isolation separate lexical items are less likely to reveal to us their actual meaning. Context provides an important link with earlier developments in foreign-language teaching, which in some ways foreshadowed what could
be termed the ‘communicative turn’ in language teaching in Western Europe in the latter part of the 20th century. Over half a century before Firth, Henry Sweet, a member of the Reform Movement, an ardent opponent of the exclusive concern with Latin and Greek among linguists and an advocate of the study of English as spoken, pointed out that detached sentences should not be substituted for connected texts as is often the case in the use of teaching methods that focus on grammar rather than on the text itself: ‘it is only in connected texts that the language itself can be given with each word in a natural and adequate context’ (Sweet, 1899/1964: 163). Awareness of the importance of not viewing words and constructions in isolation is also found in the work of Otto Jespersen, another member of the Reform Movement. In A Modern English Grammar on Historical Principles, Jespersen sets out to demonstrate the facts of English usage during different historical periods. Supporting his discussion throughout are examples culled from the English canon and other sources. As his corpus, Jespersen used the texts in English available to him. His painstaking extraction of illustrative examples, in an endeavour ‘to place grammatical phenomena in a true light’ (Jespersen, 1961: VI), was at the time a gargantuan task, now routinely achieved in machine-readable corpus studies through, for instance, automatic grammatical ‘tagging’ of words. While the importance of spoken language was vindicated by the establishment of the first Department of Phonetics at University College London in 1912 under the headship of Daniel Jones, who had studied under Henry Sweet, the notion of context was further developed by another London University scholar, the social anthropologist Bronislaw Malinowski.
Following his first field study to record the life and work of the Trobriand islanders of New Guinea in the South-west Pacific between 1915 and 1918, Malinowski clearly saw the need for linguistics in developing a school of social anthropology in London, in particular in relation to the establishment of reliable ethnographic texts. For Malinowski, the notion of translation into English was crucial in his anthropological studies and was extended to include the definition of a term by ethnographic analysis, that is, by placing it within its context of situation and its context of culture, ‘putting it within the set of kindred and cognate expressions, by contrasting it with its opposites, by grammatical analysis and above all by a number of well chosen examples [. . .] the only correct way of defining the linguistic and cultural character of a word’ (1935, II,16, discussed in Firth, 1968b: 151). Malinowski’s method, according to Firth, included a number of different stages of translation. First, he discusses what he refers to as an interlinear word-for-word translation, sometimes described as a ‘literal’ or ‘verbal translation’, ‘each expression and formative affix being rendered by its English equivalent’ (Firth, 1968b: 149). As a second step this was followed by a free translation in what Firth describes as ‘running
English’. Thirdly, the interlinear and free translations were collated, leading to the fourth stage, namely, the compilation of a detailed commentary, or ‘the contextual specification of meaning’, in which the free translation was related to the verbal translation including a discussion of ‘equivalents’ (Firth, 1968b: 149). The notion of ‘context’ that Firth embraces is that of Malinowski’s ‘context of situation’ in its widest sense: ‘It is clear that one cannot deal with any form of language and its use without assuming institutions and customs’ (Firth, 1968b: 156). Also anticipating the ambivalence often expressed by 21st-century translators towards dictionaries, Firth (1968b: 156) expresses his reservations about the traditional use of dictionaries, citing Malinowski: ‘I should agree that “the figment of a dictionary is as dangerous theoretically as it is useful practically” and, further, that the form in which most dictionaries are cast, whether unilingual or bilingual, is approaching obsolescence [. . .]’. In the current age of the fast-moving knowledge society, contextual solutions to terminological and phraseological problems are crucial to today’s translators, who frequently turn to on-line documentation or even customised electronic corpora as an alternative to traditional dictionaries as well as electronic term bases or term banks, which rarely fully contextualise meaning and use. Malinowski’s influence is still discernible today in the computer-based processing of texts. His concept of a ‘coefficient of weirdness’, referring to strange language which becomes less strange in its context of use, and linking the user of language and the things s/he is trying to influence or connect with, has been adopted and adapted to shape statistical procedures for identifying specialist terms semi-automatically, originally for translation purposes.
The basis of this is not any magical properties which specialist terms may exhibit, but rather their distributional characteristics compared with the distribution of content words in general-language texts, which are lexically less dense (cf. Ahmad & Rogers, 2001). As one of the first 20th-century linguists to show a concern with the importance of translation, in his 1956 paper Firth (1968a: 86) recognises four different types of translation. The first type he calls ‘creative translation’, intended primarily as literature in the language into which it is rendered by the translator. The second type of translation to which Firth refers is ‘official translation’, the kind of language transfer used in documents and treaties in so-called ‘controlled’ or ‘restricted’ languages, and most closely related to what today we would call specialist translation, in which terminological studies of special domains play an important part. The third type of translation selected by Firth for special attention is translation as used by linguists engaged in the description of a particular language, and the fourth, to which we return below, is ‘mechanical translation’. As an example of the third type, we can cite Firth’s description of an Indian novelist writing in English making repeated references to
Urdu and its pronominal system and terms of personal address, for which there are no equivalents in English. In this case, a contrastive analysis involving the principles of translation may help to illustrate the lack of equivalence between the different pronominal systems in the two languages. Firth also discusses the problem of carrying grammatical structures across what he calls the ‘bridge’ of translation, reflecting the contrastive method employed by the Prague linguists, who arrived at important insights into the differences in information structure between European languages. This important text-based characteristic can, in turn, be related to different formal characteristics of the languages in contrast (see Note 2). While the non-finite construction ‘your having done that will spoil your chances’ is prevalent in English, Firth (1968a: 92) points out that in translation into other European languages it is likely to require a separate clause with a verb in the finite form. In cases of this type, that is, of structural difference, he feels that linguistics, in providing this information, would be able to make a contribution to translation. In fact, one of the earliest English-language attempts to map out a systematic approach to the study of translation, Catford’s (1965: 1) study, explicitly acknowledges his debt to Halliday and, in turn, Firth: ‘The general linguistic theory made use of in this book is essentially that developed at the University of Edinburgh, in particular by M.A.K. Halliday and influenced to a large extent by the work of the late J.R. Firth’.
While Catford has been in later years much criticised for what has generally been described as an approach which reduces translation to a linguistic decoding/encoding exercise, interestingly, he espouses a contextual view of language ‘as related to the human social situation in which it operates’ (Catford, 1965: 1), echoing Firth’s concern with ‘context of situation’ and the functional orientation of Halliday’s model of grammar. His notion of ‘shifts’ in particular can be usefully applied to the description of translation solutions from a formal point of view, a perspective which may still be pedagogically useful and still has relevance for certain aspects of professional translation. For example, we could still imagine at least two further contexts of application for Firth’s observation on clause correspondence. Firstly, it would be a construction to avoid in the drafting of documents for translation into many other languages, such as in the European Union, although from a stylistic point of view it is the absence of variation that constitutes one of the reasons for the blandness of such international documents, contributing to what has come to be known as ‘Euro-English’ (cf. Wagner, 2005). Secondly, for similar reasons of translatability, the type of English-specific clause contraction described above would be a good candidate for elimination in the pre-editing stage of machine translation, which brings us to Firth’s fourth and last type of translation, namely ‘mechanical translation’. As an example of this type, he cites the work pioneered by Dr Andrew Booth who, by the end of 1952, had produced an electronic stored programme computer in full
operation at the Birkbeck College Computation Library, University of London. In this context, Firth also makes tentative reference to the ease with which set phrases and clichés may be handled by machine translation. The path to the present-day use of computers in translation, particularly in Computer-Assisted Translation with its stored ‘translation memory’ (that is, a database of pre-translated phrases and even sentences), is not difficult to detect, as repetitive formulations lend themselves most easily to this kind of treatment. While the increased power of modern-day computers speeds processing and facilitates the storage of large amounts of data in memory, the use of computer programs to ‘understand’ text through rule-based systems is still very limited (cf. Quah, 2006 for a summary); this goes some way to explaining why automatic translation systems, for instance, are highly constrained in their use, in relation either to subject field and genre, or to the purposes for which the output is fit (for example, information only). Statistically or lexically based systems may prove more fruitful, as foreshadowed in the early work on corpora by Sinclair and Halliday suggesting that ‘grammar’ is highly localised and lexically based (cf. ‘lexicogrammar’).

Throughout the discussion in ‘Linguistics and Translation’, the thinking that was to inspire Michael Halliday and John Sinclair is not difficult to detect. In words that were to be echoed by Halliday in discussions of his approach to a theory of grammar, Firth (1968a: 90, emphasis added) writes: ‘the whole of our linguistic behaviour is best understood if it is seen as a network of relations between people, things and events, showing structures and systems just as we notice in all our experience’. Equally discernible in Firth’s discussion is the semantic notion of ‘collocability’ (as discussed in Sinclair, 1966: 417).
Halliday and Sinclair: The ‘Neo-Firthians’

The debt of Halliday and Sinclair, sometimes called ‘neo-Firthians’, is clearly expressed in the edited collection In Memory of J. R. Firth (Bazell et al., 1966). Both acknowledge the importance of his notion of context of situation as inherited from Malinowski, as well as his concept of collocation, first used in the essay ‘Modes of Meaning’ (cf. Firth, 1968c: 204). His description of the term, often quoted by Halliday, is well known: ‘You shall know a word by the company it keeps! One of the meanings of “ass” is its habitual collocation with such other words as “you silly –”, “he is a silly –”, “don’t be such an –”’ (Firth, 1968c: 179). As Halliday (1966: 152) observes: ‘lexis seems to require the recognition merely of linear co-occurrence together with some measure of significant proximity, either a scale or at least a cut-off point. It is this syntagmatic relation which is referred to as “collocation”’. But in addition to collocational ‘span’ (the environment of a lexical item), a lexical description also appears to require
categories such as ‘simple’ and ‘compound’, as well as possibly ‘phrasal’ lexical items. Looking to the future and bearing in mind the potential of computers to formalise intuition about the behaviour of lexical items, Halliday (1966: 160, emphasis in original) further points out that ‘[a] thesaurus of English based on formal criteria, giving collocationally-defined lexical sets with citations to indicate the defining environments, would be a valuable complement to Roget’s brilliant work of intuitive semantic classification in which lexical items are arranged “according to the ideas which they express”’. In the absence of such a work, even ‘a table of the most frequent collocates of specific items [. . .] would be of considerable value for those applications of linguistics’, Halliday (1966: 160) suggests, ‘in which the interest lies not only in what the native speaker knows about his language but also in what he does with it’. These studies, he continues presciently, may include investigations of register and style, of child language, the language of aphasics and other target groups which, as we now know, became established fields of linguistic exploration during the latter part of the 20th century (Halliday, 1966: 160).

Alongside Michael Halliday, John Sinclair was one of the first scholars in the UK to explore the potential of computer text-processing for language description, focusing on lexical properties and, in particular, refining the notion of collocation to take account of how an item predicts the occurrence of others and is predicted by others (Sinclair, 1966). In those early days Sinclair predicted that it might not be possible for many years to establish how words relate to each other to create meaning in the context of the ‘total frequency of the two items’ (Sinclair, 1966: 428).
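The ‘table of the most frequent collocates’ envisaged by Halliday is today a routine computation. The short Python sketch below is offered only as an illustration and is not drawn from Halliday’s or Sinclair’s own procedures: the function name collocates, the default span of four word forms either side of the node, the crude tokenisation and the sample text are all assumptions made here for the purpose of the example.

```python
import re
from collections import Counter


def collocates(text, node, span=4):
    """Count the word forms found within `span` tokens either side of `node`.

    Hypothetical helper for illustration: tokenisation is a crude
    lower-casing regular expression, and the window ignores sentence
    boundaries, which a real study would probably want to respect.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


# Invented sample text echoing Firth's own example.
sample = ("Don't be such an ass. You silly ass! He is a silly ass, "
          "and the ass carried the load.")
table = collocates(sample, "ass", span=2)

# List collocates from most to least frequent.
for word, freq in table.most_common():
    print(word, freq)
```

A raw frequency table of this kind says nothing about whether a co-occurrence is significant; it gives only the ‘linear co-occurrence’ within a cut-off point that Halliday describes, and frequent grammatical words will tend to dominate it.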
In conclusion, Sinclair pronounces the practical problems to be immense although well compensated for by the prospect of opening up new ways of describing language through the theory of lexis.

At the time that Halliday and Sinclair were developing their theoretical framework for the study of lexis, in their own different ways incorporating semantics as derived from Firth’s ‘function in context’, a number of linguists were focussing their attention on grammar. The 1960s had seen the appearance of Chomsky’s Aspects of the Theory of Syntax (1965), and the subject of the nature of the rules involved in the generation of grammatical constructions was increasingly attracting interest among linguists. Some 40 years later, Sinclair (2007) was to claim that attempts to develop rule-based descriptions of any natural language had failed. According to Sinclair’s terse observation on computational linguistics as carried out in what became known as the natural language processing (NLP) community, it was both ‘headless’ (that is, without a theory, as the initially promising Chomskyan formalisms proved difficult to operationalise) and ‘legless’ (that is, without data in the form of naturally occurring language) (Sinclair, 2007: 24). Sinclair’s judgement is based on a simple evaluation measure: how well does the grammar (actually, partial grammar) account
for the behaviour of language as observed in ‘open text’, even relatively highly constrained text such as that found in so-called sublanguages or domain-specific varieties (as opposed to constructed examples)? For Sinclair (2007: 36), a portfolio of what he calls ‘local grammars’ (early work was on the grammar of definitions) rather than a general grammar of any particular language seems ‘closer to the way language is used’ and the most realistic prospect for the foreseeable future of the machine ‘understanding’ text. Sinclair’s method is corpus-based, teasing out ever more complex patterns reflecting meaning-form relations, usually starting with a particular token or word form. The seeds of this approach are already discernible in Sinclair’s 1966 paper, where he is critical of the notion of grammar as a set of systems (for example, active versus passive, a binary choice) which are treated independently. Instead, he turns to lexis ‘which describes the tendencies of items to collocate with each other’; such tendencies, he goes on, ‘cannot be got by grammatical analysis, since [they] cannot be expressed in terms of small sets of choices’ (Sinclair, 1966: 411). If we project from this monolingual position to the process of translation, from a linguistic point of view it could be argued that successful translation consists in finding authentic or ‘attested’ examples (cf. Firth’s ‘attested language’) of corresponding tendencies in the target language, problematising the notion of ‘equivalence’ more than ever. Under this view, the value of dictionaries, term banks and term bases organised around headwords or key concepts rather than larger ‘chunks’ of language becomes questionable.
In the context of the present, the widespread use of Translation Memory (TM), which could be said to ‘learn’ (in the sense of matching patterns with the support of a human user; it’s not that smart yet) to translate chunks of text above the word level, can be seen as a vindication of this approach within a narrow range of texts, in so far as TM is used to translate certain domain-specific genres of text of some length. It is also interesting to note in this context that corpus-based machine translation systems, including the statistical approach based on probabilities of occurrence in aligned parallel texts (the original exemplar being Canadian Hansard), are now serious contenders to rule-based systems (cf. Quah, 2006: 76–84). What Firth called ‘collocation’ has been extended as a notion by Sinclair and Halliday in the computer age, in a way which is reminiscent in some respects of the alternative term ‘automation’ proposed by the Prague linguists in the 1930s (cf. McEnery & Wilson, 2001: 24).
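The pattern matching on which a translation memory relies can be suggested in a few lines of Python. What follows is a minimal sketch under stated assumptions, not a description of any commercial TM product: the function name best_match, the similarity threshold of 0.8, the use of the standard-library difflib.SequenceMatcher as the similarity measure and the two invented English-German segment pairs are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def best_match(segment, memory, threshold=0.8):
    """Return (score, stored_source, stored_target) for the closest
    stored segment, or None if nothing reaches `threshold`.

    Hypothetical helper: character-level similarity stands in for the
    more elaborate matching a real TM system would use.
    """
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None


# Invented segment pairs standing in for a stored translation memory.
memory = {
    "Press the red button to stop the machine.":
        "Drücken Sie den roten Knopf, um die Maschine anzuhalten.",
    "Store the device in a dry place.":
        "Bewahren Sie das Gerät an einem trockenen Ort auf.",
}

# A near-repetition is retrieved for the translator to adapt ...
hit = best_match("Press the green button to stop the machine.", memory)
# ... whereas an unrelated sentence finds no usable match.
miss = best_match("Corpus linguistics became mainstream.", memory)
print(hit)
print(miss)
```

The retrieved target segment still has to be edited by the human user (here, the colour term), which is the sense in which such a system matches patterns rather than translates.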
The Emergence of Corpora for Linguistic Research

In the early days, the question also being asked was the extent to which corpus-based studies might be used in order to shed light on phenomena of grammar. Prominent among the grammatical constructions discussed was
the passive in English and the syntactic rules required in order to generate it. In 1966, Jan Svartvik published his PhD thesis as On Voice in the English Verb, based on material taken from the files of the Survey of English Usage representing co-existing varieties of spoken and written educated English. Svartvik’s research, carried out under the auspices of the director of the project, Randolph Quirk at University College London, set out to investigate ‘to what extent “corpus-passives” differ from “rule-generated passives” and from actives’ (Svartvik, 1966: 6).

The use of electronic corpora for linguistic research had its roots, according to McEnery and Wilson (2001: 21–22), in the mid-1950s in the pioneering and ambitious work by Alphonse Juilland on French, Spanish, Romanian and Chinese. Better-known work was carried out in English from the early 1960s at Brown University in the USA (Francis & Kučera, 1982), shadowed by the Lancaster-Oslo-Bergen (LOB) corpus of British English. But it was only in the 1970s that Randolph Quirk’s 1960s work at University College London on the Survey of English Usage was computerised by Jan Svartvik (McEnery & Wilson, 2001: 22).

The ‘corpus’ in Corpus Linguistics is nowadays always machine-readable. It is worth reflecting for a moment, however, that the digitisation of text was not always so self-evident for the computing community who, of necessity, were called upon to support and help drive forward the introduction of computers into linguistic analysis and its applications. In the 1960s and 1970s, and to an extent the 1980s, the main application of computer power was number crunching, using physically large mainframe computers which operated in batch mode with data input mechanisms which relied on punch cards or tape.
The use of ‘non-standard character sets’ (even in languages with Latin characters such as French, German and Spanish) was problematic, as was a perception among some computing colleagues that upper and lower case differences were of no import. So while the difficulties of monolingual text storage and processing were significant, extending this to multilingual material represented an even bigger culture change. Hurdles were gradually overcome, as interest boomed in the 1980s, particularly in English (cf. for example Johansson, 1982), along with technological developments, and by the mid-1990s, Corpus Linguistics had become ‘mainstream’ (Thomas & Short, 1996: ix) (for further information on the development of corpora for different purposes and of various kinds, cf. Aarts & Meijs, 1990; Aijmer & Altenberg, 1991; Butler, 1992; Garside et al., 1987; Kennedy, 1998). In addition to being used for empirical research into the patterns of actual language use, facilitated by the possibility of storing and processing ever larger amounts of digital text and of approaching the nevertheless ill-defined notion of ‘representativeness’ (cf. Ahmad, this volume), corpora were also used for the first time to support the compilation of grammars (for example Quirk, Greenbaum, Leech and Svartvik’s A Comprehensive Grammar of the
English Language (1985)) and dictionaries (for example, Sinclair’s 1987 Collins Cobuild English Language Dictionary). Around this time in the mid-1980s, suggestions were also emerging for the use of corpora in language learning (for example, Ahmad et al., 1985: 126–127). The general theme characterising all these developments was the study of language through observation rather than introspection or experimentation, promising richer and more sensitive descriptions of syntactic and lexical patterns. But of translation and electronic corpora there was little yet to be seen: the focus of discussions concerning ‘computers and translation’ was still largely machine translation (for example Lewis, 1992), most corpora remained monolingual and the development of text-alignment software was in the domain of research laboratories (for an overview of the development of text-alignment, cf. Véronis, 2000).
Corpus-based Studies and Translation

However, signs of interest did start to emerge in the 1980s, particularly in the Scandinavian countries, as Translation Studies started to develop into a discipline in its own right. Svartvik has since suggested that the obvious difficulties posed by introspection for non-native speakers of English engaged in English studies may have contributed to the popularity of the then rather unfashionable corpus approach in the early days, that is, in the 1960s (Svartvik, 2005). Twenty years later, linguists interested in corpus-based studies began to turn to the potential role of corpora in the study of translated texts, initially literary texts such as novels. At the Department of Swedish at Gothenburg University a selection of novels translated into Swedish from English together with a corresponding amount of text from novels originally written in Swedish, all published between 1976 and 1977, has long provided the basis for studies into the features characterising texts in translation (cf. Gellerstam, 1986 for a more detailed discussion). The increased volume of text which could be studied using machine-based search and retrieval methods gave rise to insights that would have been difficult to ‘observe’ using traditional paper-based methods. Gellerstam was able to show that the distribution of words in translated texts differs from that in original texts, casting new light on the hoary chestnut of ‘translationese’.

In 1989 further corpus-based information about aspects of translation between English and Swedish was made available through the publication of Hans Lindquist’s study of English adverbials in translation (Lindquist, 1989), presented as a PhD dissertation at the University of Lund under the supervision of Jan Svartvik. Lindquist’s text corpus of five British and five American novels was used to provide a ‘corpus’ (a lexicographical understanding of the term) of 2000 adverbials, at first stored on cards, and then partly
computerised (Lindquist, 1989: 31), reflecting well the ongoing developments at that time in the move from paper to computer.

Originally firmly rooted in the ‘paper tradition’ with its provenance in the late 1970s is Gideon Toury’s descriptive approach to works of literary translation in which so-called ‘norms’, which Baker (2001: 163) interprets as ‘regularities of translation behaviour within a specific sociocultural situation’, are studied. What Toury’s approach shares with the neo-Firthian legacy is not the close analysis of linguistic behaviour in text, but the emphasis on the description of ‘attested’ texts (here, literary translations) rather than an idealised, in his view prescriptive, notion. While it has been suggested that Baker’s contribution to the use of corpora in Translation Studies lies in the application of Corpus Linguistics to Descriptive Translation Studies (cf. for instance, Kenny, 2001a: 50), the resulting work on so-called ‘universals of translation’ (see Note 3) has a strong linguistic flavour, for example, a tendency to explicitation, disambiguation, simplification, grammatical conventionality and ‘a tendency to overrepresent typical features of the target language’ (Mauranen, this volume), echoing earlier work such as that of Gellerstam.

As we have seen, Corpus Linguistics was originally centred on monolingual corpora (although incorporating both written and spoken texts). Baker rightly points out (1995: 225–226) that additional criteria of corpus design beyond those of, for instance, general versus restricted domain, synchronic versus diachronic, genre, geographical variant, were needed for translation research, including range of translators and respective genre in each language.
While these criteria are potentially important, the former is hard to fulfil in practice as even the authorship of original texts, let alone translations, is often hard to trace for many ‘pragmatic texts’ (Gebrauchstexte); such considerations are important when research into language for special purposes (LSP) translation (Fachübersetzen) is becoming more prominent as a research area in Translation Studies, including corpus-based studies. It is also worth noting that the typologies of corpora for translation and translation-related research vary (compare, for instance, Baker, 1995, Bernardini et al., 2003 and Teubert, 1996 for variations on the main themes in the context of translator education). The ‘value’ of translated texts in what are usually called ‘parallel corpora’ (source texts and their translations) may also vary depending on the purpose of the study. In Descriptive Translation Studies the point is not to evaluate but to describe and explain translation behaviour, while in bilingual lexicography or terminography (cf. Ahmad & Rogers, 2001; Teubert, 1996) the aim is to establish lexical equivalents that will function in new texts in the target language alongside authentic texts in that language. For such purposes, translated texts may provide skewed data for the very reason that
interests descriptivists: that translations differ, sometimes in subtle ways, from comparable original texts.

To conclude with the new century, publications on corpora and translation started to appear with regularity, some with a specific research brief (for example Kenny, 2001b using a German-English parallel corpus of literary texts), some with a pedagogical bent (for example Zanettin et al., 2003, covering both literary and specialist texts for a variety of training purposes), some with a broad sweep (for example Olohan, 2004) and others with a linguistic focus (for example Aijmer & Hasselgård, 2004).

Over 30 years after the publication of his paper in honour of Firth, in considering the risky and rather unfashionable question of what might constitute a theory of good translation, Halliday (2001: 13–14) still calls on Firth’s broad sense of meaning as operating ‘at all linguistic strata’, including both expression and content, in order to ‘be able to explain why [the text] means what it is understood to mean’. He sees this as a step towards understanding equivalence, which he predictably problematises as a systemic issue in which ‘equivalence at different strata carries differential values’, suggesting that ‘contextual equivalence [is valued] perhaps most highly of all’ (Halliday, 2001: 15). This view is nuanced in the context of what Halliday (2001: 15) calls ‘the task’, which may lead to differential assignments of value in parts of the system.

The early semantic seeds produced by the husbandry of Malinowski and sown by Firth were nurtured by linguists such as Halliday and Sinclair and can be seen to have taken firm root in the work of what we now know as ‘Corpus Linguistics’. The relationship of Corpus Linguistics to translation is one that is still developing between system and text; the current volume is a contribution to this development.

Notes

1. David Graddol: personal communication, June 2006.
2. Cf. Firbas (1999) in relation to Firth’s ‘creative translation’, and Rogers (2006) with respect to ‘official translation’.
3. Cf. Mauranen (this volume) for an overview, and Malmkjær (this volume) for a critical review of norms and universals.
References

Aarts, J. and Meijs, W. (eds) (1990) Theory and Practice in Corpus Linguistics. Amsterdam: Rodopi.
Ahmad, K. (this volume) Being in text and text in being: Notes on representative texts. Incorporating Corpora. Clevedon: Multilingual Matters.
Ahmad, K., Corbett, G., Rogers, M. and Sussex, R. (1985) Computers, Language Learning and Language Teaching. Cambridge: CUP.
Ahmad, K. and Rogers, M. (2001) Corpus linguistics and terminology extraction. In S.-E. Wright and G. Budin (eds) Handbook of Terminology Management (Vol. 2, pp. 725–760). Amsterdam/Philadelphia: John Benjamins.
Aijmer, K. and Altenberg, B. (eds) (1991) English Corpus Linguistics. London and New York: Longman.
Aijmer, K. and Hasselgård, H. (eds) (2004) Translation and Corpora. Selected papers from the Göteborg-Oslo Symposium 18–19 October 2003. Göteborg: Acta Universitatis Gothoburgensis.
Baker, M. (1995) Corpora in translation studies: An overview and some suggestions for future research. Target 7 (2), 223–243.
Baker, M. (2001) Norms. In M. Baker (ed.) Routledge Encyclopedia of Translation Studies (pp. 163–165). London: Routledge (assisted by K. Malmkjær).
Bazell, C.E., Catford, J.C., Halliday, M.A.K. and Robins, R.H. (eds) (1966) In Memory of J.R. Firth. London and Harlow: Longmans, Green and Co.
Bernardini, S., Stewart, D. and Zanettin, F. (2003) Corpora in translator education. An introduction. In F. Zanettin, S. Bernardini and D. Stewart (eds) Corpora in Translator Education (pp. 1–13). Manchester: St. Jerome.
Butler, C.S. (ed.) (1992) Computers and Written Texts. Oxford and Cambridge, MA: Blackwell.
Catford, J.C. (1965) A Linguistic Theory of Translation. An Essay in Applied Linguistics. London: Oxford University Press.
Chomsky, N. (1965) Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Collins Cobuild English Language Dictionary (1987) London and Glasgow: Collins.
Firbas, J. (1999) Translating the introductory paragraph of Boris Pasternak’s Doctor Zhivago: A case study in Functional Sentence Perspective. In G. Anderman and M. Rogers (eds) Word, Text, Translation. Liber Amicorum for Peter Newmark (pp. 129–141). Clevedon: Multilingual Matters.
Firth, J.R. (1968a) Linguistics and translation. In F.R. Palmer (ed.) Selected Papers of J.R. Firth, 1952–59 (pp. 84–95). London and Harlow: Longmans, Green and Co.
Firth, J.R. (1968b) Ethnographic analysis and language with reference to Malinowski’s views. In F.R. Palmer (ed.) Selected Papers of J.R. Firth, 1952–59 (pp. 137–167). London and Harlow: Longmans, Green and Co.
Firth, J.R. (1968c) A synopsis of linguistic theory. In F.R. Palmer (ed.) Selected Papers of J.R. Firth, 1952–59 (pp. 168–205). London and Harlow: Longmans, Green and Co.
Francis, W.N. and Kučera, H. (1982) Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin (with the assistance of A. Mackie).
Garside, R., Leech, G. and Sampson, G. (eds) (1987) The Computational Analysis of English: A Corpus-based Approach. London: Longman.
Gellerstam, M. (1986) Translationese in Swedish novels translated from English. In L. Wollin and H. Lindquist (eds) Translation Studies in Scandinavia. Proceedings of the Scandinavian Symposium on Translation Theory (SSOTT) (pp. 88–95). Lund: Lund Studies in English 75.
Halliday, M.A.K. (1966) Lexis as a linguistic level. In C.E. Bazell, J.C. Catford, M.A.K. Halliday and R.H. Robins (eds) In Memory of J.R. Firth (pp. 148–162). London and Harlow: Longmans, Green and Co.
Halliday, M.A.K. (2001) Towards a theory of good translation. In E. Steiner and C. Yallop (eds) Exploring Translation and Multilingual Text Production: Beyond Content (pp. 13–18). Berlin and New York: Mouton de Gruyter.
Jespersen, O. (1961) A Modern English Grammar on Historical Principles (Vols I–VII). London: Allen and Unwin; Copenhagen: Ejnar Munksgaard.
Johansson, S. (ed.) (1982) Computer Corpora in English Language Research. Bergen: Norwegian Computing Centre for the Humanities.
Kennedy, G. (1998) An Introduction to Corpus Linguistics. London and New York: Longman.
Kenny, D. (2001a) Corpora in translation studies. In M. Baker (ed.) Routledge Encyclopedia of Translation Studies (pp. 50–53). London: Routledge (assisted by K. Malmkjær).
Kenny, D. (2001b) Lexis and Creativity in Translation: A Corpus-based Study. Manchester: St. Jerome.
Lewis, D. (1992) Computers and translation. In C. Butler (ed.) Computers and Written Texts (pp. 75–113). Oxford and Cambridge, MA: Blackwell.
Lindquist, H. (1989) English Adverbials in Translation. A Corpus Study of Swedish Renderings. Lund: Lund University Press.
Malmkjær, K. (this volume) Norms and nature in translation studies. Incorporating Corpora. Clevedon: Multilingual Matters.
Mauranen, A. (this volume) Universal tendencies in translation. Incorporating Corpora. Clevedon: Multilingual Matters.
McEnery, T. and Wilson, A. (2001) Corpus Linguistics. An Introduction (2nd edn). Edinburgh: Edinburgh University Press.
Olohan, M. (2004) Introducing Corpora in Translation Studies. London and New York: Routledge.
Palmer, F.R. (ed.) (1968) Selected Papers of J.R. Firth, 1952–59. London and Harlow: Longmans, Green and Co.
Quah, C.K. (2006) Translation and Technology. Basingstoke: Palgrave Macmillan.
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. (1985) A Comprehensive Grammar of the English Language. London: Longman.
Rogers, M. (2006) Structuring information in English: A specialist translation perspective on sentence beginnings. The Translator 12 (1), 29–64.
Sinclair, J.M. (1966) Beginning the study of lexis. In C. Bazell, J.C. Catford, M.A.K. Halliday and R.H. Robins (eds) In Memory of J.R. Firth (pp. 410–430). London and Harlow: Longmans, Green and Co.
Sinclair, J.M. (2007) Language and computing, past and present. In K. Ahmad and M. Rogers (eds) Evidence-based LSP. Translation, Text and Terminology (pp. 22–51). Bern: Peter Lang.
Svartvik, J. (1966) On Voice in the English Verb. The Hague/Paris: Mouton.
Svartvik, J. (2005) Edited extract from ‘A Life in Linguistics’. The European English Messenger 14 (1), 34–44. On WWW at http://www.ucl.ac.uk/english-usage/about/svartvik.htm. Accessed 17.9.06.
Sweet, H. (1899/1964) The Practical Study of Languages: A Guide for Teachers and Learners. London: Dent (republished by Oxford University Press).
Teubert, W. (1996) Comparable or parallel corpora? International Journal of Lexicography 9 (3), 238–264.
Thomas, J. and Short, M. (eds) (1996) Using Corpora for Language Research. Studies in Honour of Geoffrey Leech. London and New York: Longman.
Véronis, J. (ed.) (2000) Parallel Text Processing. Alignment and Use of Translation Corpora. Amsterdam: Kluwer.
Wagner, E. (2005) Translation and/or editing – the way forward? In G. Anderman and M. Rogers (eds) In and Out of English: For Better, for Worse? (pp. 214–226). Clevedon: Multilingual Matters.
Zanettin, F., Bernardini, S. and Stewart, D. (eds) (2003) Corpora in Translator Education. Manchester: St. Jerome.
Chapter 2
Parallel and Comparable Corpora: What is Happening?
TONY MCENERY and RICHARD XIAO
Introduction
With ever increasing international exchange and accelerated globalisation, translation and contrastive studies are more popular than ever. As part of this new wave of research on translation and contrastive studies, corpora, and multilingual corpora in particular, are playing an increasingly prominent role. In this chapter, we will illustrate the value of parallel and comparable corpora to translation and contrastive studies.
From the 1980s onwards corpus linguistics has developed at an ever accelerating rate. While the construction and exploitation of English language corpora still dominates research in corpus linguistics, corpora of other languages, European as well as Asian languages such as Chinese, Korean and Japanese, have also become available and have notably added to the diversity of corpus-based language studies.1 In addition to monolingual corpora, parallel and comparable corpora have been a key focus of non-English corpus linguistics, largely because corpora of these two types are important resources for translation and contrastive studies. As Aijmer and Altenberg (1996: 12) observe, parallel and comparable corpora ‘offer specific uses and possibilities’ for contrastive and translation studies:
(1) They give new insights into the languages compared, insights that are not likely to be gained via the study of monolingual corpora.
(2) They can be used for a range of comparative purposes and increase our knowledge of language-specific, typological and cultural differences, as well as of universal features.
(3) They illuminate differences between source texts and translations, and between native and non-native texts.
(4) They can be used for a number of practical applications, e.g. in lexicography, language teaching and translation.
In this chapter, we will explore the potential value of such multilingual corpora. Before we do so, however, it is necessary to clarify some terminological issues.
Multilingual Corpora: Terminological Issues
When we refer to a corpus involving more than one language as a multilingual corpus, the term multilingual is used in a broad sense. A multilingual corpus, in a narrow sense, must involve at least three languages, while those involving only two languages are conventionally referred to as bilingual corpora. In this chapter, we are using multilingual to cover the bilingual case also. Given that corpora involving more than one language are a relatively new phenomenon, with most research hailing from the early 1990s (e.g. the English–Norwegian Parallel Corpus (ENPC), see Johansson & Hofland, 1994),2 it is unsurprising to discover that there is some confusion surrounding the terminology used in relation to these corpora. Generally, there are three types of corpora involving more than one language:
(1) Type A: Source texts plus translations, e.g. Canadian Hansard (cf. Brown et al., 1991), CRATER (cf. McEnery & Oakes, 1995).
(2) Type B: Monolingual subcorpora designed using the same sampling frame, e.g. the Aarhus corpus of contract law (cf. Faber & Lauridsen, 1991).
(3) Type C: A combination of A and B, e.g. the ENPC (cf. Johansson & Hofland, 1994), the EMILLE corpus.3
Different terms have been used to describe these types of corpora. For Aijmer and Altenberg (1996) and Granger (1996: 38), type A is a translation corpus whereas type B is a parallel corpus; for McEnery & Wilson (1996: 57), Baker (1993: 248; 1995; 1999) and Hunston (2002: 15), type A is a parallel corpus whereas type B is a comparable corpus; and for Johansson & Hofland (1994) and Johansson (1998: 4) the term parallel corpus applies to both types A and B. Barlow (1995; 2000: 110) certainly interpreted a parallel corpus as type A when he developed the ParaConc corpus tool. It is clear that some confusion centres on the term parallel.
When we define different types of corpora, we can use different criteria, for example, the number of languages involved, and the content or the form of the corpus. But when a criterion is decided upon, the same criterion must be used consistently. For example, we can say a corpus is monolingual, bilingual or multilingual if we take the number of languages involved as the criterion for definition. We can also say a corpus is a translation (L2) or a non-translation (L1) corpus if the criterion of corpus content is used. But if we choose to define corpus types by the criterion of corpus form, we must do so consistently. Then we can say a corpus is parallel if the corpus contains source texts and translations in parallel, or it is a comparable corpus if its subcorpora are comparable by applying the same sampling frame. It is illogical, however, to refer to corpora of type A as translation corpora by the criterion of content while referring to corpora of type B as
comparable corpora by the criterion of form. Consequently, in this chapter, we will follow the terminology of McEnery et al. and Baker in referring to type A as parallel corpora and type B as comparable corpora. As type C is a mixture of the two, corpora of this type should be referred to as comparable corpora in a strict sense.
A parallel corpus can be defined as a corpus that contains some source texts and their translations. Parallel corpora can be bilingual or multilingual. They can be unidirectional (e.g. from English into Italian or from Italian into English alone), bidirectional (e.g. containing both English source texts with their Italian translations and Italian source texts with their English translations) or multidirectional (e.g. the same piece of writing with English, French and German versions). In this sense, texts that are produced simultaneously in different languages (e.g. EU and UN regulations) also belong to the category of parallel corpora (cf. Hunston, 2002: 15).
In contrast, a comparable corpus can be defined as a corpus containing components that are collected using the same sampling frame and similar balance and representativeness (cf. McEnery, 2003: 450), e.g. the same proportions of the texts of the same genres in the same domains in a range of different languages in the same sampling period. However, the subcorpora of a comparable corpus are not translations of each other. Rather, their comparability lies in their shared sampling frame and similar balance. By our definition, corpora containing components of varieties of the same language (e.g. the International Corpus of English (ICE)) are not comparable corpora, contrary to what is suggested in the literature (e.g. Hunston, 2002: 15), because all corpora, as a source for linguistic research, have ‘always been preeminently suited for comparative studies’ (Aarts, 1998), either intralingual or interlingual.
Brown, LOB, Frown and FLOB are typically designed for comparing language varieties synchronically and diachronically. The British National Corpus (BNC), while designed to represent modern British English, is also a useful basis for various intralingual studies (e.g. spoken versus written, monologue versus dialogue, and variation caused by socioeconomic parameters). Nevertheless, these corpora are generally not referred to as comparable corpora.
While parallel and comparable corpora are supposed to be used for different purposes (that is, translation and contrastive studies respectively; see next section), the two are also designed with different focuses. For a comparable corpus, the use of an appropriate sampling frame is essential. The components representing the languages involved must match each other in terms of proportion, genre, domain and sampling period. For a parallel corpus, the sampling frame is less relevant, as all of the corpus components are exact translations of each other, although the source texts for a parallel corpus may of course be selected using a sampling frame. However, this does not mean that the construction of parallel corpora is easier. For a parallel corpus to be useful, an essential step
is to align the source texts and their translations, that is, to produce a link between the two at the sentence level, word level or some other level. Yet the automatic alignment of parallel corpora is not a trivial task for some language pairs (cf. Piao, 2000, 2002).
Depending on the specific research question, a specialised corpus (that is, one containing texts of a particular type, for example computer manuals) or a general corpus (that is, a balanced one containing as many text types as possible) should be used. Parallel and comparable corpora can be of either type. For terminology extraction, specialised parallel and comparable corpora are clearly of use, while for the contrast of general linguistic features such as tense and aspect, balanced corpora, which are supposed to be more representative of any given language in general, would be used. Existing parallel corpora show that corpora of this type tend to be specialised (for example focussing on contract law or genetic engineering). This is quite natural, considering the availability of translated texts by genre (in machine-readable form) in different languages (cf. Aston, 1999; Johansson & Hofland, 1994: 27; Mauranen, 2002: 166), and indeed, as will be seen later in our discussion, specialised parallel corpora can be especially useful in domain-specific translation research.4 While most of the existing comparable corpora are also specialised, it is easier to find comparable text types in different languages. Therefore, compared with parallel corpora, comparable corpora are more likely to be designed as general balanced corpora. For instance, the Korean National Corpus (Park, 2001) and the Chinese National Corpus (Zhou & Yu, 1997) have adopted a sampling frame quite similar to that of the BNC; hence these corpora can form a balanced comparable corpus that makes contrastive studies of these three languages possible.
Parallel and comparable corpora are used primarily for translation and contrastive studies.
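The sentence-level alignment step mentioned above can be illustrated with a deliberately simplified sketch. It is not the method of any tool or study cited in this chapter: it assumes only that a translation tends to have a length roughly proportional to its source, and it searches by dynamic programming for the cheapest sequence of 1-1, 1-0, 0-1, 2-1 and 1-2 sentence groupings. The function names, the penalty values and the `ratio` parameter are all hypothetical choices made for illustration.

```python
import math

# Allowed groupings ("beads"): (source sentences, target sentences) consumed,
# each with a fixed penalty. The penalty values are illustrative assumptions.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}


def length_cost(src_len, tgt_len, ratio):
    """Penalise beads whose target length departs from ratio * source length."""
    if src_len == 0 or tgt_len == 0:
        return 0.0  # omissions and insertions pay only the fixed bead penalty
    return abs(math.log(tgt_len / (ratio * src_len)))


def align(src, tgt, ratio=1.0):
    """Return a list of (source_sentences, target_sentences) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len, ratio)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk back from the end of both texts to recover the chosen beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return beads[::-1]


if __name__ == "__main__":
    # Invented toy data: the second target sentence merges two source sentences.
    source = ["aaaaaaaaaa", "bbbbb", "ccccc"]
    target = ["AAAAAAAAAA", "BBBBBCCCCC"]
    for s, t in align(source, target):
        print(s, "<->", t)
```

Character length is a crude signal, and for language pairs with very different writing systems the `ratio` would at least need to be estimated from data, which is one reason why alignment is harder for some pairs than for others.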
The two types of corpora have their own advantages and disadvantages, and thus serve different purposes. While the source and translated texts in a parallel corpus are useful for exploring ‘how the same content is expressed in two languages’ (Aijmer & Altenberg, 1996: 13),5 alone they serve as a poor basis for cross-linguistic contrasts, because translations (that is, L2 texts) cannot avoid the effect of so-called translationese (cf. Baker, 1993: 243–245; Gellerstam, 1996; Hartmann, 1985; Laviosa, 1997: 315; McEnery & Wilson, 2001: 71–72; McEnery & Xiao, 2002; Teubert, 1996: 247). In contrast, while the components of a comparable corpus overcome translationese by populating the sampling frame with L1 texts from different languages, they are less useful for the study of how a message is conveyed from one language to another. In addition, the development of application software for machine-aided and machine translation, while it may be based on comparable data, has clearly benefited from having access to parallel data, for example to bootstrap example-based machine translation (MT) systems (see next section).
Nonetheless, comparable corpora are a useful resource for contrastive studies and translation studies when used in combination with parallel corpora. Note, however, that comparable corpora can be a poor basis for contrastive studies if the sampling frames for the comparable corpora are not fully comparable. In the section that follows, we will illustrate, through examples, the value of corpora, particularly parallel and comparable corpora, to translation and contrastive studies.
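The requirement that sampling frames be ‘fully comparable’ can be made concrete with a small check. The sketch below is an illustration only, under the assumption that each subcorpus is available as a list of (genre, word count) records; it compares the genre proportions of two subcorpora and reports their total variation distance, where 0 means identical proportions and 1 means no overlap at all. The function names and the sample figures are hypothetical.

```python
from collections import Counter


def genre_proportions(docs):
    """Map each genre to its share of the subcorpus, measured in words.

    docs is an iterable of (genre, word_count) pairs.
    """
    totals = Counter()
    for genre, words in docs:
        totals[genre] += words
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("subcorpus contains no words")
    return {genre: words / grand_total for genre, words in totals.items()}


def frame_mismatch(docs_a, docs_b):
    """Total variation distance between two genre distributions."""
    pa, pb = genre_proportions(docs_a), genre_proportions(docs_b)
    genres = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(g, 0.0) - pb.get(g, 0.0)) for g in genres)


if __name__ == "__main__":
    # Invented figures for two subcorpora in different languages.
    lang_a = [("news", 40_000), ("fiction", 40_000), ("academic", 20_000)]
    lang_b = [("news", 70_000), ("fiction", 20_000), ("academic", 10_000)]
    print(round(frame_mismatch(lang_a, lang_b), 3))
```

Genre is only one dimension of a sampling frame; the same comparison could be repeated for domain or sampling period, and what counts as an acceptable mismatch remains a judgement for the corpus builder.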
Corpus-based Translation and Contrastive Studies
As Laviosa (1998a) observes, ‘the corpus-based approach is evolving, through theoretical elaborations and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation’. Corpus-based translation studies fall into two broad areas: theoretical and practical (Hunston, 2002: 123). In theoretical terms, corpora are used mainly to study the translation process by exploring how an idea in one language is conveyed in another language and by comparing the linguistic features and their frequencies in translated L2 texts and comparable L1 texts. In the practical approach, corpora provide a workbench for training translators and a basis for developing applications like MT and computer-assisted translation (CAT) systems. In this section, we will discuss how corpora have been used in each of these areas.
Parallel corpora are a good basis for studying how an idea in one language is conveyed in another language. Xiao and McEnery (2002a), for example, use an English–Chinese parallel corpus containing 100,170 English words and 192,088 Chinese characters to explore how temporal and aspectual meanings in English are expressed in Chinese. In that study, the authors found that while both English and Chinese have a progressive aspect, the progressive has different scopes of meaning in the two languages. In English, while the progressive canonically (93.5%) signals the ongoing nature of a situation (e.g. John is singing, Comrie, 1976: 32), it has a number of other specific uses that do not seem to fit under the general definition of ‘progressiveness’ (Comrie, 1976: 37). These ‘specific uses’ include its use to indicate contingent habitual or iterative situations (e.g. I’m taking dancing lessons this winter, Leech, 1971: 27), to indicate anticipated happenings in the future (for instance, We’re visiting Aunt Rose tomorrow, p. 29) and some idiomatic uses that add special emotive effect (for instance, I’m continually forgetting people’s names, p. 29) (cf. Leech, 1971: 27–29). In Chinese, however, the progressive marked by zai only corresponds to the first category above, namely, marking the ongoing nature of dynamic situations. As such, only about 58% of situations referred to by the progressive in the English source data take the progressive or the durative aspect, marked either overtly or covertly, in Chinese translations. The
authors also found that the interaction between situation aspect (that is, the inherent aspectual features of a situation, for instance, whether the situation has a natural final endpoint) and viewpoint aspect (for example, perfective versus imperfective) influences a translator’s choice of viewpoint aspect. Situations with a natural final endpoint (around 65%) and situations incompatible with progressiveness (92.5% of individual-level states and 75.9% of achievements) are more likely to undergo viewpoint aspect shift and be presented perfectively in Chinese translations.6 In contrast, situations without a natural final endpoint are normally translated with the progressive marked by zai or the durative aspect marked by -zhe.
Note, however, that the direction of translation in a parallel corpus is important in studies of this kind. The corpus used in Xiao and McEnery (2002a), for example, is not suitable for studying how aspect markers in Chinese are translated into English. For that purpose, a Chinese–English parallel corpus (that is, L1 Chinese plus L2 English) is required.
Another problem that arises with the use of a one-to-one parallel corpus (that is, one containing only one version of translation in the target language) is that the translation only represents ‘one individual’s introspection, albeit contextually and cotextually informed’ (Malmkjær, 1998). One possible way to overcome this problem, as suggested in Malmkjær, is to include as many versions of a translation of the same source text as possible. While this solution is certainly of benefit to translation studies, it makes the task of building parallel corpora much more difficult. It also reduces the range of data one may include in a parallel corpus, as many translated texts are translated once only. Multiple translations of the same work are typically available only for texts such as literary works.
These works tend to be non-contemporary and the different versions of translation are usually spaced decades apart, thus making the comparison of these versions less meaningful.
The distinctive features of translated language can be identified by comparing the translations with comparable L1 texts, thus throwing new light on the translation process and helping to identify translation norms. Laviosa (1998b), for example, in her study of L1 and L2 English narrative prose, finds that translated L2 language has four core patterns of lexical use: a relatively lower proportion of lexical words over function words, a relatively higher proportion of high-frequency words over low-frequency words, relatively greater repetition of the most frequent words and less variety in the words that are most frequently used. Other studies show that translated language is characterised, beyond the lexical level, by normalisation, simplification (Baker, 1993, 1998), explicitation (that is, increased cohesion, Øverås, 1998) and sanitisation (i.e. reduced connotational meanings, Kenny, 1998). As these features are regular and typical of translated English, further research based upon these findings may not only uncover the translational norms or what Frawley (1984) calls the ‘third
code’ of translation but also help translators and trainee translators to become aware of these problems.
McEnery and Xiao (2002), on the basis of a specialised English–Chinese parallel corpus of healthcare, found that the ratio of overt/covert marking of aspectual meanings was exceptionally low in Chinese translations. As Chinese is recognised as an aspect language (cf. Xiao & McEnery, 2004a), the authors hypothesised that the low frequency of aspect markers was atypical of the target L1 language and was attributable to the translated nature of the data in this case. To test this hypothesis, they constructed a comparable L1 Chinese corpus using the same sampling frame and compared the frequencies of two well established perfective aspect markers in the two data sets, namely, the translated Chinese and L1 Chinese. The experiment showed that in the translated Chinese, the two aspect markers occurred 27.32 times per 10,000 words whereas they occurred 62.33 times per 10,000 words in the comparable L1 Chinese data. A cross-tabulation between the word numbers and actual frequency counts showed a log-likelihood ratio of 49.1 for 2 degrees of freedom, which is statistically significant at the level p < 0.001. Therefore, the null hypothesis that the difference in frequencies of aspect markers in the two datasets arose by chance was rejected and the authors were able to claim that translated Chinese is indeed different from L1 Chinese in terms of aspect marking.
The above studies show that translated language bears the marks of translationese. The effect of the source language on the translations is strong enough to make the L2 data perceptibly different from the target L1 Chinese. As such, a unidirectional parallel corpus is a poor basis for cross-linguistic contrast. This problem, however, can be alleviated by a bidirectional parallel corpus (e.g. Ebeling, 1998; Maia, 1998), because the effect of translationese is averaged out to some extent.
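The kind of frequency comparison reported above for the aspect markers rests on two small calculations: normalising raw counts to a common base (here per 10,000 words) and computing a log-likelihood statistic from a cross-tabulation of marker and non-marker tokens in the two corpora. The sketch below shows both. The corpus sizes and raw counts are invented so that the normalised rates fall close to those quoted above; they are not the study’s data, the result will not reproduce the reported value of 49.1, and the function names are hypothetical.

```python
import math


def per_10k(count, corpus_size):
    """Normalise a raw frequency to occurrences per 10,000 words."""
    return count / corpus_size * 10_000


def log_likelihood(count_a, size_a, count_b, size_b):
    """Log-likelihood (G2) for a 2x2 table: marker vs. other tokens, A vs. B."""
    observed = [
        [count_a, size_a - count_a],
        [count_b, size_b - count_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            expected = row_totals[r] * col_totals[c] / total
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / expected)
    return 2 * g2


if __name__ == "__main__":
    # Hypothetical counts, chosen only to give rates near 27 and 62 per 10,000.
    translated = (273, 100_000)   # (marker tokens, corpus size)
    native = (623, 100_000)
    print(per_10k(*translated), per_10k(*native))
    print(log_likelihood(*translated, *native))
```

The statistic is then compared with the chi-square critical value for the relevant degrees of freedom and significance level; the larger the value, the less plausible it is that the two corpora share the same underlying rate.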
In this sense, a well matched bidirectional parallel corpus can become the bridge that brings translation and contrastive studies together. To achieve this aim, however, the same sampling frame must apply to the selection of source data in both languages. Any mismatch of proportion, genre or domain, for example, may invalidate the findings derived from such a corpus. While we know that translated language is distinct from the target L1 language, it has been claimed recently that parallel corpora represent a sound basis for contrastive studies. James (1980: 178), for example, argues that ‘translation equivalence is the best available basis of comparison’ while Santos (1996: i) claims that ‘studies based on real translations are the only sound method for contrastive analysis’. Mauranen (2002: 166) also argues, though not as strongly as James and Santos, that translated language, in spite of its special features, ‘is part of natural language in use, and should be treated accordingly’, because languages ‘influence each other in many ways other than through translation’ (Mauranen, 2002: 165).
While we agree with Mauranen that ‘translations deserve to be investigated in their own right’, as is done in Laviosa (1998b) and McEnery and Xiao (2002), we hold a different view of the value of parallel corpora for contrastive studies. It is true that languages in contact can influence each other, but this influence differs from the influence of a source language on translations with regard to immediacy and scope. The influence of languages in contact is generally gradual (or evolutionary) and less systematic than the influence of a source language on the translated language. As such, translated language is at best an unrepresentative special variant of the target language. If this special variant is confused with the target L1 language and serves alone as the basis for contrastive studies, the results are clearly misleading to teachers and students of second languages, because contrastive studies are ‘typically geared towards second language teaching and learning’ (Teich, 2002: 188). Using parallel corpora alone, for example, McEnery and Xiao (2002) would have come to the misleading conclusion that aspect markers occurred only infrequently in Chinese. As Chinese is an aspect language that relies heavily on aspect to encode temporal information, whereas English encodes both tense and aspect, this false conclusion would inevitably have an adverse effect on materials produced for Chinese learners of English. Nevertheless, parallel corpora can serve as a useful starting point for cross-linguistic contrasts because findings based on parallel corpora invite ‘further research with monolingual corpora in both languages’ (Mauranen, 2002: 182). In this sense, parallel corpora are ‘indispensable’ to contrastive studies (Mauranen, 2002: 182).
With reference to practical translation studies, as corpora can be used to raise linguistic and cultural awareness in general (cf.
Bernardini, 1997; Hunston, 2002: 123), they provide a useful and effective reference tool and a workbench for translators and trainees. In this respect even a monolingual corpus is helpful. Bowker (1998), for example, found that corpus-aided translations were of a higher quality with respect to subject field understanding, correct term choice and idiomatic expressions than those undertaken using conventional resources. Bernardini (1997) also suggests that traditional translation teaching should be complemented with large corpora concordancing (LCC) so that trainees develop ‘awareness’, ‘reflectiveness’ and ‘resourcefulness’, the skills that ‘distinguish a translator from those unskilled amateurs’. In comparison to monolingual corpora, comparable corpora are more useful for translation studies. Zanettin (1998) demonstrates that small comparable corpora can be used to devise a ‘translator training workshop’ designed to improve students’ understanding of the source texts and their ability to produce translations in the target language more fluently. In this respect, specialised comparable corpora are particularly helpful for highly domain-specific translation tasks, because when translating texts of this
type, as Friedbichler and Friedbichler (1997) observe, ‘the translator is dealing with a language which is often just as disparate from his/her native language as any foreign tongue’. Several studies show that translators with access to a comparable corpus with which to check translation problems ‘are able to enhance their productivity and tend to make fewer mistakes’ (Friedbichler & Friedbichler, 1997) when translating into their native language. When translation is from a mother tongue into a foreign language, ‘the need for corpus tools grows exponentially and goes far beyond checking grey spots in L1 language competence against the evidence of a large corpus’ (Friedbichler & Friedbichler, 1997). For example, Gavioli and Zanettin (1997) demonstrate how a very specialised corpus of texts on the subject of hepatitis helps to confirm translation hypotheses and suggest possible solutions to problems related to domain-specific translation.
While monolingual and comparable corpora are of use to translation, it is difficult to generate ‘possible hypotheses as to translations’ with such data (Aston, 1999). Furthermore, verifying concordances is both time-consuming and error-prone, entailing a loss of productivity. Parallel corpora, in contrast, provide ‘[g]reater certainty as to the equivalence of particular expressions’, and in combination with suitable tools (e.g. ParaConc), they enable users to ‘locate all the occurrences of any expression along with the corresponding sentences in the other language’ (Aston, 1999). As such, parallel corpora can help translators and trainees to achieve improved precision with respect to terminology and phraseology and have been strongly recommended for these reasons (e.g. Williams, 1996).
A special use of a parallel corpus with one source text and many translations is that it can offer a systematic translation strategy for linguistic structures which have no direct equivalents in the target language.
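The parallel concordancing described above, that is, locating every occurrence of an expression together with the corresponding sentence in the other language, reduces to a simple filter once a corpus has been sentence-aligned. The following is a minimal sketch, not a description of how ParaConc works; it assumes the aligned corpus is held as a list of (source sentence, target sentence) pairs, and the function name and the example sentences are invented for illustration.

```python
import re


def parallel_concordance(aligned_pairs, expression, ignore_case=True):
    """Return the aligned pairs whose source side contains the expression.

    aligned_pairs is an iterable of (source_sentence, target_sentence).
    The expression is matched as a whole word or phrase, so that a search
    for "liver" does not also return "deliver".
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(r"\b" + re.escape(expression) + r"\b", flags)
    return [(src, tgt) for src, tgt in aligned_pairs if pattern.search(src)]


if __name__ == "__main__":
    # Invented English-Italian pairs, for illustration only.
    corpus = [
        ("The liver was inflamed.", "Il fegato era infiammato."),
        ("Deliver the report tomorrow.", "Consegna il rapporto domani."),
        ("Liver function tests were normal.",
         "I test di funzionalità epatica erano normali."),
    ]
    for src, tgt in parallel_concordance(corpus, "liver"):
        print(src, "|", tgt)
```

A search of this kind returns the whole aligned sentence rather than the equivalent expression itself; identifying which target words correspond to the query still requires either word-level alignment or the translator’s own reading of the hits.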
Boyse (1997), for example, presents a case study of the Spanish translation of the French clitics en and y, where the author illustrates how a solution is offered by a quantitative analysis of the phonetic, prosodic, morphological, semantic and discursive features of these structures in a representative parallel corpus, combined with the quantitative analysis of these structures in a comparable corpus of L1 target language. Another issue related to translator training is translation evaluation. Bowker (2001) shows that an evaluation corpus, which is composed of a parallel corpus and comparable corpora of source and target languages, can help translator trainers to evaluate student translations and provide more objective feedback. Finally, in addition to providing assistance to human translators, parallel corpora constitute a unique resource for the development of MT systems. Starting in the 1990s, the established methodologies, notably the linguistic rule-based approach to MT, have been challenged and enriched by an approach based on parallel corpora (cf. Hutchins, 2003: 511; Somers, 2003: 513). The new approaches, such as example-based MT (EBMT) and
statistical MT, are based on parallel corpora. With EBMT, for example, a new input is matched against the database of already translated texts to extract suitable examples, which are then combined to generate the correct translation (see Somers, 2003; Hutchins, 2003). In addition to automatic MT systems, parallel corpora have also been used to develop CAT tools for human translators, such as translation memories (TM), bilingual concordances and translator-oriented word processors (cf. Somers, 2003; Wu, 2002).
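The matching step shared by EBMT and translation memories can be sketched as a fuzzy lookup: a new segment is compared with the source side of every stored pair and the closest stored example is returned if it is similar enough. The sketch below covers only this retrieval step, not the recombination of examples that a full EBMT system performs. It uses the similarity ratio of `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever measure a real system would use; the function name, the threshold of 0.7 and the memory contents are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(segment, memory, threshold=0.7):
    """Return (score, stored_source, stored_target) for the closest example.

    memory is a list of (source_segment, target_segment) pairs. None is
    returned when no stored source reaches the similarity threshold.
    """
    best = None
    for src, tgt in memory:
        score = SequenceMatcher(None, segment.lower(), src.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, src, tgt)
    if best is None or best[0] < threshold:
        return None
    return best


if __name__ == "__main__":
    # Invented English-French memory, for illustration only.
    memory = [
        ("Press the power button.", "Appuyez sur le bouton d'alimentation."),
        ("Close the cover.", "Fermez le capot."),
    ]
    print(best_match("Press the reset button.", memory))
    print(best_match("zzzz", memory))
```

A retrieved example is only a starting point: in the first query above the stored translation still refers to the power button, so the differing part has to be revised by the translator or, in an EBMT system, replaced by material drawn from other examples.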
Conclusion
In this chapter, we first clarified the confusion surrounding the terminology related to multilingual corpora. It was argued that consistent criteria should be applied in defining types of corpora. For us this means that parallel corpora refer to those that contain collections of L1 texts and their translations while comparable corpora refer to those that contain matched L1 samples from different languages.
The main concern of this chapter was the potential value of parallel and comparable corpora to translation and contrastive studies.7 We maintain that while parallel corpora are well suited to research and teaching in translation studies, they provide a poor basis for cross-linguistic contrast if used as the sole source of data. They should most often be used in conjunction with L1 target and source corpora. These L1 target and source corpora may or may not be comparable. Parallel corpora are undoubtedly a useful starting point for contrastive research, however, and may lead to further research in contrastive studies based upon comparable corpora. In contrast, comparable corpora used alone are less useful for translation studies. Nonetheless, they certainly serve as a reliable basis for contrastive studies. It appears then that a carefully matched bidirectional parallel corpus provides a sound basis for both translation and contrastive studies. Yet the ideal bidirectional parallel corpus will often not be easy, or even possible, to build because of the heterogeneous pattern of translation between languages and genres. So we must accept that, for practical reasons alone, we will often be working with corpora that, while they are useful, are not ideal for either translation or contrastive studies.
In this chapter, we also discussed the pros and cons of the use of different types of corpora in translation and contrastive studies and evaluated proposals for possible solutions to related problems.
It is our belief that as the number of parallel and comparable corpora grows, the corpus-based paradigm will soon enter the mainstream of translation and contrastive studies.
Acknowledgements
This work is supported in part by a grant (Reference No. RES-00023-0553) from the Economic and Social Research Council (ESRC).
Notes
1. Lists of available corpus resources involving different languages, both monolingual and multilingual, can be found at the websites of the Evaluations and Language Resource Distribution Agency (ELDA, http://www.elda.org/rubrique.php3?id_rubrique6tabtxt1.html), TELRI Research Archive of Computational Tools and Resources (TRACTOR, http://tractor.bham.ac.uk/tractor/catalogue.html), Oxford Text Archive (OTA, http://ota.ahds.ac.uk) and Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu/Catalog/byType.jsp).
2. It is interesting to note, however, an earlier corpus-based contrastive study, namely Filipovic (1969), which dates back to the 1960s.
3. An introduction to the EMILLE project can be found at the following URL: http://www.emille.lancs.ac.uk.
4. Readers are advised to refer to Halverson (1998) for an argument for the need for representative parallel corpora.
5. This view has been challenged recently, however, notably by Mauranen (2002: 167), who argues that interpreting translation as ‘the decoding and reencoding of fixed contents, which, presumably, exist outside languages’ is ‘hardly an adequate view of either language or translation’. However, if we interpret the relationship between contents and languages as that between meanings (the carried) and forms (the carrier), this view is quite natural.
6. Situations, telic, individual-level states and achievements are commonly used terms in aspect theory. Readers can refer to Xiao and McEnery (2002b; 2004b) for a more elaborate account of situation aspect.
7. Apart from translation and contrastive studies, Botley et al. (2000) give a further account of other potential uses of parallel and comparable corpora.
References
Aarts, J. (1998) Introduction. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-linguistic Research (pp. ix–xiv). Amsterdam: Rodopi.
Aston, G. (1999) Corpus use and learning to translate. Textus 12, 289–314.
Baker, M. (1993) Corpus linguistics and translation studies: implications and applications. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology: In Honour of John Sinclair (pp. 233–252). Amsterdam: Benjamins.
Baker, M. (1995) Corpora in translation studies: An overview and some suggestions for future research. Target 7, 223–243.
Baker, M. (1999) The role of corpora in investigating the linguistic behaviour of professional translators. International Journal of Corpus Linguistics 4, 281–298.
Barlow, M. (1995) A guide to ParaConc. Houston: Athelstan.
Barlow, M. (2000) Parallel texts and language teaching. In S. Botley, A. McEnery and A. Wilson (eds) Multilingual Corpora in Teaching and Researching (pp. 106–115). Amsterdam: Rodopi.
Bernardini, S. (1997) A ‘trainee’ translator’s perspective on corpora. Paper presented at Corpus Use and Learning to Translate held at Bertinoro, November 1997.
Botley, S., McEnery, A. and Wilson, A. (2000) Multilingual Corpora in Teaching and Researching. Amsterdam: Rodopi.
Bowker, L. (1998) Using specialised native-language corpora as a translation resource: A pilot study. Meta 43 (4).
Bowker, L. (2001) Towards a methodology for a corpus-based approach to translation evaluation. Meta 46 (2), 345–364.
Brown, P., Lai, J. and Mercer, R. (1991) Aligning sentences in parallel corpora. In 29th Annual Meeting of the Association for Computational Linguistics (pp. 169–176). Berkeley, CA.
Buyse, K. (1997) The study of multi- and unilingual corpora as a tool for the development of translation studies: A case study. Paper presented at Corpus Use and Learning to Translate held at Bertinoro, November 1997.
Comrie, B. (1976) Aspect. Cambridge: Cambridge University Press.
Ebeling, J. (1998) Contrastive linguistics, translation and parallel corpora. Meta 43 (4), 602–615.
Faber, D. and Lauridsen, K. (1991) The compilation of a Danish-English-French corpus in contract law. In S. Johansson and A.B. Stenström (eds) English Computer Corpora. Selected Papers and Research Guide (pp. 235–243). Berlin: Mouton de Gruyter.
Filipovic, R. (1969) The choice of the corpus for the contrastive analysis of Serbo-Croatian and English. In The Yugoslav Serbo-Croatian-English Contrastive Project B Studies 1 (pp. 37–46). Institute of Linguistics, University of Zagreb.
Frawley, W. (1984) Prolegomenon to a theory of translation. In W. Frawley (ed.) Translation: Literary, Linguistic and Philosophical Perspectives (pp. 159–175). London & Toronto: Associated University Press.
Friedbichler, I. and Friedbichler, M. (1997) The potential of domain-specific target-language corpora for the translator’s workbench. Paper presented at Corpus Use and Learning to Translate held at Bertinoro, November 1997.
Gavioli, L. and Zanettin, F. (1997) Comparable corpora and translation: a pedagogic perspective. Paper presented at Corpus Use and Learning to Translate held at Bertinoro, November 1997.
Gellerstam, M. (1996) Translations as a source for cross-linguistic studies. In K. Aijmer, B. Altenberg and M. Johansson (eds) Languages in Contrast: Papers from a Symposium on Text-based Cross-linguistic Studies, Lund, March 1994 (pp. 53–62). Lund: Lund University Press.
Granger, S. (1996) From CA to CIA and back: An integrated approach to computerised bilingual and learner corpora. In K. Aijmer, B. Altenberg and M. Johansson (eds) Languages in Contrast: Papers from a Symposium on Text-based Cross-linguistic Studies, Lund, March 1994 (pp. 38–51). Lund: Lund University Press.
Halverson, S. (1998) Translation studies and representative corpora: Establishing links between translation corpora, theoretical/descriptive categories and a conception of the object of study. Meta 43 (4).
Hartmann, R. (1985) Contrastive textology: Towards a dynamic paradigm for interlingual lexical studies? Language and Communication 5, 107–110.
Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Hutchins, J. (2003) Machine translation: General overview. In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics (pp. 501–511). Oxford: Oxford University Press.
James, C. (1980) Contrastive Analysis. London: Longman.
Johansson, S. (1998) On the role of corpora in cross-linguistic research. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-linguistic Research: Theory, Method and Case Studies (pp. 3–25). Amsterdam: Rodopi.
Johansson, S., Ebeling, G. and Hofland, K. (1996) Coding and aligning the English–Norwegian parallel corpus. In K. Aijmer, B. Altenberg and M. Johansson (eds) Languages in Contrast: Papers from a Symposium on Text-based Cross-linguistic Studies, Lund, March 1994 (pp. 87–112). Lund: Lund University Press.
Johansson, S. and Hofland, K. (1994) Towards an English–Norwegian parallel corpus. In U. Fries, G. Tottie and P. Schneider (eds) Creating and Using English Language Corpora (pp. 25–37). Amsterdam: Rodopi.
Johansson, S. and Oksefjell, S. (1998) Corpora and Cross-linguistic Research: Theory, Method and Case Studies. Amsterdam: Rodopi.
Kenny, D. (1998) Creatures of habit? What translators usually do with words? Meta 43 (4), 515–523. On WWW at http://www.erudit.org/revue/meta/1998/v43/n4/003302ar.pdf. Accessed 18.6.07.
Laviosa, S. (1997) How comparable can ‘comparable corpora’ be? Target 9, 289–319.
Laviosa, S. (1998a) The corpus-based approach: A new paradigm in translation studies. Meta 43 (4), 474–479.
Laviosa, S. (1998b) Core patterns of lexical use in a comparable corpus of English narrative prose. Meta 43 (4), 557–570.
Leech, G. (1971) Meaning and the English Verb. London: Longman.
Maia, B. (1998) Word order and the first person singular in Portuguese and English. Meta 43 (4), 589–601.
Malmkjær, K. (1998) Love thy neighbour: will parallel corpora endear linguists to translators? Meta 43 (4), 534–541.
Mauranen, A. (2002) Will ‘translationese’ ruin a contrastive study? Languages in Contrast 2 (2), 161–186.
McEnery, A. (2003) Corpus linguistics. In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics (pp. 448–463). Oxford: Oxford University Press.
McEnery, A. and Oakes, M. (1995) Sentence and word alignment in the CRATER project: Methods and assessment. In S. Warwick-Armstrong (ed.) Proceedings of the Association for Computational Linguistics Workshop SIG-Dat Workshop. Dublin.
McEnery, A. and Wilson, A. (1996) Corpus Linguistics (1st edn). Edinburgh: Edinburgh University Press.
McEnery, A. and Wilson, A. (2001) Corpus Linguistics (2nd edn). Edinburgh: Edinburgh University Press.
McEnery, A. and Xiao, Z. (2002) Domains, text types, aspect marking and English–Chinese translation. Languages in Contrast 2 (2), 211–231.
Øverås, S. (1998) In search of the third code. An investigation of norms in literary translation. Meta 43 (4), 571–588.
Park, B. (2001) Introducing Korean National Corpus. Talk presented at Corpus Research Group, Lancaster University, 19 November 2001. On WWW at http://www.sejong.or.kr/english/index.html. Accessed 18.6.07.
Pearson, J. (1998) Terms in Context. Amsterdam: Benjamins.
Piao, S. (2000) Sentence and word alignment between Chinese and English. PhD thesis, Lancaster University.
Piao, S. (2002) Word alignment in English–Chinese parallel corpora. Literary and Linguistic Computing 17 (2), 207–230.
Santos, D. (1996) Tense and aspect in English and Portuguese: A contrastive semantical study. PhD thesis, Universidade Tecnica de Lisboa.
Somers, H. (2003) Machine translation: Latest developments. In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics (pp. 512–528). Oxford: Oxford University Press.
Teich, E. (2002) System-oriented and text-oriented comparative linguistic research: Cross-linguistic variation in translation. Languages in Contrast 2 (2), 187–210.
Teubert, W. (1996) Comparable or parallel corpora? International Journal of Lexicography 9 (3), 238–264.
Williams, A. (1996) A translator’s reference needs: Dictionaries or parallel texts. Target 8 (2), 277–299.
Wu, D. (2002) Conception and application of computer-assisted translation. Paper presented at International Symposium on Contrastive and Translation Studies between Chinese and English, Shanghai. August 2002.
Xiao, Z. and McEnery, A. (2002a) A corpus-based approach to tense and aspect in English–Chinese translation. Paper presented at International Symposium on Contrastive and Translation Studies between Chinese and English, Shanghai. August 2002. Published in Pan, W., Fu, H., Luo, X., Chase, M. and Walls, J. (eds) (2005) Translation and Contrastive Studies (pp. 114–157). Shanghai: Shanghai Foreign Language Education Press.
Xiao, Z. and McEnery, A. (2002b) Situation aspect as a universal aspect: Implications for artificial languages. Journal of Universal Language 3 (2), 139–177.
Xiao, Z. and McEnery, A. (2004a) Aspect in Mandarin Chinese: A Corpus-based Study. Amsterdam: John Benjamins.
Xiao, Z. and McEnery, A. (2004b) A corpus-based two-level model of situation aspect. Journal of Linguistics 40 (2), 325–363.
Zanettin, F. (1998) Bilingual comparable corpora and the training of translators. Meta 43 (4), 616–630.
Zhou, Q. and Yu, S. (1997) Annotating the contemporary Chinese corpus. International Journal of Corpus Linguistics 2 (2), 239–258.
Chapter 3
Universal Tendencies in Translation ANNA MAURANEN
Introduction
When we discuss the influence of one language on another, the question often arises of what role translation might play in this. Do translations smuggle in features from the source language, gradually weakening the specificity of the target language? And is it particularly problematic to translate from a dominant language like English into smaller languages? The traditional view of translations as victims of strong interference from the source language or source text is evident in the statements of linguists and translation scholars alike. While the corpus linguist Wolfgang Teubert (1996: 247) states that ‘[r]ather than representing the language they are written in, they give a mirror image of their source language’, the translation scholar Toury puts a similar idea in this way:
The second language which may be said to be activated during the attempted production of a translated utterance in a certain target language [. . .] is not, as a rule, retrieved from the speaker’s ‘knowledge’ but is directly available to him in the source utterance itself. (Toury, 1986: 82)
A new angle on the language of translations has been opened up by Baker (1993), who suggests that all translations are likely to show certain linguistic characteristics simply by virtue of being translations. She calls these general characteristics ‘translation universals’.
The search for translation universals began in the mid-1990s, with roots in both Translation Studies and Corpus Linguistics. The fundamental influence came from descriptive Translation Studies, largely based on Toury’s (1980, 1995) work, which shifted the focus in translation research from the relationship between source and target texts to translations themselves. The second major influence originated in linguistics: the extension of electronic corpora to translations offered the prospect of seeing large-scale patterning in translated language in an unprecedented way. Although corpora compiled on the basis of translated language had already been used in Contrastive Linguistics, translations had not been investigated for their own sake, but for the purpose of comparing two languages.
The traditions of descriptive Translation Studies and Corpus Linguistics were first brought together in Baker’s work. She suggested that this combination enabled an investigation of linguistic features that typically occur in translated rather than non-translated texts, and are not dependent on the specific language pairs involved. Baker (1993: 243) defined translation universals as
. . . universal features of translation, that is features which typically occur in translated texts rather than original utterances and which are not the result of interference from specific linguistic systems.
Baker saw that electronic corpora provided a testing ground on a new scale for hypotheses concerning the general characteristics of translations that had been put forward earlier on the basis of small-scale studies. Such hypotheses concerned features like a tendency towards explicitation, disambiguation, simplification, growing grammatical conventionality and a tendency to over-represent typical features of the target language, as well as the tendency to reduce or remove repetitions (for a comprehensive list, see for example Laviosa, 2002).
Corpora of different kinds have been used to study translations, and even though the terminology concerning corpus types is somewhat unstable, the most established distinction is this: a ‘parallel corpus’ consists of texts and their translations, and a ‘comparable corpus’ of matched subcorpora compiled on similar principles in one language, where one subcorpus comprises translations, and the other comparable texts written originally in the same language. A ‘bilingual corpus’ is an umbrella term for any corpus involving two languages.
Empirical research on translation universals is still relatively new, but well under way now that comparable corpora have been completed. Recent findings have thrown new light on older hypotheses and suggested more potential universals. They also reveal a more complex picture of the phenomena and processes involved than has perhaps been appreciated. It has been noticed, for example, that the earlier hypotheses were sometimes contradictory or overlapping. In addition, some hypotheses seem to make predictions about the relationship between sources and their translations, while others have been concerned with translated and non-translated texts; the two have not always been kept conceptually clearly apart. Thus, for example, many hypotheses have process-related names, like ‘simplification’ and ‘explicitation’, even if they do not refer to differences between source texts and their translations.
The idea that Translation Studies should set out to find general laws and regularities is not new in itself. At the beginning of the 1980s, Toury made this the fundamental task of descriptive Translation Studies. Similar views have been expressed since: for example, Chesterman (2004) envisages
Translation Studies as a rigorously scientific pursuit that seeks generalisations like any other science. In theoretical discussions on the nature of translated language, conceptualising translation as a language form of its own, not reducible to either the source or the target, is a view that is interesting and highly relevant to the search for universals. Notions of this kind have been put forward in terms of translations as ‘hybrid language’ (Schäffner & Adab, 2001; Trosborg, 1997), or as a ‘third code’ (Frawley, 1984). There have also been suggestions to the effect that the most fundamental universal processes, or the causes behind linguistic manifestations, might be cognitive (for example, Tirkkonen-Condit, 2004), and that these should be given a central place in the research. The main focus of the studies undertaken so far, however, has been on linguistic features, and this is also clearly reflected in the findings.
Universals
A Controversial Concept
Since the introduction of the idea of universals, the subject has been much debated in Translation Studies. Many scholars have hastened to express their doubts about the feasibility of such features, or dispute the concept, while most agree that it is nevertheless a very interesting idea. To many, the notion of ‘universals’ has clearly been too radical, and alternative terms and concepts like ‘laws’ and ‘tendencies’ have appeared more acceptable. The disputes in part reflect different understandings of the term ‘universal’, but even so there seems to be more conceptual than terminological unity in the field.
One of the most influential scholars advocating the search for general laws of translation, but avoiding the term ‘universals’, is Toury, who formulated some general laws (‘the law of growing standardization’, ‘the law of interference’) (Toury, 1995). More recently (Toury, 2004) he has characterised such laws as probabilistic propositions or conditioned regularities. In his view, the chief value of such laws lies in their explanatory power. On similar lines, Chesterman has argued that the quest for generalities which characterises all science should also be one of the prime tasks of Translation Studies. Accepting universals as one possible route to high-level generalisations, he differentiates two types: one relates to the process from the source to the target text (what he calls ‘S-universals’), while the other (‘T-universals’) compares translations to other target-language texts (Chesterman, 2004). Chesterman’s distinction helps clarify the discussion, but actual empirical research does not seem to reflect this division equally clearly, either in research design or in results. The S-type has been less studied with large databases than the T-type, although parallel corpora are well suited to investigating the relations between translations and their sources.
One of the strong objections to the idea of universal laws has come from translation history: Tymoczko (1998) argues that the very idea of making universal claims about translation is inconceivable, as there is no way in which we could capture translations from all times and all languages. This is undoubtedly true, but at the same time it must be pointed out that we need not have access to all translations that have ever existed to postulate general laws: any research field needs to accept the fact that access to data is limited, and the pursuit of generalities is based on the data that can be accessed. Nevertheless, translation history presents important reminders of the limitations of the data we can use. Without a historical perspective, it is much harder to appreciate that the status of translations relative to spontaneous texts in a given language can have undergone radical changes. In many languages and countries, translations have actually preceded domestic texts, provided models for new genres and generated linguistic innovations where target languages have had, for example, lexical gaps (see, for example, Paloposki, 2005). In such situations, it is difficult to draw clear borderlines between translations, adaptations and texts otherwise heavily influenced by foreign sources, and clear comparisons of the kind assumed in parallel or comparable corpus studies are not really possible. Considerations of this kind clearly impose limitations on the claims concerning translation universals, but fuzzy categories and boundaries are typical rather than exceptional for many objects of study in the humanities and social sciences. In any case, Translation Studies rests on the idea that translations exist and are sufficiently identifiable to warrant research.
In linguistics, universals have been investigated and debated for several decades.
It has become generally accepted that a fruitful study of language universals needs to take into account different kinds of general tendencies shared by a large number of languages, not only ‘absolute’ universals, that is, features shared by every human language. As Greenberg et al. (1966: xv) already put it in their classic ‘Memorandum concerning language universals’: ‘Language universals are by their very nature summary statements about characteristics or tendencies shared by all human speakers.’ It seems that Translation Studies would also benefit from adopting a similar extended view, which includes general tendencies. Moreover, the study of translation universals could also usefully follow the lead of linguistics and distinguish between universals that can be traced back to general human cognitive capacities and those that relate linguistic structures to the functional uses of language. The term ‘universals’ does not, then, necessarily refer only to absolute laws, which are true without exception. Rather, most of the suggested universal features are general or law-like tendencies, or high probabilities of occurrence. Because their empirical study is a relatively new domain,
hypotheses have tended to be very general, even vague, of the type ‘all translations are X’ or ‘all translations have a tendency to Y’. As empirical work develops, hypotheses are becoming more specific and reflect a more complex reality than the first generation of hypotheses would seem to suggest.
How Should We Study Translation Universals?
It is not only the concept of translation universals that has been questioned and debated, but also the possible or desirable nature of research approaches that might be used to investigate their existence. While it is clear that the chief value of research into this topic lies in deepening a general understanding of translation and contributing to translation theory, it is equally clear that methodological monism is unlikely to take us very far. The discussion so far has touched upon different possibilities of methodological choice, although the bulk of actual empirical research has been of a relatively uniform kind. The approaches featuring in these discussions have been linguistic, cognitive and social. The first two relate to the categories of translation product and translation process. The third draws on sociocultural research, which has often been hostile to general laws, but may conceivably contribute to generalisations as well, at least by pointing out their limits.
Most research into translation universals has been linguistically oriented. The study of universals originated in corpus studies: corpus methods are well suited for the discovery of large-scale tendencies, and can therefore be expected to continue to play a major role in universals research. Corpora have benefited traditional parallel text research by providing a much larger number of examples than earlier. Large numbers help us get beyond individual translators’ idiosyncrasies, reveal typicality as well as variation, and help detect the truly unique. A radical new departure has been the comparable translation corpus. Because these corpora do not allow access to source texts, they really focus attention on target texts and target languages. They have also made it possible to compare translations from different sources into one target language. The similarities that are discovered between translations from different sources but which differ from target language originals provide candidates for a number of universals; conversely, characteristics which single out one source language are likely to be caused by interference. It is important that comparable corpora include a variety of source languages, in order to distinguish that which is common to translations in general, and to make possible a comparison of different target languages to discover that which is specific to a language pair.
A cognitive approach has been invoked as a model for explaining the linguistic characteristics discovered by other means. So far the
explanations have remained at the level of theoretical discussion, but cognitive models give rise to other kinds of empirical work, notably think-aloud protocols (TAPs) (see, for example, Alves, 2003; Jääskeläinen, 1999). The translator’s cognitive processes are seen as the mediating element which accounts for the shifts that are linguistically manifest in the text. In explaining corpus findings, some scholars make reference to cognitive concepts like the translator’s bilingual mental dictionary, while others seem to be searching for suitable cognitive concepts even if they describe the processes at a more common-sense level, and discuss translators’ tendencies to translate word for word, or to translate that which is stimulated by the source text. There is clearly a need to model linguistic findings in process terms.
Sociocultural and historical approaches to translation research have tended to take a critical stance to translation universals (see Paloposki, 2002; Tymoczko, 1998). It is generally typical of sociohistorical views to emphasise the particular over the general, and accordingly this research has problematised excessively simple notions of translation. In the light of historical examples it becomes clear that there are cases where it may be very difficult to identify a source language, let alone a source text. In addition, sometimes considerable liberties have been taken with source texts, resulting in major transformations that rule out direct comparisons at a linguistic level. As noted earlier, translations have also been instrumental in bringing new genres and text types into new cultures, which means that domestic counterparts did not yet exist; thus the foundation for a comparative study, the existence of pairs of texts, does not materialise. Social constraints may weaken the comparability of texts even if source texts and their translations are identifiable without problems.
The social status of translations is not homogeneous but genre-dependent. Another important factor is what Toury (1995) calls ‘preliminary norms’, that is, translation policies which determine, for example, the selection of texts for translation. It is also true that translations are unevenly distributed across the genres of the target culture, and across source languages: for a given language pair, more tends to get translated in one direction than the other, and translations in one direction can be differently biased for social factors like prestige, or the date of the original. The ensuing dilemma is that the comparability of texts conflicts with the objective of reflecting prevailing preliminary norms, although an ambitious corpus would wish to incorporate both criteria. Social factors of these kinds are possible sources of systematic bias in large databases, and impose limitations on their comparability. However, while such constraints must be borne in mind, they do not invalidate the search for generality, as it is inconceivable that we would be able to compile ideally homogeneous databases of real language and real translations. A
search for generality cannot assume perfect homogeneity of the research object; what it strives for is the search for what is common and shared within the variation.
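The comparable-corpus reasoning described earlier in this section (a feature shared by translations from several source languages but different from target-language originals is a candidate universal, while a feature that singles out one source language points to interference) can be sketched in a few lines. The function name `classify_feature`, the 20% tolerance and all frequencies below are invented for illustration; an actual study would replace the fixed tolerance with a significance test and would control for genre.

```python
def classify_feature(original_rate, translated_rates, tolerance=0.2):
    """Label a feature from normalised frequencies (e.g. per 100,000 words).

    original_rate    -- rate in non-translated target-language texts (> 0)
    translated_rates -- dict mapping source language to the rate found in
                        translations from that language
    tolerance        -- relative difference treated as "no difference";
                        an arbitrary assumption standing in for a proper
                        statistical test
    """
    def direction(rate):
        relative = (rate - original_rate) / original_rate
        if relative > tolerance:
            return "over"
        if relative < -tolerance:
            return "under"
        return "same"

    directions = {lang: direction(rate)
                  for lang, rate in translated_rates.items()}
    distinct = set(directions.values())

    if distinct == {"same"}:
        label = "no difference from originals"
    elif len(distinct) == 1:
        # Every source language deviates in the same direction.
        label = "candidate universal"
    else:
        # Only some source languages deviate, or they deviate differently.
        label = "possible interference"
    return label, directions


if __name__ == "__main__":
    # Hypothetical figures, not results from any corpus.
    print(classify_feature(100, {"English": 140, "Russian": 135, "German": 150}))
    print(classify_feature(100, {"English": 160, "Russian": 102, "German": 98}))
```

The design makes the role of multiple source languages visible: with only one source language in the dictionary, the two labels cannot be told apart, which is why the comparable corpus needs translations from several sources.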
Hypothesised Universals of Translation
A number of suggestions concerning features that might be common to all translations were made by scholars before any discussion arose about universals. Such assumptions were first gathered together by Baker in her seminal paper (1993). Usually these hypotheses arose from small-scale studies of language pairs. There are overlaps, and some have become more commonly accepted than others. In the following I shall illustrate the discussion and ongoing research by looking at three among the relatively established hypotheses, then comment on three more recent ones. Clearly, the first hypotheses in the field originated in comparisons made not only on limited data but also in circumstances that were arguably not comparable. Nevertheless, some have been supported by more recent large-scale data, while others have turned out to be based on shakier ground. Among the most established and widely studied hypotheses are ‘explicitation’, ‘simplification’ and ‘conventionalisation’. Three more recent hypotheses have emerged from corpus research in the early 2000s: ‘under-representation of unique target-language items’, ‘untypical collocations’ and ‘source-language interference’.
Explicitation
To start with the least controversial case, the most widely accepted hypothesis which tends to receive support from empirical studies is known as ‘explicitation’. This hypothesis predicts that translations are more explicit than source texts. That is, the translation process tends to add information and linguistic material to the text being translated. In the light of the distinction made above between S- and T-universals, explicitation would seem to fall most naturally into the S-type, but recently it has also been studied as a T-universal, and has also been supported in comparisons between translations and similar texts written in the same language.
The term comes from Blum-Kulka (1986), who noticed that translations tend to contain more redundancy and explicit cohesion than their source texts. Blum-Kulka’s hypothesis has since been supported by a number of studies at different levels of language. Baker (1996) observed that translations tend to contain more explanatory lexis (cause, due to, lead to . . .) and connectors, more syntactic repetition and more textual extensions such as parentheses, dashes, footnotes and paraphrases of metaphors than non-translations. Olohan and Baker (2000) noticed a related tendency with the optional that in reporting clauses: translations tended
to use it more often, while non-translations tended to omit it more (he said that it’s all right rather than he said it’s all right). Johansson (1998) found that translations between English and Norwegian in either direction contained more words than their originals. It has also commonly been observed that culture-specific phenomena tend to be replaced by better known superordinates (shilling becoming coin), or explanatory paraphrases. Some findings suggest that there might be genre differences, so even though the number of connectors in academic texts has been found to increase in translation (Mauranen, 2000), this has not been found in children’s literature (Puurtinen, 2004). Another difference in findings might also be accounted for by genre differences: Puurtinen (1995) found more non-finite clauses in translated than in non-translated children’s literature. This was not confirmed by Eskola (2002), who discovered that clause type and source language influenced the choices of finite versus non-finite structures. Eskola’s data were also literary, but the texts were not written for children.
Explicitation has thus been found at different levels of language (syntax, lexis and text) and also in culture-specific expressions. Nevertheless, there is variation even in these results, which could be explained in terms of the level of language studied, or the genre of the texts. It is also likely that other, text-external factors come into play, such as temporal or cultural distance between source and target: clearly, ‘remote’ texts can be expected to require more paraphrasing and extra clarification than contemporary texts from neighbouring cultures.
Simplification
Unlike the explicitation hypothesis, the simplification hypothesis has been neither generally supported nor refuted. The hypothesis predicts that translated language is ‘simplified’ as compared to non-translated language.
As a process term, ‘simplification’ would seem to refer more appropriately to S-universals than T-universals, but more recently it has in fact mainly been applied to the latter setting. This was the case in Laviosa-Braithwaite’s (1996) study, one of the very first to put hypotheses on translation universals to empirical test. Her findings supported the hypothesis of simplification in lexis. The study was based on a relatively small but carefully constructed comparable corpus comprising original English newspaper texts and corresponding translations from several source languages, and it compared word frequency distributions. The findings indicated that translations had a lower lexical density and a proportional over-representation of the most frequent lexis, while the type-token ratio was not lower. The simplification hypothesis has subsequently been contested by results on syntax by Eskola (2002), and on lexis as well as syntax by Jantunen (2001, 2004a,b). My own findings on collocations (Mauranen, 2000) seem to point to a clear tendency against simplification. Patterns of lexical combination indicate untypical combinatory tendencies
in translations, which suggest a wider rather than a narrower use of the resources of the target language. This would seem to support Gellerstam’s (1996) earlier findings that lexical frequencies in translations are untypical in comparison with originals in the same language. A new feature in my findings was that patterns of co-occurrence were also untypical. Methodological differences may play a role in the conflicting findings. For example Laviosa-Braithwaite compared overall word frequencies, whereas Jantunen looked into a selection of individual items in greater detail. Detailed analyses tend to show that the behaviour of different linguistic items is not identical; therefore we need to consider factors other than overall tendencies of a very general kind. The increased complexity and detail of the hypotheses needed for the next stage of universals research is also clear from studies of syntax and connectors: particular structures or items manifest patterning which may be interpreted as increased simplicity, while others do not. However, it would seem that a major problem in interpreting the results is that simplification in itself is a multilayered phenomenon: what is simple at one level of language use may cause complexity at another. For example, it is possible that my findings are compatible with Laviosa-Braithwaite’s in that the individual words that participate in unusual collocations may themselves be more frequently found in translations than non-translations but the variation in collocations suggests that word frequency counts alone cannot capture the whole picture of lexical simplification. Word combinations are more natural units of meaning than individual items on their own. Also, simplifying sentence structures by converting sentences into simple main clauses with few subordinate clauses may lead to greater complexity at text level, reducing its coherent flow and making it seem fragmented and hard to follow. 
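The lexical measures at issue in this debate can be made concrete with a small sketch. The Python below is illustrative only and is not the procedure used in any of the studies cited: the tokeniser is naive, the function-word list `FUNCTION_WORDS` is a tiny hypothetical stand-in for a full inventory of grammatical words, and lexical density is taken here simply as the share of tokens that are not function words. For a single text it computes the type-token ratio, the lexical density and the share of running text covered by the most frequent word forms.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list standing in for a full inventory of
# grammatical (function) words; a real study would use a complete list.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and",
                  "that", "it", "is", "was", "he", "she"}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_profile(text, top_n=3):
    """Return basic lexical measures for one text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    total = len(tokens)
    # Tokens that are not in the function-word list count as lexical words.
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    # Share of all tokens accounted for by the top_n most frequent forms.
    top_share = sum(c for _, c in counts.most_common(top_n)) / total
    return {
        "tokens": total,
        "types": len(counts),
        "type_token_ratio": len(counts) / total,
        "lexical_density": len(lexical) / total,
        "top_share": top_share,
    }
```

Because the type-token ratio falls as texts grow longer, figures of this kind are comparable only across samples of similar size, or after some form of standardisation.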
Similar conflicting processes have been commonly observed in second-language users, who appear to compensate for their simplification of target language morphology by employing periphrastic syntactic means, thus making the syntax more complex.

Conventionalisation

It has frequently been observed by people who study or teach translation that translations have some kind of a tendency towards conventionality, or ‘normalisation’, as it is sometimes called. It is akin to a general conservatism or caution, often attributed to translations, which means that translations are supposed to avoid margins or periphery and remain safely within the mainstream. Conventionalisation may be likened to Toury’s notion of ‘the law of growing standardisation’. With this, Toury (1995: 268) referred to a tendency of translations to modify relations in the source text in favour of more habitual options in the target-language repertoire, or to use his more technical terms, the propensity of translations to replace ‘textemes’
with ‘repertoremes’. The hypothesis of conventionalisation seems to overlap to an extent with simplification; both regard the noticeably high lexical frequencies of certain items as supporting evidence for their hypothesis. Translations have been reported to use generally unmarked grammar, clichés, and typical, common lexis instead of the unusual or the unique. They are said to replace dialect with standard language, normalise punctuation and exaggerate target-language features. Moreover, support for conventionalisation has also been seen in the greater proportions of n-grams, or recurrent multiword clusters, in translations as compared to non-translations (see, for example, Nevalainen, 2005). Kenny (1999) carried out a corpus-based study on conventionalisation in translations between English and German literary texts that focused on hapax legomena. She found some evidence in support of the hypothesis, but as she also found considerable variation among individual translators as well as in one translator’s strategies in different contexts, she concluded that the tendency to avoid hapaxes in translations is not likely to be a universal tendency. Conventionalisation is a controversial hypothesis, and tendencies in the opposite direction have also been found: Toury (1995: 208) himself refers in a general way to ‘the well-documented fact that in translations, linguistic forms and structures often occur which are rarely, or perhaps even never encountered in utterances originally composed in the target language’. Mauranen (2000) reported untypical collocations in translations, and Jantunen (2004a,b) found translations to manifest greater variety among the collocates of frequent intensifiers. Altogether, the tendency of translations to favour conventional, middle-of-the-road language has not been demonstrated, and there is also plenty of anecdotal evidence of linguistic oddity in translations.
Moreover, it is not always clear whether an S- or a T-type relationship is meant when the conventionality of translations is discussed. The concept is in need of more differentiation and more precise definitions, to ensure that empirical work is based on specific predictions. It is also necessary to make a clear distinction between conventionality and simplification.
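Two of the quantities mentioned above, hapax legomena and recurrent multiword clusters, are straightforward to count once a text has been tokenised. The following Python sketch rests on stated assumptions rather than on the procedures of Kenny (1999) or Nevalainen (2005): the input is a ready-made list of tokens, a hapax is any form occurring exactly once in that list, and a cluster counts as recurrent when it occurs at least `min_freq` times. The function names are hypothetical.

```python
from collections import Counter


def hapax_ratio(tokens):
    """Share of word types that occur exactly once in the token list."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("token list is empty")
    hapaxes = [word for word, freq in counts.items() if freq == 1]
    return len(hapaxes) / len(counts)


def recurrent_ngrams(tokens, n=3, min_freq=2):
    """Return the n-word clusters that occur at least min_freq times."""
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: freq for gram, freq in grams.items() if freq >= min_freq}
```

A lower hapax ratio or a larger stock of recurrent clusters in translations than in comparable originals would be compatible with conventionalisation, but, as the variation among individual translators suggests, such totals need to be read alongside the individual items behind them.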
Unique items in translations

While Kenny’s (1999) study discussed above was concerned with the proportions of items occurring only once in a text, Tirkkonen-Condit’s (2004) hypothesis of ‘under-representation of target language unique items’ extends the horizon from individual texts to general tendencies in a language. She suggests that features which tend to be ‘untranslatable’, unique to the target language, or in any case do not occur in the source language, tend to be proportionally under-represented in translations.
Features such as pragmatic particles or rare lexicalisations are also sometimes regarded as ‘untranslatable’. Tirkkonen-Condit herself (2004) studied Finnish clitics and Finnish verbs containing the semantic feature of ‘sufficiency’ or ‘being enough’. The research strongly supports her hypothesis: these elements actually occurred more rarely in translations. The explanation she offers is that these items are missing from a translator’s bilingual ‘mental dictionary’, as they have no counterpart in the source language which would make up the other half of the bilingual entry, as it were. They are therefore not very likely to be chosen in translation, although they would have a greater chance in an ‘ordinary’ situation of writing an original text in the target language. This hypothesis relates to a pure T-universal. It also evokes a connection between the linguistic and the cognitive in translation. The unique items hypothesis has been supported by studies from lexis (Kujamäki, 2004), syntax (Eskola, 2002) and the interface of syntax and pragmatics (Mauranen & Tiittula, 2005). Kujamäki’s study (2004) was concerned with back-translations of uniquely Finnish items related to snow. A Finnish text with these items was translated into German and English, with the target items replaced by paraphrases. Student translators then translated the German and English versions into Finnish, and showed a strong tendency to translate the paraphrases rather than use the unique Finnish items. Eskola (2002) noticed that certain non-finite constructions in the target language were under-represented in translations when the source language did not possess a directly corresponding structure. Mauranen and Tiittula (2005) studied both parallel and comparable corpora, contrasting original Finnish texts with translations from English and German, and found that the Finnish ‘zero person’ structure was rare in translations even though it is highly characteristic of Finnish spontaneous usage.
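Claims of proportional under-representation rest on comparing how often an item occurs in translated and in non-translated text relative to the size of each corpus. The Python sketch below shows one way such a comparison might be set up. It is an assumption for illustration, not the method of the studies cited: frequencies are normalised per million words, and the difference is assessed with the two-corpus log-likelihood statistic that is widely used in corpus linguistics. The function names are hypothetical.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count / corpus_size * 1_000_000


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood for one item.

    count_a and count_b are the item's raw frequencies in corpora A and B;
    size_a and size_b are the corpus sizes in running words.
    """
    total = count_a + count_b
    # Expected frequencies if the item were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2
```

The statistic indicates only that two frequencies differ more than chance would predict; the direction of the difference has to be read from the normalised frequencies themselves.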
This hypothesis is very much in line with Gellerstam’s (1996) earlier findings about Swedish: Swedish discourse particles were under-represented in translations from English, where they had no direct counterpart. So far this hypothesis has been strongly supported, but apart from Gellerstam’s study, mainly with evidence from Finnish. It would benefit from further research in a wider variety of target languages, and it would be interesting to see whether Tirkkonen-Condit’s results on verbs of sufficiency would be different in translations between Finnish and Swedish, which share some of these verbs. The verbs ought to find their equivalents easily between these two languages, but Swedish translated from English should behave much like Finnish.

Interference

Interference or transfer from the source language is one of the most common (and common-sensical) assumptions about the linguistic
characteristics of translations. It was first formulated in scientific terms by Toury (1995), who posited a ‘law of interference’ as one of the general laws of translation. In the study of universals, however, interference has had a problematic status, as Baker’s initial definition seems to exclude bilingual interference completely. It has nevertheless been rehabilitated and a few scholars have postulated this as a possible universal, for example Eskola (2002, 2004), Laviosa-Braithwaite (1996) and Mauranen (2004). One difficulty with interference has been the tradition of using the term in a negative sense, which makes it hard to appreciate its effect on translations in a non-judgemental way. Since the emergence of Toury’s descriptive Translation Studies (Toury, 1980, 1995), translation research has tried to shed its strongly normative traditions; the tradition nevertheless lingers on, in part perhaps because translations need to maintain a certain quality in order to constitute acceptable texts in the target culture. Sometimes ‘transfer’ has been used for a non-negative effect of the source on the target, but the terms have been used loosely and the difference is hard to tell without resorting to normative judgement. Eskola (2004) argues that we must start using the term ‘interference’ in a neutral sense, because it is an important feature of translation. Eskola’s (2002) research on non-finite constructions has shown that translations tend to over-represent those structures that have a straightforward counterpart in the source language. Lexical studies by Jantunen (2004b) and Mauranen (2004) used subcorpora with 10 source languages as well as subcorpora with a single source language; single-source translations differed from those representing a variety of source languages. These findings thus support the interference hypothesis. 
Nevertheless, translations taken as a whole turned out to be more similar to each other than to target language originals, which in turn is a clear indication of the existence of some independent feature of ‘being a translation’, which cannot be reduced to the effect of the source language on the target. Even if an interference or transfer hypothesis gains acceptance, a conceptual confusion seems to prevail with respect to the scope of such a hypothesis. In theory, the hypothesis can be postulated as an effect of the source language, the source text or both. It will be important in further studies to establish which of the assumptions is being tested, and to design the research accordingly. Intuitively, it would seem the most satisfying solution to expect both the source text and the source language system to be involved. But while it is a relatively straightforward task to show with parallel corpora that text interference takes place, the more intriguing question is the interaction of the two language systems in the translator’s bilingual brain and in the bilingual processing system of translation.
Untypical collocations

It has been found that not only do translations show untypical item frequencies (Gellerstam, 1996; Laviosa-Braithwaite, 1996), but also that they display collocational patterns that deviate from those found in comparable target language originals (Jantunen, 2004a,b; Mauranen, 2000). This tendency was put forward as a hypothetical translation universal by Mauranen (2000). The hypothesis and the results supporting it are of the T-type, and the research has been based on comparable corpora. This hypothesis is in accordance with Kenny’s (1999) results on collocational differences in hapaxes, and has found support in Jantunen’s (2001) investigation of near-synonymous intensifiers and their combinatory tendencies: Jantunen found both under- and over-representation of collocations, colligations and cluster sequences in translations. Variation was item-specific. Mauranen’s (2000) study on item combinations, both collocational and colligational, found that translations tended to favour combinations that, although possible in the target language system, were rare or absent from actual target language texts. Conversely, translations often had few or no instances of combinations that were frequent in target-language originals. Collocational preferences can be understood as reflections of a (native-speaker) sense of ‘naturalness’ or perhaps conventionality in language. It seems that the translation process, as a bilingual processing situation, interferes with or upsets the spontaneous, or ‘ideally monolingual’ processing of a native speaker. Since translators, as in fact most speakers in the world, are bi- or plurilingual, it is not surprising that their performance in a situation that demands activation of more than one of their languages differs from that in monolingual situations. Nevertheless, even their monolingual processing cannot be assumed to be entirely ‘pure’ in the sense of an ideal (monolingual) native speaker.
Obviously ideal speakers are never real, and recent evidence from second-language studies (Cook, 2003) indicates that a second language affects speakers’ native-language use, which adds another source of impurity to the idealised monolingual native speaker. This hypothesis is also related to other findings concerned with untypical frequencies, which have included other aspects, not just lexical frequency distributions. Gellerstam (1996) found, for example, that nominal headwords were more frequent in translations than in original Swedish texts, and that some reporting verbs were differently distributed, while Swedish attitudinal particles were less frequent. Another combinatorial observation is that fixed word sequences are more common in translations (Nevalainen, 2005). Colligations are at the interface of lexis and grammar. Jantunen’s (2004a) study included highly frequent intensifiers, which can be regarded as partly grammaticalised and be placed on a continuum between
grammar and lexis but otherwise these have been little investigated. Altogether, item combinations require further detailed research, as they should be able to throw more light on the tendencies of conventionality and unconventionality in translations. It seems that translators utilise the resources of the target language by making relatively greater use of what can be done rather than what typically is done. Corpus research is at its best in discovering what speakers typically do, and the new typicality that arises from translational language can perhaps best be captured by corpus methods.
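The contrast between what can be done and what typically is done may be illustrated by comparing the collocates of a node word in a translated and in an original corpus. The Python sketch below is a simplified illustration under stated assumptions: both corpora are plain token lists, collocates are all words within a fixed window around each occurrence of the node, and no association measure is applied. It is not the procedure of Mauranen (2000) or Jantunen (2004a,b), and the function names are hypothetical.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens of each `node`."""
    found = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            found.update(left + right)
    return found


def collocate_contrast(translated, original, node, window=2):
    """Split the node's collocates into corpus-specific and shared sets."""
    in_translated = set(collocates(translated, node, window))
    in_original = set(collocates(original, node, window))
    return {
        "only_translated": in_translated - in_original,
        "only_original": in_original - in_translated,
        "shared": in_translated & in_original,
    }
```

Combinations found only in the translated corpus are candidates for the untypical patterns discussed above, and those found only in the originals for combinations that translations neglect; with corpora of realistic size, frequency thresholds and an association measure would be needed before drawing either conclusion.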
Conclusion

The search for translation universals suggests that translations are texts of a particular, specific kind, which reflect the complex cognitive processes and the particular social contexts from which they arise, but which at the same time share features that distinguish them from other kinds of texts. They are not flawed texts in the sense that they would break ordinary grammar rules or change the meaning of lexical items, but they show subtle tendencies of stretching the potential of the target language towards new directions in some places, while seeming to neglect its full potential in others. For example, they tend to be explicit in a particular way, and manifest textual properties differently distributed from comparable non-translated texts; they may under-represent features unique to the target language and over-represent elements that are less common. Even though it seems to be the case that translations reflect their source languages (through ‘interference’), and thereby may be viewed as infiltrating target languages with alien influences, translations are not the only form of frequent language contact in today’s globalised world, and therefore hardly the main means of importing new linguistic trends. They cannot be regarded as the enemy within pure, isolated, self-contained languages. Languages are in a constant state of change on account of internal as well as external developments even without translational influence. Thus, although in many ways it is an apt description to call translations hybrid texts, because they retain traits from their sources as well as their targets and yet possess something uniquely their own, it is clear that they constitute a natural and substantial part of any language that exists in the written mode. The search for translation universals benefits from different methodological approaches and needs findings from a wide variety of languages and language pairs, both typologically distant and close.
In this way, it is by nature an international and collaborative research field. The specificity of translations is an important field of study beyond understanding the nature of translation in itself; by making sense of the shaping of language in the bilingual processing situation that translations
constitute, we can contribute to the understanding of other kinds of language contact. Translations straddle the individual and the social, whereby findings about their universal features can illuminate not only macro-level and micro-level processes in language change, but also language maintenance, as translations appear to display certain tendencies towards conservative or conventional language use, as we have seen in the above discussion.

References

Alves, F. (ed.) (2003) Triangulating Translation. Perspectives in Process Oriented Research. Amsterdam: John Benjamins.
Baker, M. (1993) Corpus linguistics and Translation Studies: implications and applications. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology. In Honour of John Sinclair (pp. 233-250). Amsterdam: John Benjamins.
Baker, M. (1996) Corpus-based Translation Studies: The challenges that lie ahead. In H. Somers (ed.) Terminology, LSP and Translation: Studies in Language Engineering in Honour of Juan C. Sager (pp. 175-186). Amsterdam: John Benjamins.
Blum-Kulka, S. (1986) Shifts of cohesion and coherence in translation. In J. House and S. Blum-Kulka (eds) Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies (pp. 17-35). Tübingen: Gunter Narr.
Chesterman, A. (2004) Beyond the particular. In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 33-49). Amsterdam: John Benjamins.
Cook, V. (2003) Introduction: The Changing L1 in the L2 user’s mind. In V. Cook (ed.) Effects of the Second Language on the First (pp. 1-18). Clevedon: Multilingual Matters.
Eskola, S. (2002) Syntetisoivat rakenteet käännössuomessa. Joensuu: University of Joensuu.
Eskola, S. (2004) Untypical frequencies in translated language: a corpus-based study on a literary corpus of translated and non-translated Finnish. In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 83-99). Amsterdam: John Benjamins.
Frawley, W. (1984) Prolegomenon to a theory of translation. In W. Frawley (ed.) Translation. Literary, Linguistic and Philosophical Perspectives (pp. 159-175). Newark: University of Delaware Press.
Gellerstam, M. (1996) Translation as a source for cross-linguistic studies. In K. Aijmer, B. Altenberg and M. Johansson (eds) Languages in Contrast (pp. 53-62). Lund: Lund University Press.
Greenberg, J.H., Osgood, C.G. and Jenkins, J.J. (1966) Memorandum concerning language universals. In J.H. Greenberg (ed.) Universals of Language (2nd edn, pp. xv-xxvii). Cambridge, MA: The M.I.T. Press.
Hansen, S. (2003) The Nature of Translated Text: An Interdisciplinary Methodology for the Investigation of the Specific Properties of Translations. Saarbruecken Dissertations. Computational Linguistics and Language Technology, Vol. XIII. Saarbruecken: University of Saarbruecken.
Jääskeläinen, R. (1999) Tapping the Process: An Explorative Study of the Cognitive and Affective Factors Involved in Translating. Joensuu: University of Joensuu.
Jantunen, J.H. (2001) Synonymity and lexical simplification in translations: A corpus-based approach. Across Languages and Cultures 2 (1), 97-112.
Jantunen, J.H. (2004a) Synonymia ja käännössuomi. Joensuu: University of Joensuu.
Jantunen, J.H. (2004b) Untypical patterns in translations. Issues on corpus methodology and synonymity. In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 101-126). Amsterdam: John Benjamins.
Johansson, S. (1998) On the role of corpora in cross-linguistic research. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-Linguistic Research: Theory, Method, and Case Studies (pp. 3-24). Amsterdam: Rodopi.
Kenny, D. (1999) Norms and creativity: Lexis in translated text. PhD Thesis. Mimeo. Manchester: Centre for Translation and Intercultural Studies, UMIST.
Kujamäki, P. (2004) What happens to ‘unique items’ in learners’ translations? ‘Theories’ and ‘concepts’ as a challenge for novices’ views of ‘good translation’. In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 187-205). Amsterdam: John Benjamins.
Laviosa, S. (2002) Corpus-based Translation Studies. Theory, Findings, Applications. Amsterdam: Rodopi.
Laviosa-Braithwaite, S. (1996) The English Comparable Corpus (ECC): A resource and a methodology for the empirical study of translation. PhD Thesis. Mimeo. Manchester: Centre for Translation and Intercultural Studies, UMIST.
Mauranen, A. (2000) Strange strings in translated language: A study on corpora. In M. Olohan (ed.) Intercultural Faultlines. Research Models in Translation Studies 1: Textual and Cognitive Aspects (pp. 119-141). Manchester: St. Jerome.
Mauranen, A. (2004) Corpora, universals and interference. In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 65-82). Amsterdam: John Benjamins.
Mauranen, A. and Kujamäki, P. (eds) (2004) Translation Universals: Do They Exist? Amsterdam: John Benjamins.
Mauranen, A. and Tiittula, L. (2005) Minä käännössuomessa ja supisuomessa. In A. Mauranen and J.H. Jantunen (eds) Käännössuomeksi (pp. 35-70). Tampere: Tampere University Press.
Nevalainen, S. (2005) Köyhtyykö kieli käännettäessä? Mitä taajuuslistat kertovat suomennosten sanastosta. In A. Mauranen and J.H. Jantunen (eds) Käännössuomeksi (pp. 141-162). Tampere: Tampere University Press.
Olohan, M. and Baker, M. (2000) Reporting that in translated English. Evidence for subconscious processes of explicitation? Across Languages and Cultures 1 (2), 141-158.
Paloposki, O. (2002) Variation in translation: Literary translation into Finnish 1809-1850. PhD Thesis. Mimeo. University of Helsinki.
Paloposki, O. (2005) Käännössuomen synty. In A. Mauranen and J.H. Jantunen (eds) Käännössuomeksi (pp. 15-32). Tampere: Tampere University Press.
Puurtinen, T. (1995) Linguistic Acceptability in Translated Children’s Literature. Joensuu: University of Joensuu Publications in the Humanities, 15.
Schäffner, C. and Adab, B. (2001) The idea of the hybrid text in translation: Contact as conflict. Across Languages and Cultures (pp. 167-180). Special issue on Hybrid Texts and Translation. C. Schäffner and B. Adab (guest eds).
Teubert, W. (1996) Comparable or parallel corpora? International Journal of Lexicography 9 (3), 238-264.
Tirkkonen-Condit, S. (2004) Unique items: over- or under-represented in translated language? In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 177-184). Amsterdam: John Benjamins.
Toury, G. (1980) In Search of a Theory of Translation. Tel Aviv: The Porter Institute for Poetics and Semiotics.
Toury, G. (1986) Monitoring discourse transfer: A test-case for a developmental model of translation. In J. House and S. Blum-Kulka (eds) Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies (pp. 79-94). Tübingen: Narr.
Toury, G. (1995) Descriptive Translation Studies and Beyond. Amsterdam: John Benjamins.
Trosborg, A. (1997) Translating hybrid political texts. In A. Trosborg (ed.) Analysing Professional Genres (pp. 145-158). Amsterdam: John Benjamins.
Tymoczko, M. (1998) Computerized corpora and Translation Studies. Meta 43 (4), 652-659.
Chapter 4
Norms and Nature in Translation Studies¹
KIRSTEN MALMKJÆR
Introduction

Norms have played a central role in Descriptive Translation Studies, because ‘it is norms that determine the (type and extent of) equivalence manifested by actual translations’ (Toury, 1995: 61, emphasis in the original). Equivalence is the name given to the relationship, of whatever type and extent, between a translation and its source text, and the existence of such a relationship is axiomatic in the theory (Toury, 1980a: 45). There is a degree of theoretical tension between norms and another notion which has recently generated interest within Translation Studies, the notion of the universal. The tension between the concept of the norm and the concept of the universal arises because ‘there is a point in assuming the existence of norms only in situations which allow for different kinds of behaviour’ (Toury, 1995: 55). Insofar, therefore, as the notion of the universal in translation theory implies invariable behaviour, the explanatory power of the norm concept is inversely proportional to that of the concept of the translation universal: the more variable translational behaviour can be assumed to be, the more theoretical power accrues to the norm construct; and the less variable translational behaviour can be assumed to be, the less theoretical power accrues to the norm construct. I will begin by trying to establish what we mean by the terms ‘norm’ and ‘universal’ in Translation Studies.
Norms in Translation Studies

The notion of norms enters the broad field of Translation Studies with Toury’s essay ‘The nature and role of norms in Translation Studies’, first published in In Search of a Theory of Translation (1980b) and reproduced in somewhat more than its entirety in his Descriptive Translation Studies and Beyond (1995), to which my page references will be. Here, norms are understood as sociocultural phenomena (Toury, 1995: 62) situated
between two extremes of a scale of sociocultural constraint: absolute rules at one end and complete idiosyncrasy at the other:

Rules ... NORMS ... Idiosyncrasy

A norm may be more or less close to one of these extremes and its position on the scale is subject to change, disappearance and appearance over time (Toury, 1995: 54); that is, norms are basically unstable (p. 62). Norms are assimilated by individuals in the course of their socialisation process, and adherence to them or deviance from them has the potential to incur approval or sanction of various kinds, including positive or negative criticism (p. 55). Norms are not directly observable, but they can be learnt and also studied through observation of patterned, recurrent behaviour, for example in talk aloud protocol studies, or through observation of the immediate results of translational behaviour, texts. Translation scholars may consider these data in light of what is known about extratextual factors such as translation policy, publishing constraints, sociocultural mores and customs and so on (Toury, 1995: 65) to arrive at verbalisations of the norms; but outside of theory, norms are rarely explicitly verbalised and tend to be followed by translators almost unawares. This understanding of the norm concept does not differ substantially from that evidenced in general, socioculturally oriented definitions of norms, such as the following:

Expectations of how a person or persons will behave in a given situation based on established protocols, rules of conduct or accepted social practices. (www.asq.org/glossary/n.html)

A way of behaving or believing that is normal for a group or culture. All societies have their norms, they are simply what most people do. Deviants break norms. Some norms are enshrined in law and society punishes those who deviate from them. Breaches of unwritten norms are unofficially punished. This is important to science, because innovation is a form of deviancy science formally encourages. (Hewitt, 2005: Glossary)

An expected standard of behaviour and belief established and enforced by a group. (www.socialpolicy.ca/n.htm)

Shared belief that a person ought to behave in a certain way at a certain time. (Stafford & Scott, 1986: 81)

These definitions highlight both the social and the expectational nature of norms. Norms belong to social groups. They are not absolute (legislative) rules, and people are able to contravene norms as well as to adhere to them.
Although norms, in this sense, obviously relate to systems of belief among groups of people about what is appropriate behaviour at a certain time in certain circumstances, it is important to note that what people believe should be done may not necessarily be what even those who hold the belief actually do. In the social and socially applied sciences, it is customary, therefore, to distinguish between attitudinal norms, which have to do with ‘shared beliefs or expectations in a social group about how people in general or members of the group ought to behave in various circumstances’ (Perkins, 2002: 165), and behavioural norms, which have to do with ‘the most common actions actually exhibited in a social group’ (Perkins, 2002: 165). Attitudinal norms do not necessarily determine behaviour; as Perkins puts it (2002: 165): ‘How most other community members believe everyone should behave and what behaviour is most common may be correlated, of course, but each component may also be somewhat distinct’. Clearly, a distinction might similarly be drawn in Translation Studies between what people’s (including translators’) attitudes are to translational phenomena, which might be partially tapped by protocol and interview studies, and what some people (translators) actually do when translating. But whereas it is possible in much research in the social sciences to identify behavioural norms with certain manifestations of the behaviour in question (called normal behaviour), this is not possible in Translation Studies, because, like all behaviour involving language, translating behaviour is primarily mental. Its results, however important and central to the immediate aim of translation, communication, are, nevertheless, merely the outward signs of a phenomenon which, as Locke ([1690] 1977: Book Three, Chapter 2) remarked, is itself ‘invisible and hidden from others’. 
Perkins studies college students’ drinking behaviour, and the behavioural norm in this case can be established by identifying, categorising, quantifying and carrying out statistical calculations on instances of this behaviour. The relationship between the behavioural norm and its manifestations is therefore relatively simple and direct (however complicated the relationship between a drinker, his or her attitude to drinking, and his or her actual drinking behaviour may be). Translational norms, in contrast, stand in a more complex relationship to the evidence for their operation. Even those of Toury’s norm categories that are most directly related to the linguistic material that ends up constituting a translation are guides to the selection of this material; they are never identified with it: Operational norms in general direct decisions made during translating and govern relationships between the translation and the source text (Toury, 1995: 58); the subclass, matricial norms, govern the fullness of translation, distribution of material in it and its segmentation (pp. 58-9); and the subclass, textual-linguistic norms, govern ‘the selection of material to formulate the target text in’ (p. 59). In other words, and as indicated in the introduction above, Toury draws a clear distinction
between the norms on the one hand and, on the other hand, textual material in actual translations, which, together with textual material in source texts, manifest equivalence relationships. Equivalence relationships are categorically different from norms, and linguistic material falls into an additional, separate, third category. So the relationship between (a) textual material, which is concrete, and which is distributed across two texts that stand to one another as translation to source text; and (b) equivalence, the relationships that are obtained between textual material in the translation and textual material in the source text; and (c) norms, which are socioculturally shared psychological phenomena is extremely complex, and it is important not to slip into ways of speaking or writing which might suggest identity between translation norms and their manifestations. Norms share this lack of identity with their manifestations with other phenomena that influence linguistic behaviour, but from which norms nevertheless differ in important respects. Consider, for example, the Gricean maxims of conversational co-operation (Grice, 1975). The maxims, and the principle of co-operation itself, are vocalisations of demands for connectivity imposed on conversational behaviour by human rationality: If our talk exchanges are to be rational, they must consist of utterances that are connected to each other, and the Co-operative Principle, ‘Make your conversational contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged’ (Grice, 1975: 45), ensures that this connectedness can be maintained, sometimes or perhaps normally by way of a complex conversational dance around, rather than inside spaces occupied by notions such as literalness, truth and explicitness. 
The Principle ensures that conversation can happen, but it does not influence its form except insofar as the form is relevant to the perceived rationality of the conversation. A person can contravene the Maxims that fall under the Principle, but not the Principle itself, while still being considered rational. Undeclared, non-obvious non-adherence to a Maxim, such as for example lying or pretending to have more knowledge than one actually has, will mislead. Obvious non-adherence, such as for example saying more or less than or something different than a questioner might reasonably expect, will generate implicature, that is information that the addressee adds to what is actually being said and which reinstates the Maxim. Norms, in contrast, (generally) regulate behaviour; but the behaviour survives, though it may be considered deviant, even if the norms are not adhered to (until perhaps non-adherence of a certain determined kind itself becomes the norm). We might encapsulate the differences between norms and maxims by saying that whereas norms are socially constrained, the Maxims of the Co-operative Principle are cognitively constrained. We
could represent this on a cline of source of constraint, going from the social to the cognitive.

Socially constrained          Cognitively constrained
Norms                         Co-operative Principle
Let us now consider the similarities and contrasts between these two notions and the notion of the universal.
Norms and Universals
The concept of the translation universal is not exactly new (see for example Toury, 1977), but the publication of Baker’s (1993) paper, ‘Corpus linguistics and translation studies: Implications and applications’, is generally acknowledged as the inspiration for the recent upsurge of interest in the concept (see for example Mauranen & Kujamäki, 2004a: 1). The laudable starting point of Baker’s (1993: 234) paper was to argue ‘that translated texts record genuine communicative events and as such are neither inferior nor superior to other communicative events in any language. They are however different’. Translation universals are introduced as a potentially identifying characteristic of this difference, defined by Baker (1993: 243) as universal features of translation, that is features which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems. If, at first sight, it seems a little odd to define something that merely occurs typically as ‘universal’, we should remember that in the case of linguistic phenomena, there is a long tradition of doing so. On Greenberg’s list (1966) of 45 universals (20 morphological, 18 syntactic and seven word order universals), based on the study of 30 languages, we find both absolute universals such as: ‘All languages have pronominal categories involving at least three persons and two numbers’ (Universal 42), and a number of universal tendencies, to which there are exceptions, such as for example: ‘In languages with prepositions, the genitive almost always follows the governing noun, while in languages with postpositions it almost always precedes (Norwegian has both genitive orders)’ (Universal 2).
To qualify as a universal tendency, or, as these are also known, as a non-absolute or statistical universal, the tendency in question must, however, be demonstrably statistically significant (see Song, 2001: 6), so in this tradition a universal can be defined as a property ‘which must at least be true of the majority of the human languages’ (Song, 2001: 8). Both absolute and non-absolute universals may be either implicational (‘if a language has
feature x it will (tend to) have feature y’) or non-implicational (‘all languages (tend to) have feature z’). The Greenberg tradition of research into language universals allows for a number of types of explanation for the existence of universals. For example, Hawkins (1994) proposes that certain orders of word and constituent predominate because they ease language comprehension and production. Others employ diachronic explanations (Bybee et al., 1990) and yet others seek to integrate processing explanations with diachronic explanations (Greenberg, 1957; Hall, 1988). In this, the typologically oriented tradition of research on universals differs absolutely from the Chomskyan. In the Chomskyan tradition, cognition alone, in its manifestation as Universal Grammar (UG), the initial state of the language acquisition device, is used to explain what is universal in languages, both (a) the principles that constrain the forms of languages and (b) the parameters that define the binary variations that they display (Chomsky, 1981). These are considered innate and include for example the Locality Principle, which says that grammatical operations are local (so that for example auxiliary inversion preposes the closest auxiliary and wh-movement preposes the closest wh-expression) and Parameters like the wh-Parameter which determines that a language either does (English does) or does not (Chinese does not) front wh-expressions, or the Null-Subject Parameter which determines that a language either does (Italian does) or does not (English does not) allow finite verbs to have null subjects. There are no implicational principles and parameters, and there are only absolute principles and parameters (see further Radford, 2004).2 Followers of this faith tend to consider that any viable, absolute universal can be interpreted in terms of the Principles and Parameters of UG. 
Given that principles and parameters are innate, they differ as absolutely from norms (see previous section) as it is possible to do: they are way off the rules end of the scale of variation on which norms are situated. It is not within our power to regulate Principles and Parameters; they simply are. In fact, even the non-absolute universals of the Greenberg tradition must be precluded from the rules-norms scale, because even if it is true that some of these may be explained in terms of processing ease and/or diachronicity, they are not subject to social control, coercion or influence of the kind that may result from attitudes: We are dealing here with matters not of sociolinguistics, but of psycho- and historical linguistics. We can, however, add the non-absolute universals to our constraint-determination scale, introduced at the end of the previous section. Whereas norms are constrained by attitudes that individuals develop in the course of social interaction with other individuals, and whereas Principles and Parameters are features of UG and therefore cognitively determined,
non-absolute universals arguably resemble the Co-operative Principle in resulting from cognitive constraints on human linguistic interaction:

Socially constrained     Cognitively constrained      Cognitively determined
Norms                    Co-operative Principle       Principles and Parameters
                         Non-absolute universals      Absolute universals
The question we must now ask, I think, is whether translation universals are of the kind that can be explained purely on cognitive grounds, or whether they are more like the type for which scholars in the Greenberg tradition offer processing ease and diachronicity explanations and which, therefore, may be universal simply because language users in every culture tend to find it advantageous to employ them; or neither.
What is the Nature of Translation ‘Universals’?
Baker’s original formulation seems to suggest a purely cognitive source and explanation of translation universals, whereas the examples she uses to illustrate what a translation universal might be are strongly suggestive of explanation in terms of the kinds of norm that might guide translational behaviour; at most, it seems to me, the majority of these candidates for universalhood invite explanation in terms of processing ease or diachronicity, rather than in terms of innate aspects of the human cognitive apparatus. Baker (1993: 243-245) lists as candidates for the status of translation universal explicitation, disambiguation, simplification, conventionalisation, avoidance of repetition, exaggeration of features of the target language and manifestations of the so-called ‘third code’. Each of these, she says:
can be seen as a product of constraints which are inherent in the translation process itself, and this accounts for the fact that they are universal (or at least we assume they are, pending further research). They do not vary across cultures. Other features have been observed to occur consistently in certain types of translation within a particular socio-cultural and historical context. These are the product of norms of translation that represent another type of constraint on translational behaviour. (Baker, 1993: 246)
Of course, there are two senses in which the term ‘translation process’ can be used: It can be used to refer to the cognitive or mental process or processes that take place in the minds of translating translators, including and focusing mainly on subliminal processing; and it can be used to refer to
the variably social, physical and mental (but excluding subliminal) processes in which clients, translators and a variety of implicated others consciously engage in order to produce a translation. The contrast Baker invokes with ‘other features’ that are culture specific and are the product of normative constraints strongly suggests the cognitive-mental-subliminal understanding of ‘process’ in the quotation above, as does the reference to the translation process as a causal agent hypothesised ‘rather than’ the confrontation of specific linguistic systems, in the description of the features as ‘linked to the nature of the translation process itself rather than to the confrontation of specific linguistic systems’ (Baker, 1993: 243). It seems to me that of the candidates for universal-hood proposed by Baker (1993), listed by Chesterman (2004) and discussed by Mauranen and Kujamäki (2004a), very few qualify for the status of cognitively determined universals. One that does qualify, though, is identified by Tirkkonen-Condit (2004). Tirkkonen-Condit (2004) finds that clitics and verb types unique to Finnish occur more rarely in translations into Finnish than in text originally written in Finnish. Similar findings have been reported in Lykke Jakobsen’s (1986) study of (among other words) the pronoun ‘man’ and the discourse particle ‘jo’ in original writing in Danish and in translation into Danish from English, in Gellerstam’s (1986) study of translationese, which compares novels translated into Swedish with novels originally written in Swedish, and in Eskola’s (2004) study of nonfinite constructions in Finnish. Tirkkonen-Condit suggests that the phenomenon of under-representation in translation of features unique to the target language arises because such features are under-represented in a translator’s mental lexicon while he or she is translating. Nothing in the source text is likely to trigger them.
This is an excellent candidate for the status of a universal: The phenomenon receives a cognitive explanation, and similar results have been found for unrelated languages, Swedish and Danish on the one hand and Finnish on the other. Notice that this represents a return to the idea of interference in the translation process named by Toury (1995: 274-279): ‘the law of interference’. Tirkkonen-Condit’s study is carried out using the methodology proposed by Baker (1993: 245-246), which may be roughly described as follows: Take a corpus of translations into L from a large number of languages and compare it with a corpus of texts originally written in L, looking for evidence of feature F. Do this for as many Ls as possible. If it is found, for each pair of translation corpus and non-translation corpus, that evidence for F occurs more frequently in the corpus of translated text, then we will have cause to believe that it does so as a result of
the translation process and not because of any relationship between any language pair. We may then be justified in calling F a translation universal. Yet, the study contradicts Baker’s understanding of a translation universal as arising from the translation process itself and, by implication, as therefore not having to do with the relationship between the languages or textual systems involved. Tirkkonen-Condit’s findings depend crucially on the relationship between the languages involved. But what is extremely interesting is that her study suggests that we have been looking at the question of influence or interference from the wrong end of the pole. As Toury (1995: 275-276) points out, interference is an inherent part of the translation process; how can it not be, given that a translation is made on the basis of another text in another language? But Tirkkonen-Condit’s study suggests that what determines the outcomes of this interference may be the target pole, if not alone, then as much as or more than the source pole, which we have tended to think of as the major determinant of the shape of the translation. Differences between translations into L and texts originally in L are determined as much, if not wholly, by L’s unique features, rather than features of the language of the original for the translation. This, it seems to me, is among the most interesting findings to have arisen out of the search for translation universals to date. Further, it seems to me that if the concept of the universal is to retain any theoretical bite in our discipline, we would do well to reserve it for use in connection with phenomena such as this, for which it makes sense to produce a cognitively based explanation. Many, possibly most, other candidates for universal status would be better accounted for by the norm concept, which therefore remains to do its job relatively undisturbed within Descriptive Translation Studies.
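The comparative methodology described above (a corpus of translations into L set against a corpus of texts originally written in L, checked for a feature F) can be illustrated with a minimal sketch. It assumes, purely for illustration, that F can be operationalised as a small set of word forms and that a normalised frequency per 1,000 tokens is an adequate measure; the function names, the tokenisation and the toy Danish strings are hypothetical and are not taken from any of the studies cited.

```python
import re
from collections import Counter


def rate_per_thousand(texts, feature_items):
    """Occurrences of any item in feature_items per 1,000 tokens of texts."""
    # Crude tokenisation: lower-cased runs of word characters (an assumption).
    tokens = [tok for text in texts for tok in re.findall(r"\w+", text.lower())]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[item] for item in feature_items)
    return 1000.0 * hits / len(tokens)


def compare_feature_rates(translated, original, feature_items):
    """Compare the rate of a feature in translated and non-translated texts in L."""
    t_rate = rate_per_thousand(translated, feature_items)
    o_rate = rate_per_thousand(original, feature_items)
    return {
        "translated": t_rate,
        "original": o_rate,
        # Positive: over-represented in translation; negative: under-represented.
        "difference": t_rate - o_rate,
    }


# Invented toy strings, for illustration only (not data from any study).
original_texts = ["man kan jo ikke vide det"]
translated_texts = ["du kan ikke vide det"]
result = compare_feature_rates(translated_texts, original_texts, {"man", "jo"})
print(result)
```

A negative difference corresponds to the under-representation pattern discussed above, a positive one to the over-representation that the original formulation looks for. Real studies would of course need much larger corpora, a principled definition of F, and a significance test rather than a bare difference in rates.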
It goes without saying, I think, that corpus studies are extremely well suited to the search for potential evidence for norms, though, equally obviously, they cannot be used to reveal the norms themselves.
Notes
1. This is an expanded version of a paper that first appeared in SYNAPS 16/2005, pp. 13-19. I am grateful to Ingrid Simonnæs, editor of SYNAPS, for her initial editing of the paper and to the editorial committee for permission to reproduce the paper here.
2. A null subject is a subject which is not physically (phonetically or graphically) present in a text, but whose grammatical and semantic presence is understood. In English, imperative structures standardly have null subjects, and some structures commonly used in speech, such as ‘Can’t find my pen. Must be on my desk at home’ display ‘null truncated subjects’ (Radford, 2004: 349). A language that allows any finite clause to leave implicit a subject which would have been realised by a pronoun, is a null-subject language. For example (Radford, 2004: 349) in Italian, it is
possible to say simply ‘Sei simpatica’ (‘Are nice’), where English demands ‘You are nice’.
References
Baker, M. (1993) Corpus linguistics and translation studies: Implications and applications. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology: In Honour of John Sinclair (pp. 233-250). Amsterdam and Philadelphia: John Benjamins.
Bybee, J.L., Pagliuca, W. and Perkins, R.D. (1990) On the asymmetries in the affixation of grammatical material. In W. Croft, S. Kemmer and K. Denning (eds) Studies in Typology and Diachrony (pp. 43-58). Amsterdam and Philadelphia: John Benjamins.
Chesterman, A. (2004) Hypotheses about translation universals. In G. Hansen, K. Malmkjær and D. Gile (eds) Claims, Changes and Challenges in Translation Studies (pp. 1-14). Amsterdam and Philadelphia: John Benjamins.
Chomsky, N. (1981) Lectures on Government and Binding. Dordrecht: Foris.
Eskola, S. (2004) Untypical frequencies in translated language: A corpus-based study on a literary corpus of translated and non-translated Finnish. In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 83-100). Amsterdam and Philadelphia: John Benjamins.
Gellerstam, M. (1986) Translationese in Swedish novels translated from English. In L. Wollin and H. Lindquist (eds) Translation Studies in Scandinavia (pp. 88-95). Lund, Sweden: CWK Gleerup.
Greenberg, J.H. (1957) Order of affixing: A study in general linguistics. In J.H. Greenberg, Essays in Linguistics. Chicago: University of Chicago Press.
Grice, H.P. (1975) Logic and conversation. In P. Cole and J.L. Morgan (eds) Syntax and Semantics, 3: Speech Acts (pp. 41-58). New York: Academic Press.
Hall, C.J. (1988) Integrating diachronic and processing principles in explaining the suffixing preference. In J.A. Hawkins (ed.) Explaining Linguistic Universals. Oxford: Blackwell.
Hawkins, J.A. (1994) A Performance Theory of Order and Constituency. Cambridge: Cambridge University Press.
Hewitt, J.A. (2005) A habit of lies: How scientists cheat. Glossary. On WWW at http://freespace.virgin.net/john.hewitt1/pg_gloss.htm. Accessed 18.6.07.
Lykke Jakobsen, A. (1986) Lexical selection and creation in translation. In I. Lindblad and M. Ljung (eds) Proceedings from the Third Nordic Conference for English Studies, Hässelby, Sept 25-27, 1986, Volume I (pp. 101-112). Stockholm, Sweden: Almqvist and Wiksell International.
Mauranen, A. and Kujamäki, P. (2004a) Introduction. In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 1-11). Amsterdam and Philadelphia: John Benjamins.
Mauranen, A. and Kujamäki, P. (eds) (2004b) Translation Universals: Do They Exist? Amsterdam and Philadelphia: John Benjamins.
Perkins, H.W. (2002) Social norms and the prevention of alcohol misuse in collegiate contexts. Journal of Studies on Alcohol/Supplement 14, 164-172.
Radford, A. (2004) English Syntax: An Introduction. Cambridge: Cambridge University Press.
Song, J.J. (2001) Linguistic Typology: Morphology and Syntax. Harlow: Pearson Educational Limited.
Stafford, M.C. and Scott, R.R. (1986) Stigma deviance and social control: Some conceptual issues. In S.C. Ainley, G. Becker and L.M. Coleman (eds) The Dilemma of Difference (pp. 77-91). New York: Plenum.
Tirkkonen-Condit, S. (2004) Unique items over- or under-represented in translated language? In A. Mauranen and P. Kujamäki (eds) Translation Universals: Do They Exist? (pp. 177-184). Amsterdam and Philadelphia: John Benjamins.
Toury, G. (1977) Translational Norms and Literary Translation into Hebrew, 1930-1945. Tel Aviv: The Porter Institute for Poetics and Semiotics, Tel Aviv University. [In Hebrew].
Toury, G. (1980a) Translated literature: System, norm, performance. Towards a TT-oriented approach to literary translation. In In Search of a Theory of Translation (pp. 35-50). Tel Aviv: The Porter Institute for Poetics and Semiotics, Tel Aviv University.
Toury, G. (1980b) In Search of a Theory of Translation. Tel Aviv: The Porter Institute for Poetics and Semiotics, Tel Aviv University.
Toury, G. (1995) Descriptive Translation Studies and Beyond. Amsterdam and Philadelphia: John Benjamins Publishing Company.
Chapter 5
Being in Text and Text in Being: Notes on Representative Texts
KHURSHID AHMAD
Introduction
The aim of corpus linguistics is ‘to base accounts of language on corpora derived from systematic recordings of conversations and real discourse of other kinds, as opposed to examples obtained by introspection, by judgement of grammarians, or by haphazard observation’; and a corpus is defined ‘as any systematic collection of speech or writing in a language or variety of a language’ (Matthews, 1997: 78). This definition of corpus linguistics makes the enterprise of collecting and studying ‘real discourse of other kinds’ a scientific enterprise. Proponents of scientific enterprises claim to be rational and objective, and in these enterprises there is no room for introspection or ‘haphazard observation’. So, corpus linguistics is a scientific enterprise, as science is concerned either with a connected body of demonstrated truths or with observed facts systematically classified. But who compiles the ‘connected body’, who observes the ‘facts’ and who classifies? In different scientific enterprises, from astronomy to zoology, it is not impossible to discern the influence of the sociopolitical outlook of an individual scientist or that of a group, learned societies or protest groups. The transformation of magic (alchemy) into the indispensable science of chemistry has been accomplished over the last 200 years; pre-Galileo scientists earnestly believed that our earth was the centre of our universe; pre-Einsteinians fervently believed, and did often find, the elusive ether; and there is an earnest hunt for quarks currently underway in many laboratories of the world. In each case one can find a sociopolitical dimension to the scientific enquiry. There are questions that arise from the definitions of corpus linguistics and of corpus. For instance, whose conversations are being recorded and who is recording the conversation? How is the real discourse of ‘other kinds’ (written texts of a great variety) selected and who is selecting the texts?
The influence of the observer on the observed has been acknowledged in the esoteric science of the micro-world: quantum theory, to be precise, acknowledges this ‘interference’ through the widely reported principle of uncertainty. The corpus linguistics literature will tell us that conversations and writings in a given corpus are carefully
selected by a group, and in some cases the texts are randomly chosen from a catalogue that was available to the corpus builders. The design of a corpus is a human activity, carried out individually or in groups, and as such will always carry the unintended influence of the designer(s): the socioeconomic origins of the human designers, and their past and current working environments, have to be acknowledged. Consider the case of the well known and widely used corpus of English, the British National Corpus (BNC). This unintended designer influence shows in the selection of texts. The written part of the BNC (c. 90 million tokens) comprises ‘extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text’. But the newspaper texts in the BNC are mainly quality, broadsheet newspapers published in London, the romantic fiction novels are largely from one publisher, the selection of specialist texts is similarly skewed to small samples in narrow subdisciplines, and there is little by way of drama or poetry. The BNC is still one of the best sources of how words and phrases are actually used in modern British English: like the late John Sinclair we should all ‘trust the text’, but must also have some information about the provenance of the texts in a corpus. Some of the 21st-century corpus linguists have argued that one of the antecedents of corpus linguistics is hermeneutics, and other equally important ancient traditions of meditation and reflection, which involved systematic study of certain texts (Teubert, 2003). For many pre-Second World War literary critics and teachers of literature, epic poems, lyrical poems and drama symbolised language (Wellek & Warren, 1963).
A systematic study of representative texts from these genres was a necessary and sufficient condition for understanding literature and, by implication, language. It can be argued that written texts are but a trace of evidence of the work of a scientist, spin doctor, engineer, philosopher, journalist, linguist, and lawyer or art critic for instance. What is written down is perhaps a fragment of the very long life of an individual. What is written is usually in response to some stimulus or the other, another text, sound or image perhaps. What is written down, with very rare exceptions, can be viewed in the context of other similar/dissimilar writings. Given all these caveats, the written text (or speech excerpt) is the only evidence we can share amongst ourselves, especially if the author is otherwise unavailable. The question again is this: how was the evidence collected and analysed and to what extent can the evidence be used to create an account of the language used by the scientist, the spin doctor, the philosophers and others?
Being in Text
The modern corpus linguistics movement owes much to the efforts of Randolph Quirk and John Sinclair, both of whom were keenly interested in the teaching and learning of the English language and emphasised the study of English in a wider variety of contemporary texts. Quirk’s Survey of English Usage was motivated, in part, by the finding that ‘the masses of material compiled over the years prove quite inadequate to serve as the basis of even elementary teaching-grammars’ (Quirk, 1968: 70). The guiding principle for selecting texts for Sinclair’s Bank of English was to ensure that the ‘texts should be in some sense typical, the sort of English that a learner would want to understand’ (Halliday, 1993: 10). One of the earliest users of the Survey was Jan Svartvik who studied [v]oice in the English verb (1966) and developed a corpus of 323,000 words including texts from science and technology. Svartvik’s key conclusion was that there was ‘value’ in using data for the construction of grammar. For dictionary makers at Longman, for a time working under the guidance of Randolph Quirk, ‘a corpus is much more effective in giving evidence about the norms of the language’; the value of ‘native speaker intuition’ was regarded ‘highly’ at Longman due to his or her ‘experience of very long-term, contextually diverse exposure to millions of lexical items in their natural environment. [. . . But] a corpus will correct some of the apprehension that we all may have about certain words and grammatical patterns’ (Summers, 1993: 183). The Bank has been used in the production of dictionaries and a range of grammar texts such as Sinclair’s (1991) and has grown from a collection of 20 million words in 1985 to a corpus of 450 million words in 2002 and on to a corpus of one billion words now.
The considerable practical utility of a corpus for dictionary making is matched by the excitement, on the part of its principal creator, John Sinclair, that the Bank offers a point of departure in linguistic research: ‘What is new about this [lexical computing] project, apart from the technology, is the ability to get for the first time a view of a language which is both broad and comprehensive. Many thousands of the observations are about the commonest patterns in the language. For example, we think of verbs like see, give, keep, as having each a basic meaning; we would expect those meanings to be the commonest. However, the database tells us that see is commonest in uses like I see, you see, give in uses like give a talk and keep in uses like keep warm. The power of meanings made by phrases and near-phrases like the above is gradually understood’ (Sinclair, 1987: vii). Twenty years later, in 2007, we find that the study of collocation patterns for inferring meaning automatically is essential for the development of computer-mediated understanding of texts, including the next stage of the development of the semantic web for civilian purposes and the
automatic identification of threats embedded in telephone and e-mail traffic for military purposes. (My own studies of how to extract information about the sentiments of financial markets from news wire, or to see how knowledge obtained in research laboratories is converted into products (Ahmad & Al-Sayed, 2005), suggest that Sinclair’s observation related to collocation and meaning will have impact beyond the confines of lexicography.) Randolph Quirk, Michael Halliday and John Sinclair have demonstrated the utility of large bodies of written texts and speech excerpts for compiling lexica and grammars, and for understanding the nature and function of language. For Quirk (1968: 72, 79), the analysis of a large body of texts will enable language researchers to investigate whether or not ‘[p]hrasal construction and interpretation alike depend upon an indissociable complex of semantic analogy and grammatical analogy’. Halliday’s lexicogrammar, ‘a vast network of choices, through which the language construes its meaning: like the choices, in English, between ‘positive’ and ‘negative’, or ‘singular’ and ‘plural’ . . .’ (Halliday, 2004: 2), can perhaps only be observed in a text corpus: the choice may have been made at random in one single text by one author, but when one sees the same choice exercised in a number of texts written by different authors within a corpus, then the observations related to ‘positive/negative’ or ‘singular/plural’, for example, cannot be dismissed as haphazard or introspective. In Sinclair, there is the notion of semantic prosody (Sinclair, 1991): the manner in which collocation patterns reveal pragmatic intent; for example, set in invariably has a negative connotation, and handsome is associated with maleness and gravitas; ‘connotations through association’ is what Sinclair (2005) calls these meaningful collocations.
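The kind of observation Sinclair reports (that see is commonest in uses like I see, you see) rests on counting what occurs next to a word across many texts. As a rough illustration only, and not a description of the procedures actually used on the Bank of English, the following sketch counts the words found within a fixed window around a node word; the function name, the window size and the toy sentences are all hypothetical.

```python
import re
from collections import Counter


def collocates(texts, node, window=1):
    """Count words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for text in texts:
        # Crude tokenisation: lower-cased runs of word characters (an assumption).
        tokens = re.findall(r"\w+", text.lower())
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Invented toy sentences, for illustration only.
sample = ["I see", "you see", "I see what you mean"]
print(collocates(sample, "see").most_common())
```

Raw counts of this kind are only a starting point; work on collocation normally weighs them against the overall frequency of each word, which this sketch does not attempt.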
The key observation the authors have made is about the relationship between frequency of usage of a linguistic unit and its acceptability by a linguistic community. This frequency-based approach appears very helpful in finding the indissociable complex (Stein & Quirk, 1991: 197–203), the network of choices (Halliday, 1991) and the semantic prosody (Renouf & Sinclair, 1991). The principal sentiment in the emergent field of corpus linguistics is that language is a ‘living system’ and the best way to observe language phenomena, behaviour or miracle, is to study language as it is being used. This sentiment is also found in writings that are essentially rooted in rationalism. For Zellig Harris a language (system) appears to have a self-organising quality in that ‘the existing forms limit what is available for use, and also to some extent affect (direct) this use, while on the other hand use favors which forms are preserved or dropped, and in part also how they change’ (1991: 291). The influence of Quirk, Halliday and Sinclair on modern-day corpus linguistics has been summed up by Michael Stubbs as a set of ‘nine principles’ that have been central to much British linguistics since the 1930s (Stubbs, 1996: 23–44). It appears that the rallying cry of the corpus linguists is the first principle: (i) ‘Language should be studied in actual, attested, authentic instances of use, not as intuitive, invented, isolated sentences’ (Stubbs, 1996: 28). Two of the other principles are overarching: (ii) ‘linguistics is essentially a social and applied science’ (p. 25) and (iii) ‘language in use transmits the culture’ (p. 43). One of Stubbs’ important principles relates to the study of whole texts (p. 32). A number of corpora, however, comprise excerpts rather than whole texts; initially, this was due to the limitations of memory available in computer systems.
I will start by looking at the compound term, representative text, used extensively in corpus linguistics, and one of its principal antecedents, standard language. This will be followed by a survey of corpus design and realisation over the last 50 years or so, where we will try to understand the key design features used in the construction of English text corpora. I would like to argue that implicit in the design features are the unarticulated linguistic, societal and personal preferences. By way of conclusion, I will mention the newer opportunities that are becoming available through the spread of digital media.
Text in Being
The word ‘text’ has a range of meanings in authoritative dictionaries. Most of these dictionaries trace the etymology of the word to its Latin root, texere, meaning to weave or fabricate. Originally used to describe passages from Judeo-Christian Scriptures (or another authoritative source), the word ‘text’ has been used in a number of senses. Sometimes it is used to refer to an original text rather than the derivatives of the same, such as a paraphrase or a translation, and at other times the word is used as a deprecated form of the compound ‘text book’. Modern computational linguists have introduced a curious and perhaps paradoxical distinction between ‘natural language’, by which they mean written text, and ‘speech’, by which they mean spoken output that may be digitally recorded. Then there are political writers for whom text only stands for a speech that has appeared in print. The word representative also has a number of highly interrelated senses: a representative object, person or place, serves to ‘represent, portray, figure, or symbolize’ a collection of objects, a collective of persons or a group of places. The etymology of the word can be traced back to the Latin repraesentativus. The 19th-century proponents of literary criticism (cf. Fowler, 1987) were looking for a representative expression that was uniquely capable of revealing the truth of contemporary life in society. Some of the corpora created since the 1960s do have poetry and drama as text types. But more importantly, the super-categories used by some corpus builders, imaginative and informative texts, hark back to the division between the more creative lyrical poetry and the more factual epic poetry (Wellek & Warren, 1963). Imaginative texts include works of all kinds of fiction, and informative texts are largely drawn from texts written by scientists and technologists, and from other specialist paraphernalia of 21st-century life: administrative texts, safety texts and so on. ‘Drama’, even when excluded as a text type, is present through speech excerpts derived from radio programmes, television programmes or surreptitious recordings of academic committee meetings. The notes of the guardians of the British National Corpus here are particularly interesting: ‘The overall distribution between informative and imaginative text samples in the BNC was set to reflect the influential cultural role of literature and creative writing. [. . .] Eight informative domains were arrived at by consensus, based loosely upon the pattern of book publishing in the UK during the past 20 years’ (Aston & Burnard, 1998: 29). There is, it appears, a change in what is regarded as the representative delivery medium for texts: in the early period (1960s), the focus was on literary texts, then the newspapers had their turn (1970s), followed by learned texts (1980s). And now there are electronic texts (e-mails, web logs), which are to form 15% of the newly mooted American National Corpus. Over the last 50 years, the corpus builders have tended to include more and more informative texts compared to imaginative texts (see Figure 5.1).
[Figure 5.1: bar chart; vertical axis ‘%age Composition’ (0% to 100%); horizontal axis ‘Corpus’ (Longman, Brown, BNC, Bank of English, ICE); series: Imaginative, Informative]
Figure 5.1 The distribution of texts according to genre in different corpora. The earliest date (1960) refers to the release of the Brown Corpus and the latest to that of the International Corpus of English (2000). The data for the Bank of English (created at the University of Birmingham and now part of Collins Word Web, http://www.collins.co.uk/books.aspx?group=140-) are from the first release of the corpus in 1985.
Representing Mother Tongue or Other Tongue
In the collocation representative text, representative is a ‘downward collocate’ (Sinclair, 1991) of text, as the word text is used more frequently than representative, especially in the British National Corpus. The implicit, and sometimes articulated, assumption is that representative texts are written in Standard English. The post-WWII boom in English studies, both within the UK and the world over, led to an examination of how the language was taught to native speakers and to others for work, rest and play (English as a Foreign Language, English as a Second Language, English for Specific Purposes and English Special Language). The teaching material came under close scrutiny as the ‘wind of change’ (a phrase coined by the late Harold Macmillan) blew across continents; according to Kachru (1995), it was no longer tenable to hold on to a notion of institutional English, and the focus shifted to performance. Are we seeing this shift in the greater emphasis on informative texts in the recent corpora of English? But before we explore this question let us return to post-Imperial Britain and scholars of the English language. The ‘inadequate’ state of material ‘to serve as a basis of even elementary teaching-grammars’, especially in the teaching of English as a second language, was highlighted notably by Quirk. He also commented that the writers of teaching materials have ‘still to rely upon their own uncertain impression of what is normally written or spoken by the educated (and therefore safely imitable) native speaker, and some are emboldened by the lack of reliable information to continue prescribing according to their own predilections’ (Quirk, 1968: 69, emphasis added). So a copious record was created in London and was called the Survey of Educated English Usage; this record was subsequently digitised. ‘Educated English’ was tentatively defined as ‘English that is recognized as such by educated native English speakers’ (Quirk, 1968: 79).
Perhaps aware of the circularity of his definition, Quirk (1968: 80) goes on to elaborate educated English as comprising ‘the whole range [. . .] from learned and technical writing to the most spontaneous colloquial English’. The Survey was subsequently known simply as The Survey of English Usage. The term native speaker is geographically constrained and, in the context of English, the default native speaker is a British English native speaker. However, corpus builders have taken into account the rise of American English. Currently, the proponents of the American National Corpus are collecting data largely in American English. The International Corpus of English, related by antecedence to the Survey, does contain Other Englishes. The corpus was expected to help in ‘making linguistic statements which take into account of both form and meaning’ by exploring the ‘indissociable complex of semantic analogy and grammatical analogy’ (Quirk, 1968: 83–84). Traditionally, grammarians and lexicographers have examined quantities of recorded material, but only ‘sporadically’, and only exceptional patterns have been carefully analysed, without reference to the great framework of patterns into which the exceptional patterns of usage may fit. The Survey of English Usage was designed to contain ‘copious materials, made up of continuous stretches or ‘texts’ taken from the full range of co-existing varieties and strata of educated English, [spoken and written], at the present time. For each stretch of material, account must be taken of all the grammatical data, distinguishing between the normal and variant type of each constructional type, and observing which constructions occur with which other constructions’ (Quirk, 1968: 78). Many years later, Quirk (1995) was still robust in his defence of native-speaker-based Standard English (NSBSE): ‘any local variety, and especially one of uncertain stability, will be of diminishing usefulness in contrast to the [NSBSE] with its world-wide currency’. It is not clear whether Quirk thought the NSBSE was the North American or the British English variety when he talked about worldwide currency. Braj Kachru (1992) was one of the proponents of Other Englishes, especially the post-Imperial varieties. For him there are many varieties of English that are distributed geographically, representing ‘both the institutionalised varieties (for example, Nigerian English, Kenyan English) and the performance varieties (Chinese English, Japanese English)’ (Kachru, 1992: 8). The term ‘institutionalised variety’ is used as a code for varieties that had evolved during the 18th–19th-century British colonial expansion, and ‘performance variety’ is used for national varieties that generally evolved through trade and commerce.
One can, if one wishes, trace the institutional and performative varieties to controversial notions of ‘attitude and power’. The rise of American English from ‘a colonial substandard to a prestige language’ was predicted in the 18th century because ‘the increasing population in America, and their universal connection and correspondence with all nations . . . force their language into general use’ (Kahane, 1992: 212, citing John Adams). The quest for Standard English and the ability of scholars to find the evidence for such a language through a representative corpus has since entered the mainstream of English dictionary making. Della Summers (1993: 189–190) has argued, for instance, that ‘representative is what we judge to be typical and central aspects of language and providing enough occurrences of words and phrases for the lexicographers and other students of language, to believe that they have sufficient evidence from the corpus to make accurate statements about lexical behaviour’.
Arguments for and against Standard English notwithstanding, the ‘winds of change’ did blow also for corpus linguistics: the late Sidney Greenbaum’s International Corpus of English (a corpus of English from East Africa, Hong Kong, India, Jamaica, New Zealand, the Philippines, Singapore and Wales) is partly based on the Lancaster-Oslo/Bergen corpus model, with 2000 words per text, and partly on the notion of full texts. Let us now survey some of the major corpora of the English language that have been used in creating dictionaries and teaching material.
Representing representative American English: ‘Habeas corpus’
The first reported digital text corpus was the Brown (University) Corpus, christened by the pioneers, Nelson Francis and Henry Kučera (1979), as a Standard Sample of Present-day American English. This corpus was created during the 1960s, when there were considerable computational limitations on how much data could be stored inside a computer. The pioneers were at pains to suggest that the term ‘standard’ should be interpreted in its strict statistical sense and not in its sociolinguistic sense: ‘Samples were chosen for their representative quality rather than for any subjectively determined excellence. The use of the word standard in the title of the Corpus does not in any way mean that it is put forward as ‘standard English’; it merely expresses the hope that this corpus will be used for comparative studies where it is important to use the same body of data’ (Francis & Kučera, 1979, emphasis added). The 1-million-word corpus was the largest collection of texts captured on a magnetic tape; anecdotal evidence suggests that when the Brown team presented the magnetic tape to Randolph Quirk the term they used was habeas corpus (Svartvik, 2006). To emphasise that the term ‘standard’ was not a subjective, socially discriminating term, the corpus builders introduced another statistical term: random sampling.
A random sample is chosen from a universe of objects under study; one can choose a library catalogue or a directory of books-in-print as the ‘universe’. Each text in the universe is assigned a unique number, a pseudo-random number generator is used to generate sequences of numbers, and these numbers are used, in turn, to select a book with a given number. The corpus builders wanted different genres and writing styles to be represented in the corpus: for some corpus builders, romantic fiction is very important, for others drama may be more important, and for yet others learned journals may have preference. In order to have a degree of generality, and perhaps following traditional divisions in genre analysis, Francis and Kučera introduced two text types: informative and imaginative. The creators of the Brown University Corpus selected their texts by randomly selecting titles and articles from catalogues and listings of various sorts (Table 5.1). Passages of 2000 words were then selected from each of the selected titles.

Table 5.1 The Brown Corpus ‘universe’ of texts (Francis & Kučera, 1979)
Genre | Source
Newspapers | Listing, microfilmed collection, New York Public Library; Providence Athenaeum
Periodicals: popular lore and skills and hobbies | A catalogue of a second-hand magazines store, New York
Books | Brown University Library Catalogue

Table 5.2 shows a detailed breakdown of the 15 different text types grouped into two categories, informative and imaginative. The balance between imaginative and informative texts is 1:3; 25% of the Brown Corpus comprises imaginative texts and 75% informative texts. Amongst the informative texts, the Learned genre (J) and the three (sub-)genres grouped under Press (A, B & C) have roughly equal shares, ca. 17%. The belles lettres etc. (G) and popular lore (F) comprise a quarter of all the texts. The learned texts comprise texts from: Humanities (18 in all), Political Science, Law & Education (15), Social and Behavioural Sciences (14), Natural Sciences (12), Technology and Engineering (12), Medicine (5) and Mathematics (4); there is a balance of sorts between ‘softer subjects’, 47 texts of 2000 words each for the first three subgenres, and 33 for science, engineering, technology and medicine. The Brown corpus makers excluded verse, drama, and fiction excerpts that comprised more than 50% dialogue. Verse ‘presents special linguistic problems different from those of prose’; and drama is ‘the imaginative recreation of spoken discourse, rather than true written discourse’ (Francis & Kučera, 1979).
Representing ‘mirrored’ British English: The LOB Corpus
The Lancaster-Oslo/Bergen (LOB) Corpus¹ was aimed at a general representation, for research purposes, of a broad range of text types selected from four ‘media’: books, newspapers, periodicals and government documents (Table 5.3). The LOB corpus was a ‘mirror’ corpus for the already established Brown Corpus; the LOB comprised 500 texts of 2000 words each written in British English to reflect the American English corpus.
One of the aims was to find systematic differences between British and American English; some results of this inquiry were encouraging in that systematic differences in spelling were found but at other levels of linguistic description the inquiry was less conclusive.
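The random-selection procedure described above for the Brown Corpus (number every title in the ‘universe’, draw pseudo-random numbers, then take a passage of 2000 words from each selected title) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalogue, the function names and the choice of a random starting point for each passage are hypothetical and are not taken from Francis and Kučera’s manual.

```python
import random


def sample_titles(universe, n_titles, seed=0):
    """Draw n_titles distinct titles from a numbered 'universe' of texts."""
    rng = random.Random(seed)                       # seeded pseudo-random generator
    numbered = dict(enumerate(universe, start=1))   # unique number for each title
    drawn = rng.sample(sorted(numbered), n_titles)  # distinct random numbers
    return [numbered[number] for number in drawn]


def sample_passage(words, length=2000, seed=0):
    """Return one contiguous passage of at most `length` words.

    Assumption: the passage starts at a random position; the source does
    not say how the compilers placed the passage within a title.
    """
    if len(words) <= length:
        return list(words)
    start = random.Random(seed).randrange(len(words) - length + 1)
    return list(words[start:start + length])


# Hypothetical catalogue standing in for a library listing.
catalogue = [f"title-{i:04d}" for i in range(1, 1001)]
selected = sample_titles(catalogue, n_titles=5, seed=42)

# Hypothetical 10,000-word text; real input would be a tokenised title.
text = [f"w{i}" for i in range(10_000)]
passage = sample_passage(text, length=2000, seed=42)
```

Fixing the seed makes the draw repeatable, which matters if a ‘mirror’ corpus such as LOB is to follow the same procedure on a different universe of titles.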
Table 5.2 The composition of the Brown Corpus
Category | Genre (Code) | No. of texts | Total tokens | %
Informative | Learned (J) | 80 | 160,000 | 16.0
Informative | Belles Lettres, biography, memoirs, etc. (G) | 75 | 150,000 | 15.0
Informative | Popular lore (F) | 48 | 96,000 | 9.6
Informative | Press: reportage (A) | 44 | 88,000 | 8.8
Informative | Skills and hobbies (E) | 36 | 72,000 | 7.2
Informative | Miscellaneous (H) | 30 | 60,000 | 6.0
Imaginative | General fiction (K) | 29 | 58,000 | 5.8
Imaginative | Adventure and Western fiction (N) | 29 | 58,000 | 5.8
Imaginative | Romance and love story (P) | 29 | 58,000 | 5.8
Informative | Press: editorial (B) | 27 | 54,000 | 5.4
Imaginative | Mystery and detective fiction (L) | 24 | 48,000 | 4.8
Informative | Press: reviews (theatre, books, music, dance) (C) | 17 | 34,000 | 3.4
Informative | Religion (D) | 17 | 34,000 | 3.4
Imaginative | Humour (R) | 9 | 18,000 | 1.8
Imaginative | Science fiction (M) | 6 | 12,000 | 1.2
Total | | 500 | 1,000,000 | 100.0
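The proportions quoted in the discussion of the Brown Corpus can be re-derived from the text counts in Table 5.2. The following is a minimal sketch, assuming the nominal 2000 words per text; the function and variable names are illustrative only.

```python
from collections import defaultdict

WORDS_PER_TEXT = 2000  # nominal passage length in the Brown Corpus

# (category, genre code, number of texts), copied from Table 5.2
BROWN = [
    ("Informative", "J", 80), ("Informative", "G", 75), ("Informative", "F", 48),
    ("Informative", "A", 44), ("Informative", "E", 36), ("Informative", "H", 30),
    ("Imaginative", "K", 29), ("Imaginative", "N", 29), ("Imaginative", "P", 29),
    ("Informative", "B", 27), ("Imaginative", "L", 24), ("Informative", "C", 17),
    ("Informative", "D", 17), ("Imaginative", "R", 9), ("Imaginative", "M", 6),
]


def share_by_category(rows):
    """Percentage of texts in each category (informative / imaginative)."""
    texts = defaultdict(int)
    for category, _code, n_texts in rows:
        texts[category] += n_texts
    total = sum(texts.values())
    return {category: round(100 * n / total, 1) for category, n in texts.items()}


def share_of_codes(rows, codes):
    """Percentage of texts whose genre code is in `codes`."""
    total = sum(n for _category, _code, n in rows)
    picked = sum(n for _category, code, n in rows if code in codes)
    return round(100 * picked / total, 1)


total_tokens = sum(n for _category, _code, n in BROWN) * WORDS_PER_TEXT
by_category = share_by_category(BROWN)       # informative vs imaginative
press = share_of_codes(BROWN, {"A", "B", "C"})
learned = share_of_codes(BROWN, {"J"})
lore_and_belles_lettres = share_of_codes(BROWN, {"F", "G"})
```

On these counts the imaginative genres account for 126 of the 500 texts, the three Press genres together for 88 texts and the Learned genre for 80, which is the basis for the ‘roughly equal share’ remark.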
Table 5.3 The genre and source of texts in the LOB corpus
Genre | Source
Newspaper | Willing’s Press Guide (1961)
Periodical | Willing’s Press Guide (1961)
Books | The British National Bibliography Cumulated Subject Index (1960–64)
Govt. documents | Catalogue of Government Publications (1961: London, HMSO)
For the Press categories A & B in Brown, the LOB team followed the American team; national (or mainly London-based) newspapers comprised 60% of all texts in these categories and regional newspapers made up 40%. Category C (Press Reviews etc.) in LOB was structured in favour of the national press, with a high representation of ‘quality’ Sunday papers and the inclusion of the Times Literary Supplement and the Times Educational Supplement on the basis of the importance of these in review writing. The compilers of LOB appear to have depended on published catalogues for newspapers and periodicals, and for books they used the British National Bibliography; the Brown compilers relied on their local libraries for newspapers and books, and for the periodicals the source used was a ‘local second hand magazine store’. I have used the anecdotal categories of newspapers: red tops (popular newspapers, for example, the Daily Express and the now defunct Daily Worker); middle brow newspapers, for instance the Daily Mail; and quality papers (this list includes The Guardian, The Times and perhaps the Daily Telegraph, and their Sunday varieties). I have classed all provincial morning papers as middle brow and the evening papers as red top. The 88 newspaper excerpts of about 2000 words each, including reportage (44), editorials, op-eds and letters to the editors (27) and reviews (17), comprise about 176,000 words of the 1-million-word LOB corpus. The red tops account for 42% of the three Press categories (A, B, C), middle brows for 35% and the quality newspapers for 23% of the 176,000 or so words. Newspaper circulation may not be a good indicator of the stratification of a society, but the LOB division of 58% (35% + 23%) to 42% shows how the quality and middle brow papers make up 3/5 of the text and the red tops the remaining 2/5; this may fail to reflect circulation figures but perhaps does not quite reflect the British demographics even in the 1960s (Table 5.4).
Table 5.4 Comparison of the composition of the newspaper subcorpora in three different corpora: Lancaster-Oslo/Bergen (LOB), British National (BNC) and Bank of English
Newspaper type | LOB (1961) | BNC (1995) | Bank of English (1993) | Bank of English (2001)
Red top | 41.9% | 7.79% | | 25.9%
Middle brow | 34.8% | 59.90% | 35.7% | 0.0%
Quality | 23.3% | 32.31% | 64.3% | 74.1%
Representative English for the learner: The Birmingham Bank of English
The Birmingham Bank of English was compiled under the guidance of the late John Sinclair, in close collaboration with Collins Publishers, and served as a source of ‘sufficient and relevant textual evidence’ (Renouf, 1987: 1) for the production of ‘the first wholly new dictionary for many years’ (Sinclair, 1987: vii): a dictionary not based solely on the introspection of lexicographers and their advisers but based rather on how authors of a wide variety of texts (and speakers partaking in conversation and delivering speeches) use words and phrases. The first release of the Bank of English corpus (in 1993) contained 20 million words of current English. The figure is now around 500 million words and the corpus is refreshed at regular intervals. The focus of the Bank compilers was on texts published between 1960 and 1985, the team preferring general language text rather than texts written in ‘technical language’. The compilers, with advice from teachers of English in the UK and abroad, and from British Council Libraries across the world, selected texts themselves. This method of text selection was different from the random-selection approach used for the LOB corpus, where titles were randomly selected from bibliographies and so on. However, in order to check the ‘relevance’ and ‘influence’ of a given text, the Bank compilers regularly checked bestseller lists in newspapers as well as catalogues from leading publishers. The text in the first Bank of English corpus is not split along LOB’s informative/imaginative axis; rather, the textual ‘medium’ is taken as a basis of text classification: books, newspapers, magazines, brochures and leaflets, and personal correspondence are used to define the text typology. In the first release in 1993, the book variety dominated the corpus, contributing around 214 texts, with the rest contributing 70 texts.
The book variety is subdivided into fiction and non-fiction (‘imaginative’ and ‘informative’ perhaps?) with the former dominating and contributing 177 texts out of 214. The authorship of books is 75% male and 25% female; British English texts account for 70% of texts, American English for 20% and the rest comprise other varieties including Australian English. The Bank comprises a range of topics from American Indians to Vietnam, from childcare to sex through to the Third World, and myths and cults to natural history. The size of the corpus increased from 121 million words in 1993 to 448 million words in 2001. The close relationship with the owners of HarperCollins, News International Corporation, has added substantially to the corpus, notably increasing the holdings of American English texts. The material has been revised and there is a drop in the book genre, which comprises less than 1 in 5 texts in the 2001 corpus as compared to the 2 in 3 texts in 1993; the contribution of the newspaper genre has nearly doubled between 1993 and 2001 (see Table 5.5). The data in the Bank are pruned regularly and more data are added: old newspapers are removed and newer issues are introduced. The change in the kind of newspaper is quite interesting over the years, as shown in Table 5.6. The middle-brow, now-defunct Today has been replaced by two red tops, The Sun and the Sunday News of the World; and the high-brow American English Wall Street Journal has made way for the British English Guardian; the politics of the two newspapers could not be more different.

Table 5.5 The rise of the ‘newspapers’ and the fall of ‘books’ in the constitution of the Bank
Text type/genre | 2001 | 1993
Total no. of tokens in millions | 448 | 121
Newspaper | 45% | 26%
Books | 17% | 39%
Magazines (Sp) | 15% | 7%
Radio | 9% | 25%
Spoken | 5% | 3%
Miscellaneous | 4% |
Business | 2% |
Ephemera | 2% | 1%
Academic | 1% |
Total | 100% | 100%
Table 5.6 Changes in the distribution of newspapers in the Bank of English
Title | 2001 | 1993
Total no. of tokens in millions | 202 | 31
Times | 26% | 32%
Sun and News of the World | 22% |
Australian newspapers | 17% | 16%
Guardian | 16% |
Independent | 14% |
American newspapers | 5% |
Today* | | 32%
Wall Street Journal | | 19%
Total | 100% | 100%
*Today newspaper ceased publication in 1995

Representative English for the learner: The Longman/Lancaster English Language Corpus
The motivation for creating the Longman/Lancaster Corpus was to provide lexicographers and linguists with ‘an entirely new, conceived from scratch, corpus of English that could serve a number of purposes and be organised according to objective criteria’ (Summers, 1991: 1). The Longman/Lancaster team acknowledges the influence of Geoffrey Leech and Douglas Biber. The primary purpose of this 30-million-word corpus was ‘to provide an objective source of language data from which reliable linguistic judgments about the meaning and typical behaviour of words and phrases can be made as a basis for dictionaries, grammars and language books of all kinds’ (Summers, 1991: 3). What distinguishes the Longman/Lancaster Corpus from the LOB and the Brown Corpus is that the former is ‘topic driven’ whilst the latter are ‘genre driven’. LOB distinguishes ‘academic discourse’ and ‘press reportage’, ‘press editorial’ from ‘arts’, and so on. The ‘topic-driven’ texts in the Longman/Lancaster Corpus are categorised in 10 superfields (Summers, 1991) (see Table 5.7). The lexicographic argument for choosing the topic-based approach was that ‘it was more likely to produce text categories that were lexically homogenous’ (Summers, 1991: 7). The Longman/Lancaster Corpus contains ‘drama’ as a genre within the fiction superfield; drama was not included by the Bank of English compilers, suggesting that drama is not an example of ‘naturally occurring text[ . . .]’. The Longman/Lancaster corpus contains roughly equal samples of English dating back to 1900–49 and 1950–69 (30% each) and 40% of the corpus content is post-1970s. Texts in Longman/Lancaster are divided into meta-categories, informative and imaginative, subdivided into the superfields. As in the LOB corpus, informative texts comprise books, newspapers and journals, unpublished material and ephemera, whereas imaginative texts are mainly works of fiction in book form. There are four ‘external factors’ that form the basis of text categorisation in Longman/Lancaster: ‘region’, including language varieties; ‘time’, a diachronic corpus containing text published between 1900 and the 1980s; ‘medium’, including the ‘sources’ of texts: books (80%), periodicals (13.3%) and ephemera (6.7%); and finally, the ‘level’ of text. For informative texts there are three levels: ‘technical’, ‘lay’ and ‘popular’. Similarly, the imaginative texts are divided into ‘literary’, ‘middle’ and ‘popular’. The text external features used in the compilation of Longman/Lancaster include authors’ gender and country of origin, number of words in total and title of the text. Most texts in Longman/Lancaster are about 40,000 words long. No whole texts were included because the ‘emphasis was on many sources rather than the completeness of texts’ (Summers, 1991) (cf. the Bank of English, where texts are ca. 70,000 words long and include a number of whole texts). The Longman/Lancaster corpus design is such that half of the 30 million words (ca. 15 million) are derived from carefully selected texts, the ‘selective texts’, and the other half from randomly selected individual titles, collectively known as the ‘microcosmic texts’. As with the LOB corpus, the Longman/Lancaster team used a book catalogue, Whitaker’s Books in Print, and selected texts originally published in English (in English-speaking countries) from 1900 onwards. The selective texts are generally well known texts, whereas the microcosmic texts include technical texts that are not well known.

Table 5.7 Distribution of texts in the Longman/Lancaster Corpus
Topics | %
Fiction | 40.0
Social sciences | 14.1
World affairs | 10.9
Arts | 7.9
Natural/pure sciences | 6.0
Leisure | 5.7
Beliefs and thoughts | 4.7
Commerce and finance | 4.4
Applied sciences | 4.3
The British National Corpus
The BNC was ‘designed to characterise the state of contemporary British English in its various social and generic uses’; the BNC does not necessarily ‘provide a reliable sample for any particular set of such criteria’ (Aston & Burnard, 1998: 28). The BNC compilers chose three features for the written texts within the corpus (the BNC comprises 90% written texts and 10% speech excerpts): subject fields, publication date (or equivalent), and media. There are five classes of media: books, periodicals (newspapers and magazines), miscellaneous published documents, miscellaneous unpublished documents, and written-to-be-spoken texts. For each text in the corpus the designers have attached a ‘Level’ of writing (a subjective measure of reading difficulty): the more literary or technical a text, the ‘higher’ its level. The levels are ‘Low’, ‘Medium’ and ‘High’. Just under 10% of the texts in the BNC are whole texts; ca. 50% are taken from the beginning, middle or end of texts; and for just under a third there is no information. This corpus is one of the most extensively used corpora and has been discussed by many authors. The design team included major dictionary publishers (Oxford University Press, Longman and Chambers-Larousse) and the University of Lancaster. Oxford University Computing Service maintains the corpus. Randolph Quirk was closely involved with the BNC project. I will briefly look at the two main categories of texts in the BNC, books and periodicals, that comprise just under 90% of the written corpus. The selection of texts is perhaps indicative of how the corpus compilers selected texts that are expected to ‘reflect the widest possible variety of users and uses of the language’ (Aston & Burnard, 1998). My observations are not based on a close reading of the texts but rather on what I regard as the key attributes of a given text: (i) the sources; (ii) the authorship; (iii) the ‘level of difficulty’; and (iv) the diversity of texts in different media, domains and genre.
I have benefited from David Lee’s work on helping researchers through the ‘BNC jungle’ a maze of overlapping terminology and what he calls ‘broad and misleading’ classification of the written word and the various areas of human endeavour and knowledge (Lee, 2001). The data in Lee is not exactly the same as is shown on the BNC web pages: there are 4055 texts in Lee’s Indexer rather than 4124 texts. Publications related to the BNC suggest that the ‘book’ section of the corpus comprises over 52 million tokens but Lee’s database comprises 49.62 million tokens only. (Lee’s index refers to 951 informative texts and 456 imaginative texts making a total of 1407 texts the BNC appears to have 1422 texts in the ‘book’ category). I have used Lee’s database to look at the distribution of 49,623,530 tokens in 1407 ‘book’ type of texts in the BNC. The BNC does not have an explicit informative category, rather there are specialist texts covering world affairs, sciences (applied, natural and social), beliefs & thought and commerce & finance. There are texts on leisure as well. I will call these texts informative. Then there is a separate class of book excerpts that have been
assigned the imaginative genre. The ratio of informative to imaginative book excerpts in the BNC is about 2:1 (see Table 5.8). The average length of the texts in the two genres is the same (about 35,270 tokens), but there is a greater variation in the length of the imaginative texts: the shortest text excerpt comprises 931 tokens and the longest comprises 84,656 tokens (std dev. = 13,674). The variation in the length of the informative book excerpts is smaller: the shortest text is 631 tokens and the longest 53,295 tokens (std dev. = 8460).

Table 5.8 The distribution of genres in the BNC with respect to size

Genre         No. of tokens   No. of texts   Average no. of tokens   Std dev.
Informative   33,539,831      951            35,268                  8460
Imaginative   16,083,699      456            35,271                  13,674
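The averages and the 2:1 ratio quoted for Table 5.8 follow directly from the token and text counts. The short Python sketch below recomputes them; the dictionary layout and function name are illustrative choices and are not part of the BNC or of Lee’s index.

```python
# Token and text counts for the 'book' texts, copied from Table 5.8.
book_texts = {
    "Informative": {"tokens": 33_539_831, "texts": 951},
    "Imaginative": {"tokens": 16_083_699, "texts": 456},
}


def mean_length(genre):
    """Average number of tokens per text excerpt in the given genre."""
    entry = book_texts[genre]
    return entry["tokens"] / entry["texts"]


# Size of the whole 'book' section and the informative:imaginative ratio.
total_tokens = sum(entry["tokens"] for entry in book_texts.values())
token_ratio = (
    book_texts["Informative"]["tokens"] / book_texts["Imaginative"]["tokens"]
)

for genre in book_texts:
    print(f"{genre}: {mean_length(genre):,.0f} tokens per text")
print(f"Total tokens: {total_tokens:,}")
print(f"Informative to imaginative ratio: {token_ratio:.2f} : 1")
```

The standard deviations cannot be recomputed in this way, since they depend on the lengths of the individual texts rather than on the totals.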
Books: Imaginative texts

Genre distribution

The BNC designers have included poetry and drama texts, but in a rather limited way: over 98% of the tokens in the 456 text excerpts come from prose texts; there are only 216,544 tokens taken from poetry, which contributes only 1.3% to the imaginative part of the ‘book’ section of the corpus. (There are 2 excerpts from the genre of drama, comprising about 45,000 tokens and making up about a quarter of 1%.) It is not clear how these will make the BNC representative unless poetry only accounts for 4 words in every 300 written in English; the drama genre has the same ‘representation’ as the ‘miscellaneous’ texts. This is perhaps unkind to dramatists writing in British English like Harold Pinter and Tom Stoppard (see Table 5.9 for details).

Table 5.9 Distribution of imaginative texts by genre in the BNC

Genre             No. of tokens   No. of texts   Average no. of tokens   Std dev.
Fiction: prose    15,774,302      424            37,204                  12,036
Fiction: poetry   216,544         27             8020                    4504
Miscellaneous     47,096          3              10,766                  9293
Fiction: drama    45,757          2              22,879                  1285
Total             16,083,699      456            35,271                  13,674
The average length of the texts in the imaginative genre is 35,271 tokens, with a standard deviation of 13,674. This large standard deviation is due (a) to the shorter texts in poetry and (b) to a considerable variance in length in the ‘prose’ genre itself.

Authors’ gender and level of difficulty
One current way of judging representativeness in society at large is to ascertain how well women are represented in a given walk of life. The choice of prose text excerpts in the imaginative-book category in the BNC appears to be quite enlightened given that the design of the corpus began in 1991: female authors appear to have written just over half of the 16 million tokens of the prose whereas male authors have authored 45% of the texts; the women’s writing in fact exceeds the men’s contribution by over a million tokens. As well as the gender distribution, the ‘level of difficulty’ that may be required of an ideal reader can also reveal something about the nature of texts in a corpus. Twelve in 100 prose excerpts are difficult to read and 47 out of 100 texts have a ‘medium’ level of difficulty. The easy-to-read texts (referred to as ‘low’ by the BNC designers) account for 41% of the prose excerpts (see Table 5.10 for details).
Table 5.10 Distribution of texts by authors’ gender and ‘level of reading’ difficulty in the imaginative prose category used by the BNC

                      Drama     Poetry    Prose        Miscellaneous   Total
No. of tokens         45,757    216,544   15,774,302   47,096          16,083,699

Gender distribution
Male                  52%       64%       45%          21%             45%
Female                48%       30%       52%          56%             52%
Others                0%        6%        3%           23%             3%
Total                 100%      100%      100%         100%            100%

‘Level of writing’
High                  0%        40%       12%          0%              12%
Medium                100%      55%       47%          0%              47%
Low                   0%        5%        41%          100%            40%
Total                 100%      100%      100%         100%            100%
Table 5.11 Distribution of texts within the Imaginative BNC by publishers

Rank    Publisher                 No. of tokens   No. of texts
1       Mills & Boon              2,600,602       49
2       Headline Book Pub. plc    1,125,059       28
3       Fontana Paperbacks        880,827         22
4       Corgi Books               837,966         21
5       Penguin Books             773,397         20
6       Faber & Faber Ltd         686,644         21
7       OUP                       490,725         31
8       Oxford Bookworms (OUP)    386,885         29
9       Pan Books Ltd             353,098         9
10      Victor Gollancz           337,406         8
Total                             8,472,609       238
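The concentration shown in Table 5.11 can be checked with a few lines of Python. This is only an illustrative sketch: the tuple layout and variable names are my own, and the subcorpus totals (16,083,699 tokens in 456 texts) are taken from Table 5.9.

```python
# (publisher, tokens, texts) for the ten largest contributors, from Table 5.11.
top_publishers = [
    ("Mills & Boon", 2_600_602, 49),
    ("Headline Book Pub. plc", 1_125_059, 28),
    ("Fontana Paperbacks", 880_827, 22),
    ("Corgi Books", 837_966, 21),
    ("Penguin Books", 773_397, 20),
    ("Faber & Faber Ltd", 686_644, 21),
    ("OUP", 490_725, 31),
    ("Oxford Bookworms (OUP)", 386_885, 29),
    ("Pan Books Ltd", 353_098, 9),
    ("Victor Gollancz", 337_406, 8),
]

# Size of the whole imaginative 'book' subcorpus (Table 5.9).
SUBCORPUS_TOKENS = 16_083_699
SUBCORPUS_TEXTS = 456

top_tokens = sum(tokens for _name, tokens, _texts in top_publishers)
top_texts = sum(texts for _name, _tokens, texts in top_publishers)
token_share = top_tokens / SUBCORPUS_TOKENS
text_share = top_texts / SUBCORPUS_TEXTS

# Oxford University Press together with its Oxford Bookworms imprint.
oup_rows = [row for row in top_publishers if "OUP" in row[0]]
oup_tokens = sum(tokens for _name, tokens, _texts in oup_rows)
oup_texts = sum(texts for _name, _tokens, texts in oup_rows)

print(f"Top 10: {top_texts} texts ({text_share:.1%}), "
      f"{top_tokens:,} tokens ({token_share:.1%})")
print(f"OUP and Oxford Bookworms: {oup_texts} texts, {oup_tokens:,} tokens")
```

The same pattern would apply to the informative books, given the corresponding publisher counts.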
Diversity of publishers
A look at the publishers shows that just over half of the 456 books (approximately 8.5 million tokens) were published by 10 publishers, 8 based in London and 2 in Oxford; women’s romantic fiction publisher Mills & Boon accounts for 49 books comprising 2.6 million tokens, and Oxford University Press and its sister publisher Oxford Bookworms account for 60 of the books, contributing about 0.88 million tokens of the imaginative texts in the BNC (see Table 5.11). In total 105 publishers can be discerned for the 456 texts and just under a half (51 publishers) contributed one text only. The total number of tokens in the imaginative books subcorpus of the BNC is 16 million, and half of that is from the top 10 publishers located in Southern England (see Table 5.12).

Books: Informative texts

Genre distribution
The informative books within the BNC, 33.5 million tokens from 951 books, can be divided into three major subject areas: (i) world affairs, arts, leisure, and thought and belief: 470 books or book excerpts were used, comprising 16.46 million tokens (49% of all the informative texts within the corpus by tokens), and world affairs dominates this category with a 24% contribution (8 million tokens); (ii) 385 science books were used
Table 5.12 The most widely and least widely used publishers in the imaginative book category of the BNC

Number of publishers contributing     Number of publishers   No. of texts   No. of tokens
40 or more texts                      1                      49             2,600,602
30 or more but less than 40 texts     1                      31             490,725
20 or more but less than 30 texts     5                      121            3,917,381
10 or more but less than 20           1                      20             773,397
5 or more but less than 10 texts      16                     92             3,519,617
4 texts exactly                       7                      28             991,259
3 texts exactly                       12                     36             1,186,689
2 texts exactly                       14                     28             1,028,541
1 text exactly                        51                     51             1,575,488
Total                                 105                    456            16,083,699
contributing 41% of the total (13.78 million tokens); texts from social science books dominate this category, amounting to a total of 28% of the 33.5 million tokens; (iii) commerce books comprise just under 10% of the informative texts with a total contribution of 3.29 million tokens (Table 5.13 has the details). The average length of text in these different genres is approximately the same, between 32,784 and 36,790 tokens; only the texts in ‘commerce’ have a higher variance in the length of texts than other genres (average 34,269 ± 10,234, as opposed to the overall average of 35,268 ± 8460).

Author’s gender, level of difficulty and target readership
The target reader of the informative books was a person who is used to reading material with a high or medium level of difficulty. Female authors are not as well represented as in the imaginative book selection in the BNC; male authors outnumber female ones by a factor of 4. The level of reading difficulty is much higher here in that 90% of the texts, about 27 million tokens, have a medium-to-high level of difficulty (41% with a high level of difficulty and 49% with a medium level of difficulty). Texts with
Table 5.13 Distribution of texts in the BNC informative book subcorpus according to domains

Domain             No. of tokens   %       No. of texts   Average   Std. dev.
Social sciences    9,418,206       28.1    256            36,790    7494
Natural sciences   2,260,888       6.7     66             34,256    9267
Applied sciences   2,107,679       6.3     63             33,455    8499
World affairs      8,126,462       24.2    230            35,332    8537
Arts               2,903,992       8.7     80             36,300    8178
Leisure            2,852,169       8.5     87             32,784    8418
Belief & thought   2,580,574       7.7     73             35,350    7465
Commerce           3,289,861       9.8     96             34,269    10,234
Total              33,539,831      100.0   951            35,268    8460
low levels of reading difficulty account for only 10% of all the 951 texts in the informative book excerpt part of the BNC (see Table 5.14 for more details).

Diversity of publishers
The 951 books included in the informative books category were published by 248 publishers; this may give the impression that each publisher has contributed between 3 and 4 texts. But, much like the imaginative texts, just under half of the 33.5 million tokens in the informative category (418 books out of 951) were published by 10 publishers, with Longman Group and Oxford University Press accounting for 22% of the texts by token. The top 10 list is not as London-centric as is the case for imaginative texts: 5 out of 10 publishers are based in and around London (see Table 5.15). Only one informative text was selected from each of 144 publishers (see Table 5.16 for details).

Periodicals: newspapers

The coverage of newspapers in the BNC corpus is an interesting one: the national dailies, all London-based newspapers, comprise a third (32%) of the ‘medium’; two London-based tabloids (the defunct
Table 5.14 Target readership/authorship, gender and level of difficulty in the informative books category

                      World affairs, leisure,        Sciences     Commerce    Total
                      belief & thought, and arts
No. of tokens         16,463,197                     13,786,773   3,289,861   33,539,831

Target readership
Academic              72%                            66%          98%         72%
Nonacademic           28%                            34%          2%          28%

Gender distribution
Male                  69%                            59%          64%         68%
Female                19%                            16%          5%          17%
Unknown & mixed       11%                            25%          31%         15%
Total                 100%                           100%         100%        100%

‘Level of writing’
High                  37%                            52%          50%         41%
Med                   50%                            42%          45%         49%
Low                   13%                            5%           4%          10%
Total                 100%                           100%         100%        100%
Today, classified as ‘other’, and the Daily Mirror) contribute 21% of the text, with the other provincial dailies accounting for 46% of the texts (see Table 5.17).

Genre distribution

The BNC compilers have divided the texts in the national and provincial newspapers into eight categories: (i) reportage; (ii) editorial; (iii) to (vii) features on sports, social affairs, commerce, the arts and the sciences; and (viii) miscellaneous. The texts extracted from the tabloid are not categorised. The features dominate the newspaper corpus, followed closely by reportage and editorial (see Table 5.18). A detailed analysis shows that reportage (36%), sports features (14%) and social news (13%) dominate the newspaper subcorpus of the BNC.
Table 5.15 Distribution of texts within the informative books subcorpus of the BNC by publishers

Rank    Publisher                    No. of tokens   No. of texts   %
1       Longman Group UK Ltd         3,963,036       116            11.8
2       OUP                          3,518,701       98             10.5
3       Routledge & Kegan Paul plc   2,116,279       58             6.3
4       Blackwell Scientific Pub     1,160,578       30             3.5
5       Macmillan Education Ltd.     1,114,056       30             3.3
6       Cambridge University Press   1,020,765       27             3.0
7       Hodder & Stoughton Ltd.      814,451         21             2.4
8       Open University Press        605,231         16             1.8
9       Penguin Books                416,834         11             1.2
10      Basil Blackwell Ltd.         407,414         11             1.2
Total                                15,137,345      418            45.1
Table 5.16 The most widely and least widely used publishers in the informative book category of the BNC

Number of publishers contributing     Number   No. of texts   No. of tokens
100 or more texts                     1        116            3,963,036
90 or more but less than 100 texts    1        98             3,518,701
50 or more but less than 90 texts     5        58             2,116,279
30 or more but less than 50 texts     1        108            4,109,850
10 or more but less than 30 texts     16       49             1,734,184
5 or more but less than 10 texts      30       151            6,932,689
4 texts exactly                       20       84             3,052,243
3 texts exactly                       12       36             1,204,074
2 texts exactly                       31       62             2,136,614
1 text exactly                        144      144            4,772,161
Total                                 248      951            33,539,831
Table 5.17 Distribution of individual newspapers in the BNC

Newspaper type   Name                            No. of tokens   %
Other            Today                           1,248,922       13.36
Other            The Scotsman                    1,238,831       13.26
National         The Daily Telegraph             1,159,537       12.41
Other            Northern Echo                   1,122,361       12.01
National         The Independent                 994,550         10.64
National         The Guardian                    865,584         9.26
Other            Liverpool Daily Post and Echo   817,393         8.75
Other            Belfast Telegraph               754,241         8.07
Tabloid          The Daily Mirror                728,413         7.79
Other            The East Anglian Daily Times    246,350         2.64
Other            Alton Herald                    153,411         1.64
Other            Ulster News                     16,285          0.17
Total                                            9,345,878       100.0
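The shares by newspaper type quoted in the text (a third national, 21% for the two London tabloids) can be recomputed from the per-title counts in Table 5.17. The sketch below is illustrative only; the list layout and variable names are assumptions of mine, and Today is added to the tabloid share by hand because the BNC labels it ‘Other’.

```python
from collections import defaultdict

# (newspaper type, title, token count), copied from Table 5.17.
newspapers = [
    ("Other", "Today", 1_248_922),
    ("Other", "The Scotsman", 1_238_831),
    ("National", "The Daily Telegraph", 1_159_537),
    ("Other", "Northern Echo", 1_122_361),
    ("National", "The Independent", 994_550),
    ("National", "The Guardian", 865_584),
    ("Other", "Liverpool Daily Post and Echo", 817_393),
    ("Other", "Belfast Telegraph", 754_241),
    ("Tabloid", "The Daily Mirror", 728_413),
    ("Other", "The East Anglian Daily Times", 246_350),
    ("Other", "Alton Herald", 153_411),
    ("Other", "Ulster News", 16_285),
]

# Sum the token counts for each BNC newspaper type.
tokens_by_type = defaultdict(int)
for paper_type, _title, tokens in newspapers:
    tokens_by_type[paper_type] += tokens

total = sum(tokens_by_type.values())
shares = {paper_type: n / total for paper_type, n in tokens_by_type.items()}

# Today is labelled 'Other' in the BNC but is treated as a tabloid in the text.
today_tokens = next(tok for _t, title, tok in newspapers if title == "Today")
tabloid_share = (tokens_by_type["Tabloid"] + today_tokens) / total
provincial_share = (tokens_by_type["Other"] - today_tokens) / total

for paper_type, share in shares.items():
    print(f"{paper_type}: {share:.1%}")
print(f"Tabloids including Today: {tabloid_share:.1%}")
print(f"Provincial dailies excluding Today: {provincial_share:.1%}")
```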
Table 5.18 The coverage of different newspaper texts in the BNC

Text type                                                   %
Features (arts, commerce, science, social news, & sports)   43.9
Reportage & editorial                                       36.2
Miscellaneous                                               11.1
Unclassified tabloid                                        8.1
Total                                                       100
All the ‘editorial’ content of the subcorpus is derived from national newspapers, in particular from The Independent (Table 5.19).

Regional diversity
The concentration here is on newspapers from South-eastern England (53% of the total) followed by one Scottish newspaper (13.26%) and a Darlington-based newspaper (Northern Echo) from the North East
Table 5.19 Genre distribution in the BNC informative newspapers subcorpora

                     Provincial (‘other’)   National    Tabloid   Total
No. of tokens        5,597,794              3,019,671   728,413   9,345,878

Genre distribution
Sports               18.36%                 9.86%                 14.18%
Social               20.42%                 2.71%                 13.11%
Commerce             7.42%                  14.07%                8.99%
Arts                 4.27%                  11.65%                6.32%
Science              0.98%                  2.16%                 1.29%
Reports              48.54%                 21.97%                36.17%
Miscellaneous        0.0%                   34.2%                 11.1%
No classification                                       100%      7.79%
Editorial            0.00%                  3.37%                 1.09%
Total                100%                   100%        100%      100%

‘Level of writing’   Medium                 Medium      Low
(12.01%). There are no newspapers from the Midlands or Wales (see Table 5.20).
Periodicals: Popular lore magazines

Popular lore in the BNC covers a range of topics: applied sciences, arts, commerce, leisure, world affairs, social science and natural science. The magazines included range from Amnesty to Dogs Today and from Esquire to Machine Knitting; the game of cricket is well represented with the magazine The Cricketer and the cricketers’ almanac Wisden. The category leisure comprises 49.1% of all tokens, followed by the arts comprising 31%, a total of 80% of the corpus. The commerce section, exclusively drawn from the authoritative and establishment magazine The Economist, comprises 11.6% of the popular lore subcorpus. The three sciences, applied, natural and social, comprise just over 7% of the whole subcorpus. The ‘level of writing’ indicates that two thirds of the text has a ‘medium’ level of difficulty, followed by a quarter of the texts that have
Table 5.20 Regional coverage of newspaper texts in the BNC

Region             Place of publication   Titles                                                  %
Southern England   London                 Daily Telegraph, Independent, Guardian, Today, Mirror   53.47
                   Hampshire              Alton Herald                                            1.64
Scotland           Edinburgh              Scotsman                                                13.26
Northern England   Darlington             Northern Echo                                           12.01
                   Liverpool              Liverpool Daily Post & Echo                             8.75
Northern Ireland   Belfast                Belfast Telegraph, Ulster News                          8.24
Eastern England    Norfolk                East Anglian Daily Times                                2.64
‘low level of writing’; only 7% are labelled as ‘high level of writing’ (see Table 5.21). Around 11.6%, or just about one in eight tokens, in the magazine subcorpus is from The Economist, an establishment magazine; 8.2% of the tokens are from The Art Newspaper. There is a nod towards popular culture in the inclusion of New Musical Express (5.4%) and Guitarist (4.4%), and an eclectic range of leisure pursuits is included (see Table 5.22).
Periodicals: Specialist discourse journals

The specialised discourse covers a number of disciplines categorised under four broad themes: applied science, natural science, social science and the arts. The ‘sciences’ make up over 90% of the corpus and most of the texts have a high level of writing (73%) (see Table 5.23). Just under half (46%) of the specialised discourse corpus comes from 10 publications covering all four domains of knowledge, and 19 texts from two journals comprise about a third of all texts when measured as number of tokens in texts. A very large number of legal texts, 75 texts from the Weekly Law Reports, comprise just over 1% of all texts in this subcorpus (28,504 tokens out of a total of 2,666,752 tokens) (see Table 5.24). Oxford University Press has been very generous again here, contributing a third of the tokens in the top 10 texts: 388,606 tokens (in Table 5.24), comprising about 15% of the specialist discourse within the BNC.
Table 5.21 The subject diversity of the popular lore genre in the BNC

                      Leisure     Arts        Commerce   Applied sciences   Natural sciences   World affairs   Social science   Total
Rank                  1           2           3          4                  5                  6               7
No. of tokens         3,592,067   2,265,919   856,200    408,135            73,423             87,056          31,439           7,314,239
%                     49.1        31.0        11.7       5.6                1.2                1.0             0.4              100.0

Gender distribution
Male                  3.08%                              2.86%                                                                  1.67%
Female                            1.75%                                                                        100.00%          0.97%
Others                96.92%      98.25%      100.00%    97.14%             100%               100.00%                          97.36%
Total                 100%        100%        100%       100%               100%               100%            100%             100%

‘Level of writing’
High                  2.4%        14.2%                  24.9%                                                                  7.0%
Medium                60.6%       72.7%       100.0%     28.1%                                 100.0%          100.0%           67.2%
Low                   37.0%       13.1%                  47.0%              100.0%                                              25.8%
Total                 100%        100%        100%       100%               100%               100%            100%             100%
Table 5.22 Periodicals that dominate the popular lore subcorpora

Rank    Periodical                     No. of articles   No. of tokens   %
1       The Economist                  14                856,200         11.6
2       The Art Newspaper              12                607,897         8.2
3       New Musical Express            7                 396,560         5.4
4       Guitarist                      6                 321,518         4.4
5       Esquire                        5                 293,758         4.0
6       The Face                       6                 233,206         3.2
7       Practical Fishkeeping          5                 230,127         3.1
8       Climber and Hill Walker        7                 201,773         2.7
9       [Articles from Practical PC]   1                 191,743         2.6
10      Outdoor Action                 7                 188,359         2.6
11      Rugby World and Post           5                 168,645         2.3
Total                                  75                3,689,786       50.0
Table 5.23 Subject diversity of the specialised genre in the BNC

                      Applied sciences   Natural sciences   Social sciences   Arts      Total
No. of tokens         896,129            652,955            916,253           201,415   2,666,752

‘Level of writing’
High                  85%                31%                86%               100%      73%
Med                   15%                69%                14%                         27%
Total                 100%               100%               100%              100%      100%

Table 5.24 Periodicals that dominate the specialised subcorpora

Domain            Discipline                      No. of tokens   Texts   Title                     Publisher
Applied science   Gastroenterology & hepatology   713,179         5       Gut: J. of Gast. & Hep.   BMA
Natural science   Biochemistry                    202,931         14      Nucleic Acids Res.        OUP
Social science    Law (reports)                   28,504          75      The Weekly Law Rep.       HMSO
Arts              Language & literature           79,106          26      Lang. & Lit.              Longman
Arts              Music                           71,784          2       Early Music               OUP
Social science    History                         26,250          1       20th Cen. Br. History     OUP
Applied science   Computers                       28,504          4       Electronic Pub            John Wiley
Social science    Politics                        18,558          1       Parliamentary Affairs     OUP
The arts          Art                             26,225          1       Oxford Art J.             OUP
The arts          Film studies                    24,300          1       Screen                    OUP
Applied science   Linguistics                     18,558          1       J. of Semantics           OUP
Total                                             1,237,899       131

Afterword

There can be little doubt that corpus linguistics has had a considerable effect on lexicography in the last 50 years or so; major publishers of English dictionaries and grammars advertise the fact that their lexicographers have had access to a large corpus (approximately 100 to 500 million words) in the production of lexica, grammar texts and other language teaching and learning resources, in a sense fulfilling a need identified by Randolph Quirk (1968). And there are supermarkets selling language resources, including text corpora, in Philadelphia (Linguistic Data Consortium, 2006) and Paris (European Language Resource Association, 2006). Text corpora have been used in translation, in ontology studies in web-based computing, and in terminology studies (see Note 2). The corpus builders have the learner firmly in mind: the diminution of texts written in literary language in the modern corpora, and the large quantities of informative texts drawn from science and technology, show that the builders are focussing on the learner who would like to study science and technology. What of the representative corpora? Ergo, how representative is the text in the BNC? One can ask subjective questions such as: have the BNC builders sampled the writings of the various social classes in England? This is a difficult question to answer without checking in detail the biographical details of the authors of all the 4000-plus texts, items of information not readily available from the BNC description. However,
when the speech samples were recorded by the corpus builders they did note the social class distribution of the informants. And what of some of Stubbs’ principles (1996) that I described at the beginning of this chapter?

(a) ‘Language should be studied in actual, attested, authentic instances of use, not as intuitive, invented, isolated sentences’ (p. 28);
(b) ‘language in use transmits the culture’ (p. 43); and
(c) the unit of study must be whole texts (p. 32).
All corpus compilers have endeavoured to study actual, authenticated instances of language use in its various genres and styles. In the selection of texts used in the various corpora discussed above, one can discern a preponderance of texts used by ‘educated native English speakers’; and in the case of the BNC, the speakers appear to read texts published by many enterprises located in Southern England. The text used in most of the corpora, with the possible exception of the Bank of English, is only a part of the text and not the whole text; in the BNC there is evidence that just under 10% of the texts are whole texts. The social class distinction in the BNC is quite vivid: here just less than 40% (59 out of 153) of the speech samples come from the more affluent and more educated strata of the society (Table 5.25). The pressing problem in any scientific, technological and creative endeavour is that of the observers’ personal framework of thinking, social interactions, and biases and prejudices: all the traits that make humans human. The questions regarding personal framework are not unique to corpus builders; psychologists, sociologists, scholars of the arts and other similar human endeavours face the same problem. For example, if the gender of the author of a text is an important attribute, then why is it that authors of one gender are over-represented in the informative texts in the BNC, where 68% of the book excerpts, about 20 million tokens, were written by men and only 17% by female authors?

Table 5.25 The ‘representation’ of the various social classes in the BNC speech corpora

Social class   Texts % (speech)
AB             38.56
C1             23.52
C2             20.26
DE             13.07
Unclassified   4.57
The selection of a very large number of texts from one publisher in one place at one time in some respects defies some of the basic tenets of random sampling. I do understand the pragmatics of choosing one single publisher; for expeditious corpus building one has to approach publishers who have some empathy with research in language sciences and arts. In our discussion of ‘newspaper’ text, we encountered yet again the perennial argument about standard English: tabloids appear to be underrepresented in the BNC, and appear and disappear in the Bank. If the point is that learners of English outside England only need to know about academic English, then it is not clear how texts published in one or two journals, dedicated to very narrow subdomains of a specialism, will help the learner. The ‘representation’ of different human endeavours specific to one interest (variously called domain, discipline or specialisation, and instantiated as the Arts, Natural Science or Applied Science, for example) has its own problems. Can one 18,000-word paper in journals like the Journal of Semantics or 25 papers, comprising 77,793 words, in Language and Literature be regarded as representative text in the study of language? Can biochemistry be represented by one journal? In the BNC, will the 12 research papers published in Nucleic Acids Research be the learners’ ‘universe’ in biochemistry? The texts in the BNC are largely focussed on a ‘medium’ to ‘high’ level of difficulty of reading (81.6%); comparatively fewer texts have a ‘low’ level of difficulty (18.4%) (see Table 5.26). The assignment of a level of ‘difficulty’ is prescriptive. Texts published in self-selecting journals, newspapers and magazines may indeed be difficult to understand. This difficulty may lie in the abstruse concepts, and sometimes counterintuitive arguments, being presented with an unfamiliar vocabulary, a vocabulary that is ordered by the grammar used by a closed community of language users.
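The 81.6% and 18.4% figures quoted above can be recomputed from the token counts reported in Table 5.26. The following Python sketch does so; the dictionary and variable names are illustrative and are not taken from the BNC documentation.

```python
# Token counts per 'level of difficulty' for the written part of the BNC,
# copied from Table 5.26.
level_tokens = {
    "High": 24_485_484,
    "Medium": 46_675_620,
    "Low": 16_015_520,
}

written_total = sum(level_tokens.values())

# Share of the written corpus at each level of difficulty.
level_share = {level: n / written_total for level, n in level_tokens.items()}

# Combined share of the 'medium' and 'high' texts.
medium_to_high = level_share["Medium"] + level_share["High"]

for level, share in level_share.items():
    print(f"{level}: {share:.1%}")
print(f"Medium to high: {medium_to_high:.1%}")
print(f"Written tokens in total: {written_total:,}")
```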
Table 5.26 Level of difficulty in the BNC texts; numbers relate to the written part of the corpus

Level of difficulty   No. of tokens   %
High                  24,485,484      28.1
Medium                46,675,620      53.5
Low                   16,015,520      18.4
Total                 87,176,624      100

Three points are perhaps worth noting here. First, the choice of lexical items, and to a lesser extent that of grammatical constructs, depends upon the author and the author is expected to be writing without the overt or covert coercion of his or her domain
community. Second, lexical choice can be monitored and peer pressure can in many ways be quite persuasive. There are anecdotal stories about the Nobel Laureate Barbara McClintock suggesting it was her plain (and ‘folksy’) style when treating complex issues in botany that prevented her colleagues from recognising her genius much, much earlier than her 80th birthday or so. And it was not so long ago that the term behaviour was term non grata at the linguistics high table. And third, given the dominance of a small group of metropolitan publishers within the BNC, it will be difficult to put forward the notion of random sampling very forcefully.

There is a touch of modernity (the 21st century) in corpus linguistics: the use of text technology that helps to handle gigabytes of texts, for some, lends a degree of objectivity to the enterprise of corpus linguistics. The objectivity is based on the use of statistical methods in the analysis of randomly sampled texts. Technology and objectivity are essential for good science but so is scepticism. The scepticism is due to the fact that while language is very accessible to human beings, it has invariably escaped the matrix of theories, rules and schemata that have been proposed since the times of Panini. No doubt English language scholars have done well, but there is a long way to go before more technology and more objectivity will help reduce scepticism about the fascinating way of teaching and learning language.

Notes

1. The LOB Corpus originated in 1970 at the University of Lancaster (UK) under the direction of Geoffrey Leech. In 1977 the project was transferred to Oslo and Bergen in Norway under the direction of Knut Hofland and Stig Johansson.
2. Corpus-based studies have worked for me in that, using text corpora in conjunction with text analysis and knowledge representation systems, I have been able to automatically extract candidate terminology for specific domains (Ahmad, 1995; Ahmad & Rogers, 2001), and then to propose, quite successfully, tentative conceptual structures, or ontology, of these domains by analysing the relationships between the candidates (Gillam et al., 2005). In this respect, conversations with John Sinclair (Tuscan Word Center), Wolfgang Teubert (University of Birmingham) and Christer Laurén (University of Vaasa) have helped a great deal.
References

Ahmad, K. (1995) Pragmatics of specialist terms and terminology management. In P. Steffens (ed.) Machine Translation and the Lexicon (pp. 51-76). 3rd Int. EAMT Workshop, Heidelberg, Germany, 26-28 April 1993. Heidelberg, Germany: Springer. (Lecture Notes on Artificial Intelligence 898).
Ahmad, K. and Al-Sayed, R. (2005) Community of practice and the special language ‘ground’. In S. Clarke and E. Coakes (eds) Encyclopaedia of Knowledge Management and Community of Practice (pp. 77-88). Hershey: The Idea Group Reference.
Ahmad, K. and Rogers, M. (2001) Corpus linguistics and terminology extraction. In S.-E. Wright and G. Budin (eds) Handbook of Terminology Management (Vol. 2, pp. 725-760). Amsterdam & Philadelphia: John Benjamins Publishing Company.
Aijmer, K. and Altenberg, B. (eds) (1991) English Corpus Linguistics: Studies in Honour of Jan Svartvik. London & New York: Longman.
Aston, G. (2001) Text categories and corpus users: A response to David Lee. Language Learning & Technology 5 (3), 73-76. On WWW at http://llt.msu.edu/vol5num3/aston/. Accessed 7.11.05.
Aston, G. and Burnard, L. (1998) The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
European Language Resource Association (2006) On WWW at http://www.elra.info/. Accessed 4.10.06.
Fowler, R. (1987) A Dictionary of Critical Terms (revised edition). London and New York: Routledge & Kegan Paul.
Francis, W.N. and Kučera, H. (1979) Brown Corpus Manual. On WWW at http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM. Accessed 17.7.06.
Gillam, L., Tariq, M. and Ahmad, K. (2005) Terminology and the construction of ontology. Terminology 11 (1), 55-81.
Halliday, M.A.K. (1991) Corpus studies and probabilistic grammar. In K. Aijmer and B. Altenberg (eds) English Corpus Linguistics: Studies in Honour of Jan Svartvik (pp. 30-44). London & New York: Longman.
Halliday, M.A.K. (1993) Quantitative studies and probabilities in grammar. In M. Hoey (ed.) Data, Description, Discourse: Papers on the English Language in Honour of John Sinclair (pp. 1-25). London: Harper-Collins.
Halliday, M.A.K. (2004) Lexicology. In M.A.K. Halliday, W. Teubert, C. Yallop and A. Čermáková (eds) Lexicology and Corpus Linguistics: An Introduction (pp. 1-22). London & New York: Continuum.
Harris, Z. (1991) A Theory of Language and Information: A Mathematical Approach. Oxford: Clarendon Press.
Kachru, B. (ed.) (1992) The Other Englishes: English Across Cultures (2nd edn). Urbana and Chicago: University of Illinois Press.
Kahane, H. (1992) American English: From a colonial substandard to a prestige language. In B. Kachru (ed.) The Other Englishes: English Across Cultures (2nd edn) (pp. 211-219). Urbana and Chicago: University of Illinois Press.
Lee, D.Y.W. (2001) Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology 5 (3), 37-72. On WWW at http://llt.msu.edu/vol5num3/lee/. Accessed 7.11.05.
Linguistic Data Consortium (2006) On WWW at http://www.ldc.upenn.edu. Accessed 17.7.06.
Malmkjær, K. (2005) Linguistics and the Language of Translation. Edinburgh: Edinburgh University Press.
Matthews, P.H. (1997) Oxford Concise Dictionary of Linguistics. Oxford & New York: Oxford University Press.
Quirk, R. (1968) Essays on the English Language: Medieval and Modern. London & Harlow: Longmans, Green and Co. Ltd.
Quirk, R. (1995) Grammatical & Lexical Variance in English. London & New York: Longman.
Renouf, A. and Sinclair, J.McH. (1991) Collocational frameworks in English. In K. Aijmer and B. Altenberg (eds) English Corpus Linguistics: Studies in Honour of Jan Svartvik (pp. 128-143). London & New York: Longman.
Sinclair, J.McH. (ed.) (1987) Looking Up: An Account of the COBUILD Project. London and Glasgow: Collins ELT.
Sinclair, J.McH. (1991) Corpus, Concordance and Collocation. Oxford: Oxford University Press.
Stein, G. and Quirk, R. (1991) On having a look in a corpus. In K. Aijmer and B. Altenberg (eds) English Corpus Linguistics: Studies in Honour of Jan Svartvik (pp. 197-203). London & New York: Longman.
Stubbs, M. (1996) Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Oxford & Cambridge, USA: Blackwell Publishers.
Summers, D. (1993) Longman/Lancaster English Language Corpus: criteria and design. International Journal of Lexicography 6 (3), 181-208.
Svartvik, J. (1966) On Voice in the English Verb. The Hague: Mouton.
Svartvik, J. (2006) ‘Jan Svartvik’. On WWW at http://www.ucl.ac.uk/englishusage/about/svartvik.htm. Accessed 1.10.06.
Teubert, W. (2003) Writing, hermeneutics and corpus linguistics. Logos and Language IV (2), 1-17.
Wellek, R. and Warren, A. (1963) Theory of Literature (3rd edn). Harmondsworth: Penguin Books Ltd.
Chapter 6
Translating Discourse Particles: A Case of Complex Translation1 KARIN AIJMER
Introduction

It is a common observation that discourse particles ‘do not translate well’ in the sense that they have no satisfying correspondences in other languages (Fillmore, 1984: 128; cf. Wierzbicka, 1976: 327). Joan Tate, well known for her translations from Swedish into English, found them tricky because they can have many different functions depending on their context and because of their expressive qualities:

All those small words are tricky and can mean different things according to context . . . Much more difficult to translate are what I call ‘Swedish noises’: jaha, jaha du, jodå, jaså, jasåååå, ja visst ja, javäl, men du, vet du vad, säger du det, nämen du, jarå, visst, visst inte, nej du, nejdå, nerå, jodå, jaså du, ojdå, ojojojdå, usch, ajdå, vad säger du, säger du det du, voj voj! English has more or less nothing but yes, no, well, um or er and nothing like the expressiveness of those sounds. (personal correspondence with Joan Tate)

Words that lack systematic lexical correspondences in another language constitute ‘a crucial and stimulating area for translation theory’ (Bazzanella & Morra, 2000). The translator’s problems may reflect linguistic differences between languages that have consequences for the way in which people think and act (Gumperz & Levinson, 1999: 2) and for the general problem of translation (cf. for example, Bazzanella & Morra’s discussion of Quine’s theory of the indeterminacy of translation). In English there is oh as an alternative corresponding to what Joan Tate so aptly called ‘Swedish noises’. Oh is one of the most frequently used discourse particles in English and is interesting to study from a translation perspective. In this chapter I will show how translations into one or more languages can help us to get a better picture of its meaning and of its correspondences. To begin with, we need to describe what we mean by discourse particles.
Discourse Particles Characterised
As has been observed on several occasions, discourse particles are difficult to define in formal terms and it is not clear whether they should be regarded as a special word class (cf. Hansen, 1998; Östman, 1982, 1995; Van Baar, 1996). Discourse particles, for example, have the form of manner adverbs (well), interjections (oh), conjunctions (and, but), clauses (I think). Oh is both an interjection (when it stands alone) and a discourse particle. Prototypical examples of the category are small words or phrases such as oh, well, now, I mean. Their frequency in spoken language is remarkably high and they cannot be omitted from the conversation without a loss of naturalness. Andersen (2001: 21) describes them with the following features (adapted from Brinton, 1996: 33ff). Discourse particles (including items like ah, actually, and, just, like, now, really, well, I mean, I think and you know):2
    • constitute a heterogeneous set of forms which are difficult to place within a traditional word class;
    • are predominantly a feature of spoken rather than written discourse;
    • are high-frequency items;
    • are stylistically stigmatised and negatively evaluated;
    • are short items and are often phonologically reduced;
    • are considered to have little or no propositional meaning, or at least to be difficult to specify lexically;
    • occur either outside the syntactic structure or are loosely attached to it and have no clear grammatical function;
    • are optional rather than obligatory features;
    • may be multifunctional, operating on different levels (including textual and interpersonal levels).
The semantic properties of a word determine how it is translated into another language. This is true also when the meaning of a word cannot be identified with an object, situation or event in the world. Discourse particles do not affect the truth conditions of the utterance and lack referential meaning. Their basic function is to take up an intersubjective stance to a particular world-view or to the ongoing interaction, whether to agree or disagree with this opinion (Aijmer & Simon-Vandenbergen, 2003). They are used strategically to sway opinions, confirm the speaker’s commitment, reject or accept previous statements. Their multifunctionality makes them difficult to translate into another language. The interpersonal function may, for instance, be expressed in many different ways depending on the language and on norms of speaking in a society.
Universal and Language-specific Studies of Discourse Particles
Earlier cross-linguistic studies of discourse particles have demonstrated that there are shared (perhaps universal) patterns of grammaticalisation across languages. Fleischman and Yaguello (2004) explained the ‘striking functional similarity’ between English like and the French genre as metaphorical extensions from the same lexical source. However, we need to test similar data by looking at more languages:
    But to the extent that markers are translatable – a premise our data strongly support – and given the high probability that certain discourse functions are if not universal then at least widely attested across languages, it seems that cross-language pragmatics would be well served by additional studies testing out the findings of our investigation on data from other languages and/or similarly slanted studies of other discourse functions. (Fleischman & Yaguello, 2004: 16)
The correspondences revealed by translation are functional rather than lexical (semantic). Bruti (1999), for example, found only a weak correspondence in a cross-linguistic perspective between English in fact and Italian infatti. Although the lexical-semantic source was the same, the overlap of functional correspondence of the English and Italian cognates was much smaller than that of functional divergence (cf. also, Bazzanella & Morra, 2000; Carlson, 1984; Wierzbicka, 1976). Altenberg and Granger (2002: 19) observe that even cognate or functionally similar items seldom reach a mutual correspondence of 80%. Another problem is that not all discourse particles have a lexical correspondence in other languages. When there is a gap in the other language, it is particularly interesting to see what the translator has chosen instead. There are several models that could be used to explain similarities and differences between the compared languages depending on whether we focus on what is universal or what is language-specific.
Theories focusing on grammaticalisation (for example, Traugott, 1995) and on linguistic relativity (Gumperz & Levinson, 1996) are both of interest. Grammaticalisation theory provides a dynamic explanation of the relation between meaning and culture by emphasising how meaning shifts are rooted in the use of language in discourse and the social and cultural context, while linguistic relativity pays attention to semantic and cultural differences and the relation between the two. Contrastive studies also form a complement to typological studies sampling and classifying the world’s languages according to the lexicogrammatical patterns recurring in several languages. The aims are, however, different. The aim in contrastive pragmatic corpus-based
studies is to study ‘small facts’, for instance, what a discourse particle in language X corresponds to in language Y. However, these ‘small facts’ are often instances of larger practices by which languages differ in a systematic way (Fillmore, 1984: 134).
The English–Swedish Parallel Corpus and Representativeness
Translation is ‘one of the very few cases where speakers evaluate meaning relations between expressions not as part of some kind of metalinguistic, philosophical or theoretical reflection, but as a normal kind of linguistic activity’ (Dyvik, 1998: 51). The translator is therefore an ideal linguistic ‘informant’ and a corpus of translations can be used for ‘empirically testing one’s intuitions (or hypotheses) about the semantics of linguistic forms that is complementary to the systematic exploitation of the circumstantial evidence provided by monolingual corpora’ (Noël, 2003). The English–Swedish Parallel Corpus (ESPC) does not contain authentic speech. However, it can be assumed that discourse particles are used for the same purposes in literary dialogue as in authentic conversation. Interpersonally they are used to mark social relationships and on the textual level to signal cohesive connections. A translation corpus consists of texts in one language and their translations into another and documents the bilingual correspondences between words, grammatical constructions and complete texts. For example, the ESPC contains both English texts translated into Swedish and Swedish texts translated into English (cf. Altenberg & Aijmer, 2000). The corpus currently contains almost three million words and has been used for research in contrastive linguistics and in translation theory. All the translations that have been included in the corpus have been carried out by professional translators. Many different translators are represented in order to avoid translator bias. The corpus has been aligned at sentence-level in order to facilitate searches that can be made with the help of special software (Ebeling, 1998). A corpus that is used either for contrastive studies or for translation studies should ideally be representative of the whole language.
Both fiction and non-fiction texts are therefore represented, which makes it possible to compare translation patterns in the two varieties. Different text types are represented, each with its close correspondence in the other source language. The corpus is balanced as far as this is possible: for example, corresponding to children’s books in one language, there are children’s books in the original texts in the other language. What we get from the translation corpus is ‘raw’ material. The translations are always the translator’s subjective interpretation of the source text, an ‘invention’
adapted to norms rather than a discovery of connections between words and constructions in the two languages (cf. Teubert, 2002). A criterion of a natural and appropriate translation would be the selection of forms and constructions that are native-like in form (cf. Pawley & Syder, 1983). However, speakers in general have poor intuitions about the meaning and uses of discourse particles. This was shown in a test by Fischer (2001: 37), which included English oh and ah as well as German oh and ach. When asked to describe the meanings of the English and German particles, several speakers opted for ‘no meaning’ or for lists of emotional terms. A translation corpus has greater potential than introspection and offers a firmer empirical basis for cross-linguistic studies as it is based on the judgements made by many translators on the basis of context. One important use of the corpus is to ‘serve to disambiguate polysemous items, reveal the degree of mutual correspondence of lexical items in different languages, and uncover cross-linguistic sets of translation equivalents in the languages compared’ (Altenberg & Granger, 2002: 14). A problem is that some translations may be ‘bad’ and the correspondences slanted towards the source language. For example, åh was chosen as a translation of oh much more often than it appeared in original texts and it was less frequent as a source of oh than as a translation (cf. Tables 6.1 and 6.2). A translation corpus therefore has to be used with caution and the translations evaluated.
In the following discussion I will show how translations into two or more languages help us to analyse the meaning and multifunctionality of discourse particles and succeed in throwing light on important issues such as their grammaticalisation.
The Discourse Particle Oh
The task of translating discourse particles has to take into account several ‘layers’ of conceptual and contextual factors (cf. Altenberg & Granger, 2002: 27). Their translation reflects a consideration of syntactic position, collocations, text type and lexical source. As discourse particles do not contribute to truth conditions, translators often choose to omit the discourse particle altogether. Oh is frequent in authentic speech (rank 43 in informal conversation in the London-Lund Corpus (cf. Svartvik, 1990: 67)). In the fiction part of the ESPC, there were 178 examples, which makes it one of the most frequent discourse particles in the corpus (cf. well 182 instances). It has been extensively studied (for example, Bolinger, 1989; Fischer, 2000; Heritage, 1984; James, 1978; Schiffrin, 1987), and a number of different theories have been proposed for its analysis. In order to analyse the data from the translation corpus we need to find a theory that can help us to explain the multifunctionality that is
Table 6.1 Translations of oh into Swedish in the ESPC

Translation equivalent                  Frequency
åh                                      54
å                                       10
jaså (jaså jo, jaså ja, jaså jaha)      9
ja                                      8
men                                     7
javisst (ja)                            6
jodå                                    4
ack (ja)                                3
oh                                      3
åhå                                     2
jo (jo-o)                               2
jaha                                    2
jadå                                    1
minsann (‘indeed’)                      1
för all del (‘by all means’)            1
oj då                                   1
ah                                      1
visst (‘certainly’)                     1
nej men (‘no but’)                      1
nej förresten (‘no by the way’)         1
faktiskt (‘actually’)                   1
nåja                                    1
just ja                                 1
other (a)                               7
ø                                       50
Total                                   178

(a) The category ‘other’ includes translations deviating too much from the original text to be regarded as equivalents. However, closer analysis may show that the translations suggested under ‘other’ may throw light on the meaning of the source text.
shown by the cross-linguistic correspondences. A basic difference between theories of discourse particles has to do with whether the meaning of discourse particles is vague or underspecified or whether they should be regarded as polysemous (Fabricius-Hansen, 1999). If oh is polysemous we have to look for its core meaning. Oh is assumed to have the core meaning surprise with a gestural origin ‘and continuing gestural associations’ (Bolinger, 1989: 266). Many languages have an interjection phonetically similar to oh (cf. the discussion of Swedish åh and German oh and ah below). Oh also occurs loosely associated with a following utterance:
    (1) ‘What’s the matter with her?’ Marjorie yawns evasively. ‘Oh, nothing serious.’
        (David Lodge)
Many linguists have recently been influenced by Bakhtin’s notion of heteroglossia and the assumptions about language and language use it implies (cf. White, 1999, 2000). Texts are what Bakhtin calls heteroglossic. They ‘address alternative realities as expressed in previous texts and as they are expected to be realised in future texts’ (White, 1999). This analysis makes it possible to account for the rhetorical functions of discourse particles as well as for other elements by means of which speakers take up attitudinal positions or ‘stances’ as heteroglossic options. For example, by means of oh, speakers recognise the notion of heteroglossic diversity and enter into dialogue with alternative positions which either diverge from or converge with their own thoughts. Thus the difference between oh nothing serious and nothing serious in example (1) is one of the degree to which the speaker acknowledges the dialogic or heteroglossic context in which the utterance operates. In a heteroglossic approach discourse particles may be either backward-looking or forward-looking, reflecting the speaker’s stance towards what has happened before, to the hearer or to a new stage in the discourse (cf. Andersson, 1984; Lenk, 1998).
Oh is, for instance, a backward-looking ‘stance marker’. It polemicises with earlier utterances or treats them as agreeable or simply presupposes that an earlier utterance is new to the hearer. Depending on the context, it can also signal a new departure in the conversation. Below in the discussion of ‘form-function equivalents’, I will consider some of the strategic options in the use of oh. The focus will be on the most frequent translations: the cognate åh and the discourse particle ja and its variants (jo, jaså, jaha).
The Translation Paradigm of Oh
In Swedish, the semantic equivalent of English oh is åh, å. However, there is only partial equivalence between them. Table 6.1 gives an overview of the ‘rich translation equivalence’ of this particle in Swedish (Dickens & Salkie, 1996). The data are taken from the fiction part of the corpus (roughly 1.4 million words). The translations in the corpus show how the semantic space occupied by the discourse particle is filled in. Oh was translated in 24 different ways including interjections, conjunctions, discourse particles, adverbs, phrases or no translation. Åh, å, åhå, ack ja, oh and ah are interjections. They focus on exclamatory meanings like surprise, disappointment, etc. Åh (å) may be linked with the ensuing utterance but nevertheless is interjectional (cf. Bolinger, 1989: 266–282; Schourup, 2001: 1049). Jodå, jaså, jaså jo, ja, jo, jaha, ja ja, jadå and javisst ja are response words. Men (‘but’) is a conjunction and faktiskt (‘actually’) an adverb. As shown, the most frequent translation is åh. Åh (and å) together account for 36% of the uses. The translation not only mirrors the meaning of oh but shows how it fits into a particular context. A certain translation may therefore be frequent because it is chosen in many contexts. Eleven translations were singletons. Dyvik (1998: 71) regards singletons as ‘spurious senses’ because of their low frequency. However, even unique translations may have some evidential value. They contribute to the translation profile of the discourse particle and illustrate the close tie between the context and translation, which forces the translator to vary the translation in order to preserve the function of oh in the original text. If the size of the corpus were to be increased, it is likely that the number of singleton translations would also increase.
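The proportions quoted in this section can be recomputed directly from the frequencies in Table 6.1. The following is a minimal sketch in Python, not part of the original study: the dictionary simply restates the table, and the names `table_6_1`, `share`, `cognate_share`, `omission_share` and `singletons` are illustrative assumptions.

```python
# Frequencies restated from Table 6.1 (fiction part of the ESPC).
# "ø" stands for omission; "other" is the table's residual category.
table_6_1 = {
    "åh": 54, "å": 10, "jaså": 9, "ja": 8, "men": 7, "javisst (ja)": 6,
    "jodå": 4, "ack (ja)": 3, "oh": 3, "åhå": 2, "jo (jo-o)": 2, "jaha": 2,
    "jadå": 1, "minsann": 1, "för all del": 1, "oj då": 1, "ah": 1,
    "visst": 1, "nej men": 1, "nej förresten": 1, "faktiskt": 1,
    "nåja": 1, "just ja": 1, "other": 7, "ø": 50,
}


def share(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


total = sum(table_6_1.values())

# Share of the cognate interjection (åh and its variant å taken together).
cognate_share = share(table_6_1["åh"] + table_6_1["å"], total)

# Share of instances where oh was left untranslated.
omission_share = share(table_6_1["ø"], total)

# Translations that occur exactly once in the table.
singletons = [item for item, n in table_6_1.items() if n == 1]

# Distinct renderings, counting omission but not the residual "other" row.
distinct_renderings = len(table_6_1) - 1

print(f"total instances of oh: {total}")
print(f"åh + å: {cognate_share:.0f}%")
print(f"omission: {omission_share:.0f}%")
print(f"singletons: {len(singletons)}")
print(f"distinct renderings: {distinct_renderings}")
```

Treating omission as one of the renderings and leaving the residual category aside is an assumption made here so that the count of distinct renderings can be compared with the figure given in the text.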
Intertranslatability
Correspondences are established by our looking at both translations and sources. However, we need to compare sources and translations in order to establish if they are ‘good’ equivalents. Åh occurred only five times as a source of oh and other sources are also infrequent. Oh has often been added in the translation, as is obvious from the high number of zero-correspondences in the source text. The sources are interjections (åh, ah, oh, å, äh, äsch, ojoj, håhåjaja, ack) or response particles (ja, jaså, nåja, jo, javisst ja, tja). The notion of translation equivalent is a measure based on intertranslatability. The intertranslatability between oh and åh is, for example, arrived at by considering their frequencies both as targets and sources. If they are often translated as reciprocal equivalents, the intertranslatability between corresponding items in the two languages is strong. Intertranslatability is also affected by translation bias (cf. oh → åh 30%; åh → oh 9.3%).
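The two directional figures can be derived from the table totals. The sketch below is an illustration only: the chapter does not spell out the calculation, so reading the percentages as simple shares of the Table 6.1 and Table 6.2 totals is an assumption, and the function and variable names are hypothetical.

```python
def percentage(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# English originals (Table 6.1): 178 instances of oh, 54 of them rendered as åh.
oh_rendered_as_ah = percentage(54, 178)

# English translations (Table 6.2): 54 instances of oh,
# 5 of them with åh in the Swedish source text.
oh_with_ah_as_source = percentage(5, 54)

# English translations (Table 6.2): 11 instances of oh
# with no visible correspondence (ø) in the Swedish source text.
oh_added_by_translator = percentage(11, 54)

print(f"oh rendered as åh in Swedish translations: {oh_rendered_as_ah}%")
print(f"oh in English translations with åh as source: {oh_with_ah_as_source}%")
print(f"oh in English translations with no source item: {oh_added_by_translator}%")
```

On this reading, the gap between the two directions is what signals the translation bias discussed in this section: the cognate is chosen far more readily when translating out of English than it is found as a source when translating into English.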
The overuse of an item may be the result of transfer from the source language. For example, åh was chosen as a translation of oh much more often than it appeared in original Swedish texts and was also found as a translation not only of oh but also of other lexical items (cf. Table 6.3). Interjections are often similar across languages: oh and åh are phonetically so close that translators overuse the lexicosemantic correspondence although pragmatically and stylistically this may result in a translation which sounds unnatural. As Gellerstam (1986: 88) points out, the cognate åh in the translation may sound like a translation from English:
    Studying the concordance version of the corpus [material collected from the Language Bank, Göteborg University], I came across a rather insignificant Swedish word åh (the Swedish counterpart of the English oh) and was surprised to see that most of the concordance lines were taken from English.
We would expect natural translations to be native-like in the selection of items and to have the same frequencies as in originals. When there is a gap in the other language or if the language systems are asymmetric, this may be difficult. For example, translating from a language that has a rich tense system into a language not making the same distinctions, we either lose something or add distinctions that are not made in the source language. Also, individual words may lack a direct lexical correspondence in another language, causing translation problems. As can be expected when a word lacks propositional content and there is no natural correspondence in the other language, omission was found to be a common strategy (28% of the cases in Table 6.1). Oh was also added by the translator in 20% of the cases (Table 6.2) when there was no visible correspondence in the Swedish source text, which suggests that it is needed in English to make the dialogue idiomatic.
When not invested with propositional content, on the other hand, oh was often omitted when it did not result in a loss of meaning. For example, in (2), the meaning of oh (marking a new departure) is present in men (‘but’).
    (2) ‘Then when I’m sure that he does understand, that he really does realise, that he feels just terrible, I’m going to open my purse and pull out a gun and shoot him between the eyes.’ ‘Oh, well, sweetheart –’
        (Anne Tyler)

        ‘Sedan när jag är säker på att han fattar, att han verkligen begriper, att han känner sig avskyvärd, då ska jag öppna handväskan och ta fram en revolver och skjuta honom mellan ögonen.’ ‘Men snälla du –’
Table 6.2 Swedish sources of oh in the ESPC

Swedish source      Frequency
åh                  5
jaså                4
å nej               3
oj                  3
å                   3
jaha                2
oh                  2
äh                  2
äsch                3
nej då              2
håhåjaja            2
nåja                2
ojoj                1
men oj              1
jo då               1
jo                  1
då så               1
så                  1
ack                 1
men                 1
javisst ja          1
tja                 1
ø                   11
Total               54
Translation Paradigms and Lexical Fields
Translations may be closely related to each other because they are near-synonyms. They may also be mutually exclusive. This suggests that they are components of a lexical semantic field.
A semantic lexical field is like ‘a large, vague potential ‘‘sense’’ which is not necessarily the sense of one sign, but rather the joint ‘‘sense’’ of a set of semantically related signs’ (Dyvik, 1998: 72). It is established by contrasting related elements in the same language in terms of synonymy or homonymy. We could also use the bilingual correlations in the corpus to account for the fact that a lexical item is ambiguous (rather than homonymous or polysemous). For instance, Dyvik (1998) has shown that the ambiguity of Norwegian tak meaning either ‘hold’ or ‘grip’ on the one hand, or ‘ceiling’ or ‘roof’ on the other, is reflected in the ‘inverse t-image’ (reverse translation) of tak. Depending on the meaning of tak, it corresponded with different Norwegian synonyms, either ‘ceiling’ or ‘hold’. However, discourse particles present special problems. Because of their indeterminate meaning, the translation corpus may be used to show that items are closely related to each other because they share translations. A semantic field established on the basis of contrastive data from two languages could not be expected to show the same neatness as the semantic field in a single language (Fischer, 2000: 210; Hansen, 1998: 68). By means of back-translation, an item in the target text is shown to be related to many different lexical items in the source text. These correspond to translation equivalents and non-equivalents (cf. Ebeling, 1999: 18):
    The most interesting result of back-translation is not when the back-translation produces the initial form but the conditions that hold between items A, B, and C in step 2 (the establishment of translation equivalents) and the non-equivalent items produced in step 3 (back-translations).
When I studied the translations I found that in some cases I could replace oh by well, which was not possible in all contexts. In other cases oh was difficult to distinguish from ah.
In Table 6.3, I show some possible back-translations of the Swedish translations of oh.

Table 6.3 Back-translations of Swedish åh, ja, jaså, jaha

Translation of oh in the Swedish subcorpus    Back-translations
åh (12 examples)                              oh (7), ah (1), wow (1), ø (3)
ja (244 examples)                             yes (160), oh (18), well (27), so (5), ø (30) (less than 5 examples: all right, sure, OK)
jaså (21 examples)                            oh (6), so (4), I see (3), really (2), ah (1), and (1), yes (1), yes, yes (1), ø (2)
jaha (13 examples)                            oh yes (3), that’s right (2), ah well (2), I see (1), is that so (1), OK (1), well well (1), yes now (1), ah (1)
The characteristic picture which emerges is that of clustering and functional overlapping, that is the same source item has several different translations, as shown by Altenberg (1999). Altenberg investigated semantic conjuncts in English with contrastive meanings, that is a delimited semantic field that he compared with a corresponding field in Swedish. In the present analysis the starting point was a single word. Some of the cross-linguistic correspondences form contrastive subsystems. For example, oh is related to ah, wow and other interjections as the back-translation of åh. There is another subsystem consisting of oh, so, I see, really, etc., as shown by the fact that they are translated by jaså. Back-translations show, for example, that ja can be translated by both oh and well. Yes and oh are also related as both are possible translations of ja. There is little overlap between åh, jaså and jaha except that they correspond to oh in the translation. Åh correlates, for example, with other interjections (ah, wow). Jaså is also used to mark a conclusion (so). The translations of jaha include items indicating acceptance (well well, ah well, that’s right). We need to look in more detail at these subsystems based on the relation between form and function in the translations.
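The idea that items are related because they share translations can be illustrated by inverting Table 6.3. The following Python sketch is an assumption-laden illustration and not the procedure used in the study: it ignores frequencies, leaves out zero correspondences (ø), and the names `table_6_3` and `shared_back_translations` are hypothetical.

```python
from collections import defaultdict

# Back-translations restated from Table 6.3, without frequencies.
# Zero correspondences (ø) are left out; the low-frequency items listed
# for ja (all right, sure, OK) are included.
table_6_3 = {
    "åh": {"oh", "ah", "wow"},
    "ja": {"yes", "oh", "well", "so", "all right", "sure", "OK"},
    "jaså": {"oh", "so", "I see", "really", "ah", "and", "yes", "yes, yes"},
    "jaha": {"oh yes", "that's right", "ah well", "I see", "is that so",
             "OK", "well well", "yes now", "ah"},
}


def shared_back_translations(table):
    """Map each English item to the Swedish particles it back-translates,
    keeping only English items linked to more than one Swedish particle."""
    inverse = defaultdict(set)
    for swedish, english_items in table.items():
        for english in english_items:
            inverse[english].add(swedish)
    return {eng: swe for eng, swe in inverse.items() if len(swe) > 1}


shared = shared_back_translations(table_6_3)

for english, swedish in sorted(shared.items()):
    print(f"{english}: {', '.join(sorted(swedish))}")
```

English items that survive the filter are those that connect two or more Swedish particles; items such as wow or well, which occur with only one particle in the table, drop out. Because exact strings are compared, multi-word back-translations such as oh yes are treated as distinct from oh, which is a simplification.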
Form-function Equivalents
When is oh translated by åh? Why is it translated by ja or by jaså? When it is translated by jaså, could the translator also have chosen åh? We need to capture the overlap between oh and å(h) and explain other correspondences, for example between oh and response particles. In (3) oh and åh are correspondences:
    (3) That is to say not just in office hours, but after hours as well. Well, you must have known.
        Oh.
        No? Oh dear!
        (Fay Weldon)

        Det vill säga, inte bara under kontorstid, det vore naturligt, utan på fritiden också. Fast det måste ni ju ha vetat om.
        Åh.
        Inte? Kära nån då.
What seems to be involved in the use of åh is a rather sharp response of surprise directed at what has just been said (‘I didn’t know’) (cf. ah, wow with the same meaning as oh). In (4), å (a variant of åh) is interjectional or exclamatory and expresses the speaker’s emotions:
    (4) Brother Arie bragging about all the fancy people he meets on the job at Hermanus.
        Brother Sonny’s furtive visits; crumpled banknotes thrust in their mother’s hand when Dedda is out to sea.
        Oh, Cape Town, Cape Town!
        (André Brink)

        Bror Arie som skryter om alla de fina människor han träffar i jobbet på Hermanus.
        Bror Sonnys förstulna besök, skrynkliga sedlar som stoppas i händerna på mor när Dedda är till sjöss.
        Å, Kapstaden, Kapstaden!
There are several interjections besides åh which are used in the translation corpus expressing emotions such as surprise, regret, pity or disappointment. Oh in English corresponds to å, äh, ack, äsch in Swedish. In (5), the value expressed by oh and its translation is ‘enhancement’ or upgrading of an emotion (oh as an intensifying modal particle):
    (5) And Macon (oh, he knew it, he admitted it) had been so intent on preparing him for every eventuality that he hadn’t had time to enjoy him.
        (Anne Tyler)

        Och Macon (ack ja, han visste det, han erkände det) hade varit så mån om att göra honom rustad för alla eventualiteter att han inte hade haft tid att glädja sig åt honom.
In many examples the translator has used a response signal with discourse particle function (jaså, jaha, ja) instead of åh. The response signals are used for different, sometimes incompatible, reasons as might be expected in a heteroglossic perspective. When oh is translated by a response signal, the meaning of surprise has been weakened or has disappeared. The main function of oh is a ‘reception marker’ taking up a position towards information provided by another speaker, or to the preceding discourse. The translations suggest that a large number of different functions are expressed by oh and distinguished in the translations although the boundary between different response signals is fuzzy. When oh is translated by ja it is basically positive and conveys that the speaker has perceived and understood the linguistic action and accepts the conversational partner’s right to perform it (cf. Teleman et al., 1999: 786).
Ja is heteroglossic because it recognises that other divergent viewpoints are possible:
    (6) ‘Ten acres of gladiolus?’
        ‘Oh, your brother-in-law Pete was talking about that before you came’.
        (Jane Smiley)

        ‘Tio tunnland gladiolus?’
        ‘Ja, din svåger Peter pratade om det innan du kom’.
In (7), the effect of ja is above all to soften the effect of the utterance that follows.
    (7) ‘You know what a daimon is?’
        ‘Yes, but go on.’
        ‘Oh, of course you’d know. I keep forgetting what a knowing girl you are’.
        (Robertson Davies)

        ‘Vet du vad en daimon är?’
        ‘Ja, men fortsätt.’
        ‘Ja, det är klart att du måste veta det. Jag glömmer hela tiden vilken lärd flicka du är’.
Jaha seems to be distinct from ja. In (8), jaha signals the reaction to new information or a clarification. It does not imply agreement with the preceding speaker but is registering understanding as a response to new information:
    (8) ‘I didn’t mean your chairs, I mean for visitors.’ ‘Oh.’ They don’t know quite how to react.
        (David Lodge)

        ‘Jag menar inte era stolar. För besökare.’ ‘Jaha.’ De vet inte riktigt hur de ska reagera.
Jaså, on the other hand, suggests that the speaker suddenly realises how things are connected or recognises a fact. In (9), jaså is uttered as a response to clarification (cf. jaha):
    (9) ‘Well, I am, but she’s . . . living elsewhere. They don’t allow pets.’ ‘Oh.’
        (Anne Tyler)

        ‘Jo, det är jag. men hon . . . hon är på annat håll. Och där får man inte ha sällskapsdjur.’ ‘Jaså på det viset.’
In such examples we find a sharp contrast with ja as a response signal. Ja never expresses surprise at new information but accepts old, previously known information (Heritage, 1984). In (10), jodå expresses a weak contrast with what has been said earlier. Its main function seems to be to mark a return to the main topic (‘could the speaker have prevented one of her patients from having an abortion’):
    (10) ‘You can’t play Juliet six months pregnant, Jack, which is what she would have been when the run started. Oh, I did my bit, suggested she talk it through with you.’
         (Minette Walters)

         ‘Man kan inte spela Julia när man är i sjätte månaden, Jack, och det skulle hon ha varit lagom till premiären. Jodå, jag gjorde vad som på mig ankom, jag föreslog att hon skulle tala igenom det hela med dig.’
In (11), on the other hand, oh expresses some surprise as shown by its translation. Nej men expresses surprise and is followed by a greeting. The reason for surprise in greetings, acceptance of offers and apologies is politeness. The speaker feigns pleasant surprise at meeting someone (cf. Carlson, 1984: 92):
    (11) ‘Don’t you ever stop reading?’ he snapped at her.
         ‘Oh, hello daddy,’ she said pleasantly.
         ‘Did you have a good day?’
         (Roald Dahl)

         ‘Måste du alltid sitta med näsan över en bok?’ fräste han åt henne.
         ‘Nej men hej, pappa’, sa hon vänligt.
         ‘Hur har du haft det idag?’
Surprise or degrees of surprise (disappointment, pity) are clearly present in some translations (ah, ack, etc.). Surprise is also found in some conventional politeness phrases such as greetings. In many of its functions, however, the meaning of surprise has disappeared, as shown by the translations into Swedish where the effect of interjection has disappeared. In the framework we have suggested, we can explain its other functions as options following from the speaker’s choice of a rhetorical strategy for adopting different heteroglossic stances.
Table 6.4 Translations of English oh into German

German translations                     Frequency
ach (ah)                                62
oh                                      50
omission                                23
so                                      4
ach so                                  2
ach ja                                  1
na schön                                1
tja (also) ‘oh now let me see’          1
also                                    1
oje                                     1
och                                     1
ja                                      1
tatsächlich                             1
aber ja                                 1
schon gut                               1
o ja                                    1
o gott                                  1
ächz                                    2
o                                       41
Total                                   197
Table 6.5 A comparison of English, German and Swedish (the English original contains ‘oh’)

(a) Receiving new and unexpected information
    English: ‘Everybody knows the Leary men are difficult to live with.’ ‘Oh,’ Macon said.
    German:  ‘Alle wissen, daß man mit den Leary-Männern kein leichtes Leben hat.’ ‘Oh.’
    Swedish: ‘Alla människor vet ju att pojkarna Leary är svåra att bo ihop med.’ ‘Jaså’, sade Macon.

(b) Drawing a conclusion (recognising a fact), starting a new turn
    English: ‘I leave for England tomorrow afternoon,’ he said. ‘Oh, is it time for England again?’
    German:  ‘Ich fliege morgen nachmittag nach England’, sagte er. ‘So, ist schon wieder einmal England fällig?’
    Swedish: ‘Jag far till England i morgon eftermiddag’, sade han. ‘Jaså, är det dags för England nu igen?’

(c) Suggesting reservation or modification before a new departure in the conversation
    English: ‘Well, hi there!’ she said brightly. ‘How was your trip?’ ‘Oh, it was . . . where’s Edward?’
    German:  ‘Ja, halli-hallo!’ grüßte sie strahlend. ‘Wie war die Reise?’ ‘Ach, es war . . . Wo ist Edward?’
    Swedish: ‘Hej, hej!’ sade hon muntert. ‘Hur har resan varit?’ ‘Jo-o, den var . . . Var finns Edward?’

(d) Exclamation (intensification)
    English: Oh Cape Town! / Oh, Andrew!
    German:  Ach, Kapstadt, Kapstadt! / Ach, Andrew!
    Swedish: Å, Kapstaden! / Åh, Andrew!
Comparing Translations

A method that is of interest both to Translation Studies and contrastive work is the comparison of translations from one source language into several languages (cf. Viberg, 2002). By comparing translations we can get further support for contextual relations between form and function as well as evidence for their strength across languages. The study of the translations of oh into other languages provides a mirror image of the meanings of the particle. Translations can differ with regard to the ‘granularity’ or fine-grainedness of the distinctions made in the other language (cf. Dyvik, 1999: 218). In Aijmer and Simon-Vandenbergen (2003), the method of comparing translations was used in order to provide a fuller picture of the meaning and function of well by studying its translation correspondences in more than one language. This method can also give a better picture of which functions are general (and potentially universal) and which are language-specific.

The Oslo Multilingual Corpus provides us with examples where comparisons can be made between oh and its correspondences in German (as well as other languages) (Johansson, 1998: 9–10) (cf. Table 6.4). German has 17 equivalents, most of which are interjections (cf. a similar investigation by Fischer, 2000: 207). As in the ESPC, omission was frequent. A comparison of English, German and Swedish reveals some noteworthy differences between the languages (cf. Table 6.5). In German, oh is related to both ach and oh (cf. Fischer, 2000: 147ff). Swedish differs from both English and German because there is a larger number of different items in translation and because the equivalents are also response particles used to register new information, draw a conclusion or mark a new departure. In German, as in English, oh is used to register new information.
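The comparison of correspondences across target languages can be pictured with a small tally. The following is only a minimal sketch, not the procedure used in the study: the record layout and the function labels are assumptions made for illustration, the five records are the examples of Table 6.5 reduced to the item corresponding to oh, and Jo-o is normalised to jo by hand.

```python
from collections import Counter, defaultdict

# Hypothetical aligned records: (function of English "oh", German item, Swedish item).
# The rows follow the examples in Table 6.5; the short labels are illustrative only.
ALIGNED = [
    ("new information", "oh", "jaså"),
    ("conclusion", "so", "jaså"),
    ("new departure", "ach", "jo"),   # "Jo-o" normalised to "jo" (an assumption)
    ("exclamation", "ach", "å"),
    ("exclamation", "ach", "åh"),
]

def correspondences(records):
    """Count, for each target language, how often each item corresponds to 'oh'."""
    table = defaultdict(Counter)
    for _function, german, swedish in records:
        table["German"][german] += 1
        table["Swedish"][swedish] += 1
    return table

def distinct_items(table):
    """Number of different correspondences per language (a rough view of granularity)."""
    return {language: len(counts) for language, counts in table.items()}

table = correspondences(ALIGNED)
granularity = distinct_items(table)
print(granularity)
```

With so few records the figures say nothing about the languages as such; the point is only that the same tally, run over a full aligned corpus, yields frequency lists of the kind shown in Table 6.4 and makes the number of distinct items per language directly comparable.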
Conclusion

According to Kay (1996: 111), ‘translation will always remain an art’ and ‘there is much to be learned about linguistic relativity from professional linguists’. From this perspective, translators can be treated as linguistic informants. Translations, rightly used, are a resource to be tapped both for semantic and cross-linguistic lexical studies. However, we need to treat translations with caution. Translations may be influenced by many factors such as negative transfer from the source language. From a translation perspective, discourse particles offer particular problems as they do not belong to the vocabulary that we use to orient ourselves among things, events, processes and actions in the world. Moreover, while all languages seem to have interjections, the category of discourse particle or modal particle is less likely to be a universal category (cf. Fillmore, 1984; Hakulinen, 1986: 3 and references therein).
When I studied instances of oh in the corpus, the translations gave evidence of much lexical incongruity. Corresponding to oh in English, there appeared response particles, conjunctions, adverbs and interjections, reflecting the multifunctionality of oh. Analysing the translations in more detail, I found it necessary to pay attention to the dialogic context and how speakers position themselves in relation to the discourse, reality and the hearer. Apparently oh can point both forwards and backwards in the discourse as indicated by its translations. Jaså expresses the speaker’s recognition of a fact, and oj då or nej men expresses the speaker’s surprised reaction at a fact. The basic function of jaha is to register new information without necessarily accepting it as a fact. As oh is highly context-dependent, it can also point forwards, introducing a new departure in the discourse, for example, a new topic or a new stage in the conversation. Jo is clearly a marker of new departure. When it has this function it cannot be replaced by ja or by another response particle.

The description of discourse particles should not be restricted to a single lexical item. When we look at the back-translations of ja (and related particles), both well and oh seem to be used under similar pragmatic conditions, as shown by the fact that they share some translations into Swedish. Our investigation has also shown that English conversation welcomes oh to indicate that new information has been registered, recognising something as a fact or as in need of explanation, or as a polite transition to an elaboration or a new stage in the discourse. In Swedish it is less important to mark these functions. In many cases the appropriate correspondence to oh in the original text seems to be omission.

To conclude, the major typological difference between Swedish and English seems to be that Swedish uses the response particles ja and jo where English uses oh. The lexical item åh in Swedish is purely interjectional or has intensifying meaning, that is, it functions as a modal particle.

Notes
1. I am grateful to Göran Kjellmer for comments on an earlier version of this chapter.
2. Brinton and Andersen both prefer ‘pragmatic markers’ to ‘discourse particles’. On the problems of defining discourse particles and the lack of agreement about terminology, cf. for example, Jucker & Ziv (1998: 1f).
References

Aijmer, K. and Simon-Vandenbergen, A.-M. (2003) The discourse particle well and its equivalents in Swedish and Dutch. Linguistics 6, 1123–1161.
Altenberg, B. (1999) Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In H. Hasselgård and S. Oksefjell (eds) Out of Corpora. Studies in Honour of Stig Johansson (pp. 249–268). Amsterdam & Atlanta, GA: Rodopi.
Altenberg, B. and Aijmer, K. (2000) The English–Swedish Parallel Corpus: A resource for contrastive research and translation studies. In C. Mair and M. Hundt (eds) Corpus Linguistics and Linguistic Theory. Papers from the Twentieth International Conference on English Language Research on Computerized Corpora (ICAME 20), Freiburg im Breisgau, 1999 (pp. 15–33). Amsterdam & Atlanta, GA: Rodopi.
Altenberg, B. and Granger, S. (2002) Recent trends in cross-linguistic lexical studies. In B. Altenberg and S. Granger (eds) Lexis in Contrast. Corpus-based Approaches (pp. 3–48). Amsterdam & Philadelphia: Benjamins.
Andersen, G. (2001) Pragmatic Markers and Sociolinguistic Variation. A Relevance-Theoretic Approach to the Language of Adolescents. Amsterdam/Philadelphia: John Benjamins.
Andersson, L.-G. (1984) Stolpe in stolpe ut. In Svenskans beskrivning 14. Lund 1983 (pp. 7–18).
Bazzanella, C. and Morra, L. (2000) Discourse markers and the indeterminacy of translation. In I. Korzen and C. Marello (eds) Argomenti per una linguistica della traduzione. On linguistic aspects of translation, Notes pour une linguistique de la traduction (pp. 149–157). Alessandria: Edizioni dell’Orso.
Bolinger, D. (1989) Intonation and Its Uses. Melody in Grammar and Discourse. London: Edward Arnold.
Brinton, L.J. (1996) Pragmatic Markers in English: Grammaticalization and Discourse Functions. Berlin: Mouton de Gruyter.
Bruti, S. (1999) In fact and infatti: the same, similar or different. Pragmatics 9 (4), 519–533.
Carlson, L. (1984) ‘Well’ in Dialogue Games: A Discourse Analysis of the Interjection ‘Well’ in Idealized Conversation. Amsterdam/Philadelphia: John Benjamins.
Dickens, A. and Salkie, R. (1996) Comparing bilingual dictionaries with a parallel corpus. Euralex’96 Proceedings III. Papers Submitted to the Seventh EURALEX International Congress on Lexicography in Göteborg, Sweden (pp. 551–559). Göteborgs universitet: Institutionen för svenska språket.
Dyvik, H. (1998) A translational basis for semantics. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-Linguistic Research. Theory, Method and Case Studies (pp. 51–86). Amsterdam/Atlanta, GA: Rodopi.
Dyvik, H. (1999) On the complexity of translation. In H. Hasselgård and S. Oksefjell (eds) Out of Corpora. Studies in Honour of Stig Johansson (pp. 215–230). Amsterdam/Atlanta, GA: Rodopi.
Ebeling, J. (1998) The Translation Corpus Aligner: A browser for parallel texts. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-Linguistic Research. Theory, Method and Case Studies (pp. 101–112). Amsterdam/Atlanta, GA: Rodopi.
Ebeling, J. (1999) Presentative Constructions in English and Norwegian. A Corpus-based Contrastive Study. Department of British and American Studies. University of Oslo. University of Oslo Press.
Fabricius-Hansen, C. (1999) Bei dieser Gelegenheit – on this occasion – ved denne anledningen. German bei: A puzzle in a translational perspective. In H. Hasselgård and S. Oksefjell (eds) Out of Corpora. Studies in Honour of Stig Johansson (pp. 231–248). Amsterdam/Atlanta, GA: Rodopi.
Fillmore, Ch.J. (1984) Remarks on contrastive pragmatics. In J. Fisiak (ed.) Contrastive Linguistics. Prospects and Problems (pp. 119–141). Berlin & New York: Mouton Publishers.
Fischer, K. (2000) From Cognitive Semantics to Lexical Pragmatics. The Functional Polysemy of Discourse Particles. Berlin & New York: Mouton de Gruyter.
Fleischman, S. and Yaguello, M. (2004) Discourse markers across languages? Evidence from English and French. In C.L. Moder and A. Martinovic-Zic (eds) Discourse across Languages and Cultures (pp. 129–147). Amsterdam: Benjamins.
Gellerstam, M. (1986) Translationese in Swedish novels translated from English. In L. Wollin and H. Lindquist (eds) Translation Studies in Scandinavia. Proceedings from the Scandinavian Symposium on Translation Theory (SSOTT) (pp. 88–95). Lund Studies in English 75.
Gumperz, J.J. and Levinson, S.C. (1996) Introduction: Linguistic relativity revisited. In J.J. Gumperz and S.C. Levinson (eds) Rethinking Linguistic Relativity (pp. 1–18). Cambridge: Cambridge University Press.
Hakulinen, A. (1986) Particles and constituency. Panel on Linguistic Resources and Interactional Practices. 8th International Pragmatics Conference. Mexico City, 6 July 1986 (unpublished pre-conference paper).
Hansen Mosegaard, M.-B. (1998) The Function of Discourse Particles. A Study with Special Reference to Spoken Standard French. Amsterdam/Philadelphia: John Benjamins.
Heritage, J. (1984) A change-of-state token and aspects of its sequential placement. In J.M. Atkinson and J. Heritage (eds) Structures of Social Action. Studies in Conversation Analysis (pp. 299–346). Cambridge: Cambridge University Press.
James, D. (1978) The use of oh, ah, say, and well in relation to a number of grammatical phenomena. Papers in Linguistics 11 (3–4), 517–535.
Johansson, S. (1998) On the role of corpora in cross-linguistic research. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-Linguistic Research. Theory, Method and Case Studies (pp. 3–24). Amsterdam/Atlanta, GA: Rodopi.
Jucker, A.H. and Ziv, Y. (1998) Discourse markers: Introduction. In A.H. Jucker and Y. Ziv (eds) Discourse Markers. Description and Theory (pp. 1–12). Amsterdam: John Benjamins.
Kay, P. (1996) Intra-speaker relativity. In J.J. Gumperz and S.C. Levinson (eds) Rethinking Linguistic Relativity (pp. 96–114). Cambridge: Cambridge University Press.
Lenk, U. (1998) Marking Discourse Coherence. Functions of Discourse Markers in Spoken English. Tübingen: Gunter Narr Verlag.
Noël, D. (2003) Translations as evidence of semantics. An illustration. Linguistics 41 (4), 757–785.
Östman, J.-O. (1982) The symbiotic relationship between pragmatic particles and impromptu speech. In N.E. Enkvist (ed.) Impromptu Speech: A Symposium (pp. 147–177). Åbo: The Research Institute of the Åbo Akademi Foundation.
Östman, J.-O. (1995) Pragmatic particles twenty years after. In B. Wårvik, S.-K. Tanskanen and R. Hiltunen (eds) Organization in Discourse. Proceedings from the Turku Conference (pp. 96–108). University of Turku: Turku, Finland.
Pawley, A. and Syder, F.H. (1983) Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J.C. Richards and R.W. Schmidt (eds) Language and Communication (pp. 191–226). London: Longman.
Schiffrin, D. (1987) Discourse Markers. Cambridge: Cambridge University Press.
Schourup, L. (2001) Rethinking well. Journal of Pragmatics 33, 1026–1060.
Svartvik, J. (1990) The TESS project. In J. Svartvik (ed.) The London-Lund Corpus of Spoken English: Description and Research (pp. 63–86). Lund: Lund University Press.
Teleman, U., Hellberg, S. and Andersson, E. (1999) Svenska Akademiens Grammatik. Stockholm: Norstedts.
Teubert, W. (2002) The role of parallel corpora in translation and multilingual lexicography. In B. Altenberg and S. Granger (eds) Lexis in Contrast. Corpus-based Approaches (pp. 189–214). Amsterdam/Philadelphia: Benjamins.
Traugott, E.C. (1995) Subjectification in grammaticalization. In D. Stein and S. Wright (eds) Subjectivity and Subjectivisation in Language (pp. 31–54). Cambridge: Cambridge University Press.
Van Baar, T. (1996) Particles. In B. Devriendt, L. Goossens and J. van der Auwera (eds) Complex Structures: A Functionalist Perspective (pp. 259–301). Berlin: Mouton de Gruyter.
Viberg, Å. (2002) Polysemy and disambiguation cues across languages: The case of Swedish få and English get. In B. Altenberg and S. Granger (eds) Lexis in Contrast. Corpus-based Approaches (pp. 119–150). Amsterdam/Philadelphia: Benjamins.
White, P. (1999) A quick tour through appraisal theory. Background paper for Appraisal workshop, University of Ghent, March 1999.
White, P. (2000) Dialogue and inter-subjectivity: Reinterpreting the semantics of modality and hedging. In M. Coulthard, J. Cotterill and F. Rock (eds) Working with Dialogue (pp. 67–80). München: Niemeyer.
Wierzbicka, A. (1976) Particles and linguistic relativity. International Review of Slavic Linguistics I (2–3), 327–367.

Primary sources

Brink, André (1984) The Wall of the Plague. London: Faber & Faber. Translated as Pestmuren, 1984. Stockholm: Forum (Nils Olof Lindgren, trans.).
Dahl, Roald (1988) Matilda. London: Puffin Books. Translated as Matilda, 1989. Stockholm: Tidens förlag (Meta Ottosson, trans.).
Davies, Robertson (1985) What’s Bred in the Bone. Harmondsworth: Elizabeth Sifton Books. Viking. Translated as I köttet buret, 1990. Stockholm: Wahlström & Widstrand (Rose-Marie Nielsen, trans.).
Lodge, David (1988) Nice Work. London: Secker & Warburg. Translated as Snyggt jobbat, 1990. Stockholm: Trevi (Sonja Bergvall, trans.).
Smiley, Jane (1991) A Thousand Acres. London: Flamingo Harper Collins. Translated as Tusen tunnland, 1998. Stockholm: Norstedts (Ylva Stålmark, trans.).
Tyler, Anne (1985) The Accidental Tourist. New York: Alfred A. Knopf. Translated as Den tillfällige turisten, 1986. Stockholm: Trevi (Sonja Bergvall, trans.).
Walters, Minette (1994) The Scold’s Bridle. London: Pan Books/Macmillan General Books. Translated as Blomsterkronan, 1996. Stockholm: Bonnier (Elizabeth Holms and Manni Kössler, trans.).
Weldon, Fay (1987) The Heart of the Country. London: Hutchinson. Translated as Landets hjärta, 1988. Höganäs: Bra Böcker (Rose-Marie Nielsen, trans.).
Chapter 7

The Translator and Polish–English Corpora

TADEUSZ PIOTROWSKI
Introduction

This chapter will discuss corpora, and associated tools, that would be of interest to a translator working with Polish and English. The focus will be on Polish, as information about English is more easily available. The account is based on two papers by Piotrowski (2003a; 2005), which also contain relevant references in Polish, which will not be repeated here.

On the whole, neither corpora nor corpus tools are very well developed for Polish, a situation all the more evident when compared with what is available for Czech, a closely related language. Until recently Polish linguists in general have not been interested in corpus linguistics; instead interest in the subject has more commonly been found among information technology specialists, both academics and commercial companies, working in the field. Some of the largest companies, such as Microsoft and Xerox, have been developing or adjusting their tools to accommodate Polish. However, the results have not been made available for commercial use.

More recently the situation has changed. Linguists are now generally beginning to express their interest in corpus linguistics. It seems that, having seen what can be done with corpora in other languages, they are becoming more enthusiastic about text collection. In addition, Poland is now in the European Union, and constitutes a potentially more lucrative market for translation tools: more translation will need to be done, and differences in remuneration between older and more recent members will gradually disappear. In short, translators will have more work and they will be better paid, and therefore they will be able to afford more sophisticated tools. At present I can only conjecture, as the situation is in flux and it is quite likely that in a year or two there will be far more resources available, either from large multinational companies or from small, fast-growing firms in Poland. However, it is fair to say that the use of corpora in Poland, and with it the availability of corpora and tools, is now in its infancy.
A Corpus: Needs of the User

Without applications that allow the user to search and query, a corpus is of limited use. While until fairly recently texts available for use as a corpus for translation could only be found with some difficulty, at the present moment the Internet holds far larger quantities of text than any static corpus. Most of these texts can be downloaded, and anyone interested can build their own corpus, to suit the particular needs of the task in hand. Therefore what is becoming increasingly important at present is applications, rather than large static corpora, in particular those that can be used to search the dynamic contents of the Internet. This does not mean that I am denying the usefulness of large static corpora, in particular when they are supposed to be representative of some categories of text, but I would say that their chief function now is probably to serve as a reference, as a yardstick, for comparison with dynamic corpora, in particular the Internet. As I see it, translators need corpora to:

• verify whether a given form is used in the language into which the translation is being done,
• verify that the form is used in certain context(s), linguistic or social, and
• check in what collocations the given form is used, and in what contexts.
These points are not dissimilar to the uses that linguists make of corpora. The needs of translators, however, are not the same as those of linguists, in particular corpus linguists. Translators are usually hard pressed for time: they do not have the time to analyse in great detail the huge amounts of data that an analysis of corpora yields. They look for quick solutions, and in order to work efficiently with texts they need efficient applications.
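The three needs listed above correspond to three very simple operations on a tokenised text. The sketch below is only an illustration under stated assumptions: the three Polish sentences are invented sample data, the function names are hypothetical, and the window-based collocate count is the crudest possible stand-in for what a real concordancer offers.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())

def form_is_attested(tokens, form):
    """Need 1: is the given form used at all?"""
    return form.lower() in tokens

def kwic(tokens, form, width=2):
    """Need 2: show each occurrence of the form with `width` words of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token == form:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

def collocates(tokens, form, window=1):
    """Need 3: count the words found within `window` positions of the form."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == form:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

# Invented sample text, for illustration only.
SAMPLE = "Biegł bez tchu. Zaparło mu dech. Złapał dech i biegł dalej."
tokens = tokenize(SAMPLE)

print(form_is_attested(tokens, "tchu"))
for line in kwic(tokens, "dech"):
    print(line)
print(collocates(tokens, "dech").most_common(3))
```

The context windows here ignore sentence boundaries, which a usable tool would have to respect; and because the functions match exact word-forms, a query for dech finds nothing for tchu, which is precisely the limitation that matters for an inflecting language.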
Polish and Corpus Tools

Corpus use is, however, on a different level of efficiency when English and Polish are compared. When working with English, it is often sufficient to work with types (groups of identical tokens), and there are linguists, including John Sinclair (1991), who suggest that this is a sufficient level of abstraction and that valuable insights may be lost when more abstract units are used for corpus analysis. This type of approach is not efficient enough for the translator into Polish. To appreciate this fact fully it is necessary to discuss the linguistic features of Polish.

Together with other Slavonic languages (with the exception of Bulgarian), Polish is a language with a rich and complex inflectional system; inflection chiefly serves to indicate both grammatical categories and syntactic functions. A verb can have more than 40 different word-forms, while a noun can have 14. Some of those forms are more frequent, others are less often found, but even so, this means that the most efficient method for a translator is to work with lemmas, which could be defined as clusters of word-forms, understood as morphological units, with part-of-speech homonymy disambiguated. If the software does not make it possible for the translator to work with lemmas, in a large number of cases it would be sufficient to use wildcards in a search, as in Polish it is predominantly suffixes that are used in inflection. However, it must be remembered that aspect, a morphological feature of most verbs, is often indicated by a prefix, which makes search by use of wildcards problematic, because too many results are returned. Moreover, a wildcard search in Polish will frequently not work satisfactorily, as there are extensive changes within the word, even in the stem itself.

Table 7.1 shows the actual forms of two words and their frequencies on the Internet (as found by Google): the noun dech ‘breath’ and the verb iść ‘go, walk’ (retrieved 15 January 2005 with Lexware Concordance Culler, which will be described below). The forms are listed in the order of their frequency. However, it should be borne in mind that the form dech is ambiguous: it can be either the nominative singular of the noun dech ‘breath’, with the genitive tchu, or the genitive plural of the noun decha ‘plank’, which indicates another area of difficulty. For the verb, the third person singular in the present tense is the most frequent form; however, some of these verb forms are ambiguous: idą, idę may be, although less frequently, forms of the proper name Ida. These ambiguities are not resolved in the list below. The name of the lemma is given in capital letters. Figures for these forms will be provided in further discussions of the corpora.

There are nine nominal and forty verbal forms, and at times, in the verbal forms, only a word-initial consonant cluster recurs across the forms, which means that a search for such forms by means of wildcards is not practicable, because the user would have to analyse the particular occurrences.
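The difference between a wildcard search and a lemma search can be made concrete with a few of the forms and counts from Table 7.1. This is a minimal sketch: the lemma table `LEMMAS` is a hand-made, hypothetical stand-in for the output of a morphological analyser, and the shell-style patterns of Python's `fnmatch` module merely approximate the wildcards of a real search tool.

```python
from fnmatch import fnmatchcase

# A subset of the word-forms and Google counts given in Table 7.1.
FORM_COUNTS = {
    "dech": 69_900, "tchu": 52_700, "tchem": 49_700,
    "idzie": 634_000, "iść": 403_000, "idziemy": 188_000, "szli": 77_400,
}

# Hypothetical hand-made lemma table: lemma -> its word-forms.
LEMMAS = {
    "DECH": ["dech", "tchu", "tchem"],
    "IŚĆ": ["idzie", "iść", "idziemy", "szli"],
}

def wildcard_search(pattern, forms):
    """Return the forms that match a shell-style wildcard pattern."""
    return sorted(form for form in forms if fnmatchcase(form, pattern))

def lemma_search(lemma, forms):
    """Return the attested forms that belong to the given lemma."""
    return sorted(form for form in LEMMAS[lemma] if form in forms)

def lemma_frequency(lemma, counts):
    """Sum the counts of all forms grouped under the lemma."""
    return sum(counts.get(form, 0) for form in LEMMAS[lemma])

print(wildcard_search("tch*", FORM_COUNTS))   # misses the form "dech"
print(wildcard_search("id*", FORM_COUNTS))    # misses "iść" and "szli"
print(lemma_search("DECH", FORM_COUNTS))
print(lemma_frequency("DECH", FORM_COUNTS))
```

No single pattern covers both dech and tchu, or both idzie and szli, without also admitting unrelated words, whereas the lemma table retrieves them together. The cost is that somebody has to supply the table, and homonymous forms such as dech (from decha) remain unresolved.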
Table 7.1 Two Polish words: word-forms and their statistics

DECH
dech           69,900
tchu           52,700
tchem          49,700
tchy           167
tchowi         56
tchów          35
tchami         19
tchom          13
tchach         7

IŚĆ
idzie          634,000
iść            403,000
idź            356,000
idą            243,000
idę            231,000
idziemy        188,000
idąc           156,000
szedł          114,000
szło           102,000
szła           92,300
idziesz        91,400
szli           77,400
szły           51,800
idźcie         46,800
szłam          40,200
szliśmy        34,200
idący          30,300
szedłem        29,300
idziecie       16,800
idźmy          12,500
szłyśmy        6,870
szłaś          1,890
szedłbym       994
szłoby         872
szedłeś        813
szedłby        728
szliście       492
szłaby         445
szedłbyś       361
szłyby         208
szłabym        186
szliby         131
szłom          66
szlibyśmy      64
szłyście       18
szłabyś        8
szłoś          6
szlibyście     4
szłybyśmy      2
szłybyście     2

Corpus Tools and Static Corpora of Polish

Apart from Lexware Culler, which is available as a web interface, there are other applications at present that allow the user to look for lemmas. However, none of these make it possible to resolve ambiguities, as they would have to incorporate semantic or contextual disambiguation. The chief problem with these applications is that they are at present available, or can be used, only as packages with specific corpora. They cannot be used to analyse any text, any corpus, which is what a translator would need. These applications will be examined in the discussion of the corpora below. The popular text-analysis package, WordSmith, can be adequately used for Polish text. It supports Polish diacritics, and it can also lemmatise its wordlists, using an external file that identifies which orthographic words are to be grouped together in a lemma. Unfortunately, there is no resource that would make this possible for Polish.

In what follows I will describe three static corpora, which are widely available. Although there are some additional ones, these are small, not available or not properly documented.
Łódź Corpus of Polish

There is no Polish National Corpus comparable to the British one, or to the Czech National Corpus (http://ucnk.ff.cuni.cz/english/index.html), which has the same structure as the British National Corpus. Several years ago, a few researchers at the English Department, Lodz University, announced that they would produce a Polish National Corpus:

The Polish National Corpus is designed to reflect the need to create a large referential corpus of Polish for research and other linguistics applications. The corpus is developed at the Department of English Language at Lodz University in co-operation with the Department of Linguistics and Modern English Language at Lancaster University, co-operators in the creation of the famous British National Corpus. The Polish and English Language Corpora for Research and Applications (PELCRA) aims to develop a fully annotated corpus of native Polish, mirroring the BNC in terms of genres and its coverage of written and spoken language. The corpus is intended to display all features of a fully professional corpus, including part of speech tagging and TEI compliant annotation. The data collection began over 2 years ago and now includes over 130,000,000 words of running text. Part of it (at present 30 million words) has already been compiled into a balanced corpus and comprises genres and styles comparable in proportions to those included in the BNC. In the near future, access to the PNC sampler should be available via the PELCRA website (http://www.uni.lodz.pl/pelcra/corpora.htm).¹

This announcement dates back to 2001 and was repeated in some papers, e.g. Lewandowska-Tomaszczyk (2004) (which contains further references). The page has recently been deleted, but another site, http://korpus.ia.uni.lodz.pl/, under construction, does provide access to the corpus and gives some statistics, which are reproduced in Table 7.2. The site allows access to at least two types of corpus, main and spoken.
The main corpus is a collection of printed texts and the spoken corpus is a collection of spoken texts.² To make a comparison with the data found on the Internet (cf. Table 7.1), I will also show the frequency of four word-forms, dech, tchu, iść and idzie, in the main corpus (Table 7.3).

Table 7.2 Łódź Corpus of Polish: statistics (2005)

Main corpus
Number of tokens            66,303,775
Number of types             395,627
Number of texts             80,509
Number of authors           13,410

Spoken corpus
Number of words             667,776
Number of unique words      44,825
Number of texts             161
Earliest text               2000
Latest text                 2003

Table 7.3 Łódź Corpus: statistics of two words (2005)

Łódź Corpus
Type        Tokens
dech        490
tchu        275
iść         no statistics
idzie       5613

The search tool at the page seems very promising. It has a very clear interface (both in English and in Polish), and appears to allow various types of search: to produce a concordance of a word, a phrase or a collocation, and to search the sources. There is no explanation, however, of the available functions. At this point in time it is possible to produce a concordance of a keyword or a phrase and to see the resulting frequency word list (Table 7.4). Judging from the contents of the list, the texts were morphologically tagged, although tagging applied only to the part of speech, not to the relevant word-form (cf. V for verb). What the list includes is types, not lemmas. Again, even this short extract shows how widespread interparadigm homonymy is.

Table 7.4 Łódź Corpus: word frequency list (2005)

Main corpus
Word-form   Frequency     %        Part of speech
w           1,679,531     2.61%    Prep
i           1,164,655     1.81%    Conj
się         967,245       1.50%    Pron
na          928,017       1.44%    Prep
z           924,467       1.44%    Prep
nie         916,032       1.42%    other, Pron
do          616,519       0.96%    other
że          562,970       0.87%    Conj, other
to          501,722       0.78%    Pron
jest        376,607       0.58%    V
o           357,307       0.55%    Exclam, Prep
ale         251,941       0.39%    Conj, other
po          245,553       0.38%    Prep
jak         216,315       0.34%    NoC, Pron

After looking carefully at some of the results of concordancing, however, some reservations may be expressed. First, even though the corpus is said to be balanced, what is found at the site is certainly not representative: most of the texts come from very specific sources. Thus, in the newspaper section I did not succeed in finding any newspaper other than the Gazeta Wyborcza: this subcorpus seems to contain only texts from this particular paper, which is known to follow its style guide, with a style that is generally semi-formal. Secondly, in the books subcorpus there are predominantly scholarly publications and I failed to find any examples of recent fiction; on the other hand, there were novels and similar texts from the 19th century (Faraon by Prus, published in 1897), and, in addition, some poems, including the ‘national’ poem, Pan Tadeusz by Adam Mickiewicz (1834), as well as others from the early 20th century such as Chłopi by Reymont. Thus, for a translator, the corpus is too limited in its coverage and variety, even though it is quite large. Furthermore, its claims do not seem to be fully substantiated, and the documentation is not altogether adequate.
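The token and type counts and the frequency word list reported for this corpus are, in principle, straightforward to compute for any text a translator has collected. The sketch below is an illustration only: the sample sentence is invented, the tokenisation is deliberately naive, and nothing is assumed about how the corpus site itself produces its figures.

```python
import re
from collections import Counter

def frequency_list(text, top=3):
    """Return token and type counts plus the most frequent word-forms.

    Each row is (word-form, frequency, percentage of all tokens).
    Forms are counted as orthographic types, so homonymous forms that
    belong to different lemmas are merged into one row.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    rows = [
        (form, n, round(100 * n / total, 2))
        for form, n in counts.most_common(top)
    ]
    return {"tokens": total, "types": len(counts), "rows": rows}

# Invented sample sentence, for illustration only.
SAMPLE = "Idzie w las i idzie w pole, i nie wraca."
stats = frequency_list(SAMPLE)

print(stats["tokens"], stats["types"])
for form, n, share in stats["rows"]:
    print(f"{form:10}{n:5}{share:8.2f}%")
```

A list built this way shares the limitation noted for Table 7.4: it ranks types, not lemmas, so the forms of one inflected word are scattered across many rows.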
IPIPAN Corpus of Polish

The next corpus that I would like to discuss is that prepared by the Institute of Computer Science (IPI), Polish Academy of Sciences (PAN). The name of the corpus is derived from the Polish acronym of the organisation: the IPIPAN corpus. It was prepared between April 2001 and March 2004. The main task of the project was to produce tools for working with a corpus, specifically a morphological analyser and a query tool. In addition, the researchers compiled a large corpus, which was treated as the testing material. The targets were achieved, and the project itself has excellent documentation, both in printed form (Przepiórkowski, 2004) and in electronic form (http://korpus.pl/). The corpus and related resources are freely available to download. There are two downloadable versions of the corpus, the main one and a sample one, the latter of which can also be accessed through an Internet interface. Figures relating to both corpora are given in Table 7.5 and Table 7.6. The corpus has been morphologically annotated with Morfeusz, a morphological analyser, which makes it possible to search for lemmas or individual word-forms. The texts in the corpus can be searched only by means of the POLyinterpretation Indexing Query and Retrieval Processor (Poliqarp), a search and concordance engine written in Java, which can therefore be used in Unix, Linux and Windows environments. It was designed to function as a universal search engine and concordancer, and in principle it can be used with other marked-up corpora, including corpora of languages other than Polish. Unfortunately, the authors of the corpus have ‘to be contacted’ in order to make Poliqarp work with other corpora, making its use potential rather than actual, and certainly of little help to a practising translator, who has to work with ad hoc compiled corpora.

Table 7.5 IPIPAN Corpus: statistics

                    Sample        Main
Number of tokens    15,252,012    70,368,788
Number of types     498,438       762,169
Number of lemmas    216,983       364,366

Table 7.6 IPIPAN Corpus: statistics of two words (tokens)

Segment    Sample corpus    Main corpus
DECH       194              465
dech       56               194
tchu       91               212
IŚĆ        5102             17,502
iść        764              5102
idzie      972              4593

The application is quite sophisticated; it allows searches for segments at various levels: a word, an inflectional word-form or a lemma, in a defined chunk of the text, such as the sentence or
the paragraph. The default mode of the search is for a word-form; to search for lemmas, special syntactic constructions and regular expressions have to be used. Regular expressions make it possible to find collocations and other recurrent word combinations, and this is a powerful feature, given that it can handle Polish inflection. One can also search on attributes of the text, using the information in the description of the given text segment, called metadata; only the name of the author or the title of the text is used, and there is no need to supply sociolinguistic data such as the age, sex and profession of the author. The major drawback of the search engine from the point of view of the translator is that the so-called query syntax (that is, the way one queries the corpus) is complex. It is reminiscent of Unix regular-expression syntax, with numerous symbols and conventions to remember. Unless the software is used on a regular basis, it is far from user-friendly. When searching the main corpus the application is also very slow and its drain on resources is enormous: the application itself uses about 300 MB of memory, and on a reasonably state-of-the-art PC running Windows XP it slows down other applications considerably. This makes it inconvenient for the translator, who usually has to have several applications open when working, including a word processor, an Internet browser and a couple of dictionaries. These demands are noticeably less stringent when the sample rather than the main corpus is used. In short, the application is meant to be used by a dedicated researcher, rather than by an advanced user like a translator. The corpus is just an addition, as it were, to the software, containing extensive amounts of data to be tested when experimenting with the morphological analyser and the query application, circumstances that explain its shortcomings. It is a corpus with a wide scope.
The compilers have included any text that could be legally used. Concern over copyright issues also made them distribute the corpus in a binary format, not directly accessible to users, rather than a textual one. There are also some internal gaps in the documentation of the text: a number of texts are not described at all, there are no metadata present. The corpus is in XML format, conforms to the TEI guidelines and is marked up. Preparing the sample corpus, the compilers did make some effort to include a wide variety of texts that would represent different styles and genres, even though the corpus is not balanced in the traditional sense of the term. Though the corpus is predominantly synchronic, including texts written during the 15 preceding years, there is a category called ‘older prose’, which includes literary works of art in the so-called canon of Polish literature. These are texts that are widely read at school, heard on the stage and screened in film adaptations. They were written in the
late 19th and the early 20th century. The structure of the corpus is shown in Table 7.7.

Table 7.7 IPIPAN Corpus: structure

Genre                        Percentage
Contemporary prose           10
Older prose                  10
Science                      10
Newspapers                   50
Parliamentary proceedings    15
Law                          5

It is worth noting, however, that the sample IPIPAN is the only corpus of significant size that is available with a sophisticated search tool and can be used free of charge offline. As a result, its value for a translator should not be underestimated.
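The difference between a word-form search and a lemma search, described above for Poliqarp, can be illustrated with a minimal Python sketch. It does not use Poliqarp or its query syntax: the `Token` type, the toy token sequence and both search functions are hypothetical, and the lemma annotations are supplied by hand for the two lemmas that appear in Table 7.6.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    orth: str   # word-form as it appears in the text
    lemma: str  # base form, as a morphological analyser would assign it


# Hand-annotated toy sequence (an assumption for illustration, not corpus data).
TOKENS: List[Token] = [
    Token("Idzie", "iść"),
    Token("bez", "bez"),
    Token("tchu", "dech"),
    Token("i", "i"),
    Token("łapie", "łapać"),
    Token("dech", "dech"),
    Token("żeby", "żeby"),
    Token("iść", "iść"),
    Token("szli", "iść"),
]


def search_orth(tokens: List[Token], pattern: str) -> List[int]:
    """Return positions whose word-form fully matches the regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [i for i, t in enumerate(tokens) if rx.fullmatch(t.orth)]


def search_lemma(tokens: List[Token], pattern: str) -> List[int]:
    """Return positions whose lemma fully matches the regular expression."""
    rx = re.compile(pattern)
    return [i for i, t in enumerate(tokens) if rx.fullmatch(t.lemma)]


if __name__ == "__main__":
    print(search_orth(TOKENS, "idzie"))   # one word-form only
    print(search_lemma(TOKENS, "iść"))    # every inflected form of the lemma
    print(search_lemma(TOKENS, "dech"))   # finds both 'tchu' and 'dech'
```

A word-form query for idzie finds a single position, whereas a lemma query for iść also retrieves szli and the infinitive. This is why lemma counts in Table 7.6 are so much higher than the counts for any individual form.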
PWN Corpus of Polish

The third static corpus to be discussed here is the one prepared by a commercial publishing house, PWN (www.pwn.pl), which publishes academic textbooks and reference works, including Polish dictionaries. In fact, it is a market leader in this area. For the purpose of lexicographic analysis a corpus of Polish was prepared, which has about 100 million words, including a subcorpus of 65 million words: at the present time, the subcorpus is the only truly balanced, large corpus of Polish available. Most of the texts are not included in full; the corpus only contains substantial samples. The corpus is marked up, although the mark-up is not fully standard, as it does not conform to the TEI guidelines. The balanced subcorpus is not strictly speaking synchronic: its earliest texts date from 1918 (a symbolic date, when Poland regained her independence), but 66% are from the 1990s. The sources are books (65% of all tokens), newspapers and magazines (25%), speeches and non-official publications such as leaflets (10%). As for genres, fiction constitutes 25% (of all tokens), non-fiction 41%, newspapers 26%, spoken texts 6% and occasional publications 2%. The compilers of the corpus acknowledge that, in comparison to corpora of other languages, the PWN corpus has a higher percentage of fiction. The full corpus can be used, for research purposes, only at the premises of the publishers in Warsaw, and, as a result, is of very little use to a translator. There is also a scaled-down Internet version of the balanced corpus (http://korpus.pwn.pl/; see Note 4). There are actually two versions: a smaller corpus (7.5 million words), which can be used free of charge, and a larger corpus (40 million words), which can be used on payment of a fee. These corpora can be useful to a translator. Moreover, the smaller of the two corpora has been added, on a CD-ROM, to the luxury edition of the new dictionary of Polish, Uniwersalny słownik języka polskiego (Dubisz, 2003), and can be used offline. It cannot be bought separately. The search engine that is included on the CD-ROM is based on the Microsoft Internet browser, which has to be present in the system. The smaller corpus has two sections: there are 3,708,000 words from various genres, and there are 3,558,000 words from the daily Rzeczpospolita, the most serious newspaper (a broadsheet) in Poland. For copyright reasons, the two sections are kept apart; in the online version both can be accessed at the same time, while in the offline version there have to be two search routines. Together then there are 7,260,000 words. The larger corpus has 40 million words, with the texts taken from 356 books, 121 different press publications, 84 transcribed spoken texts, 46 websites and several hundred advertising leaflets and other ephemera. Unfortunately more detailed figures are not available, in particular the number of words in each genre. Therefore, only Table 7.8 can give some idea of the actual size of the two corpora. The search engine is quite advanced. It allows searches for lemmas (the default mode of search) and for particular word-forms, and it produces concordances. In the web version it is also possible to define a collocation that the engine should look for. These corpora should be very useful to the translator, and the fee for using the full online version is not high. In 2005 and in 2006 it stood at PLN 366 (approximately €90) per year.
However, the instructions on how to pay the fee do not seem to have been adjusted to accommodate a user from outside Poland.

Table 7.8 PWN Corpus: statistics of two words (tokens)

Segment    Small corpus    Full corpus
DECH       108             481
dech       32              113
tchu       23              154
IŚĆ        2794            12,641
iść        422             1789
idzie      646             2444
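Since Table 7.8 is offered only as an indirect indication of size, one way to read it is as a set of ratios between the two PWN corpora. The sketch below is an illustration under a strong assumption, namely that the frequency of a word grows roughly in proportion to corpus size; the function name `size_ratios` is hypothetical and the counts are copied from Table 7.8.

```python
from typing import Dict, Tuple

# Token counts from Table 7.8: segment -> (small corpus, full corpus)
PWN_COUNTS: Dict[str, Tuple[int, int]] = {
    "DECH": (108, 481),
    "dech": (32, 113),
    "tchu": (23, 154),
    "IŚĆ": (2794, 12641),
    "iść": (422, 1789),
    "idzie": (646, 2444),
}


def size_ratios(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Ratio of full-corpus count to small-corpus count for each segment."""
    return {seg: full / small for seg, (small, full) in counts.items()}


ratios = size_ratios(PWN_COUNTS)

if __name__ == "__main__":
    for seg, ratio in ratios.items():
        print(f"{seg:6s} full/small = {ratio:.2f}")
```

On these figures the two lemma-level ratios (DECH and IŚĆ) both come out close to 4.5, while the ratios for individual word-forms vary between roughly 3.5 and 6.7. The estimate is therefore rough, and more stable when based on lemmas than on single forms.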
The Web as a Corpus and the Polish Language

In the following section I will describe the use of the web as a corpus. The idea of using the web as a distributed corpus is most promising. It is a corpus that not only exceeds manually collected corpora in size (it is estimated that Google alone has indexed about 10-13 billion pages; the index of the Polish search engine www.szperacz.pl, with an alias www.szukacz.pl, was based on 26 million documents from 524,000 Polish sites on 22 January 2005), but is also dynamic, changing constantly. An example of how the Internet was used to establish the existence of new derivatives in Polish, based on a borrowing from English, sorry, and their distribution in various texts, can be found in Piotrowski (2003b). Alongside the advantages of the web corpus, there are obviously many disadvantages: first of all, the number of textual units on the Internet is unknown, which makes statistical estimates only relative; in addition, there is no control over the texts, which may disappear overnight. For any fast and efficient search, the staggering amount of text contained in Internet pages is also a huge problem. The tools that are most often used to find textual occurrences on the Internet are search engines such as Google, AltaVista and Yahoo, with Google usually given priority (an excellent ongoing discussion of Google in linguistic research can be found in Language Log, http://itre.cis.upenn.edu/~myl/languagelog/). However, even though search engines are very fast, most of them clearly work best with languages that do not use diacritics. In an extreme case, the meta search engine www.metacrawler.com actually strips Polish words of any diacritics, so that iść becomes isc; this can produce numerous homographs. The search engines show little awareness of the complexities of Polish grammar.
Recently, however, a search engine has appeared that does incorporate some morphological analysis, not surprisingly developed in Poland: www.szukacz.pl. Search engines in Poland that tried to be morphologically aware had been developed before; however, they ran into difficulties, as the search yielded too many results and was too slow, making the future of this feature less than certain. Szukacz, like other morphological analysers discussed here, returns lemmas, and it is also reasonably fast.
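The effect of stripping diacritics, mentioned above for metacrawler, can be reproduced in a few lines of Python. This is a minimal sketch and not a description of how any search engine is implemented: the function names are hypothetical, and the word list is a hand-picked set of genuine Polish pairs such as sąd ('court') and sad ('orchard'). Note that ł has no Unicode decomposition and has to be mapped explicitly.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def strip_diacritics(word: str) -> str:
    """Remove Polish diacritics, as a diacritic-blind search engine would."""
    # 'ł' and 'Ł' do not decompose under NFD, so they are replaced by hand.
    word = word.replace("ł", "l").replace("Ł", "L")
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collisions(words: Iterable[str]) -> Dict[str, List[str]]:
    """Group distinct words that become identical once diacritics are removed."""
    groups = defaultdict(set)
    for w in words:
        groups[strip_diacritics(w)].add(w)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Hand-picked examples (an assumption for illustration, not corpus data).
WORDS = ["iść", "sąd", "sad", "łaska", "laska", "kąt", "kat"]

if __name__ == "__main__":
    print(strip_diacritics("iść"))
    print(collisions(WORDS))
```

Each colliding pair consists of two unrelated words, so a query stripped of diacritics returns occurrences of both and the translator has to separate them by hand.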
Web Concordancers. Lexware Culler

The product of an Internet search by a Google-like engine is a list of occurrences ordered by some internal criteria, the most important being the relevance of the given page (for example, http://www.google.com/help/interpret.html), though relevance is not defined or explained. However, the placement of a site in the list of results (called priming) can be manipulated in various ways, which predictably tends to make some sites prominent (for example pornographic ones), and is not immediately helpful to a practising linguist, that is to a translator, who needs some sort of linguistic ordering of the search results, for example by part of speech, by word ending, etc. A logical development of search engines is therefore the web concordancer, which produces concordances of textual data. Some examples are KWiCFinder (http://www.kwicfinder.com/KWiCFinder.html), WebConc (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi) and WebCorp (http://www.webcorp.org.uk/). Most of them, however, are again not very useful for highly inflected languages, and some do not adequately support non-Western diacritics; KWiCFinder explicitly does not ‘support search reports on languages with non-Western-European character sets’ (http://miniappolis.com/KWiCFinderVSWebKWiC.html). Some of them are also extremely slow (for example WebCorp). The only web concordancer that, to my knowledge, is suited to Polish is Lexware Culler (http://82.182.103.45/lexware/concord/culler.html), or rather was, because unfortunately the service was discontinued (see Note 3), although it was very useful. What follows is thus a historical description. The software was a demo version, inadequately documented (at least in English), so that its full functionality is not known. It was reported to work with English, Swedish, German, French, Spanish, Russian and Polish. In contrast to other web concordancers, it worked only with the short text extracts that a Google search produces (the technicalities are described in a number of documents, e.g. Dura, 2004). On the one hand, this made it very fast; on the other, these extracts can provide too little context for the translator. It also searched for all inflectional variants of a lemma, and produced statistics based on Google figures, making it extremely useful to anyone working with Polish.
Neither the szukacz engine, nor the Culler application, offers the functionality of Google, that is the search criteria cannot be narrowed down to more precise categories.
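What a concordancer adds to a plain list of hits is the keyword-in-context (KWIC) layout, combined here with a search for several inflectional variants at once. The following Python sketch is an illustration only and says nothing about how Culler or the other tools named above were built: the function `kwic`, the snippet and the list of forms are assumptions, with the two forms of DECH taken from Table 7.6.

```python
import re
from typing import Iterable, List, Tuple


def kwic(text: str, forms: Iterable[str], width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every matching form."""
    tokens = re.findall(r"\w+", text)       # crude tokenisation, punctuation dropped
    wanted = {f.lower() for f in forms}     # inflectional variants of one lemma
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() in wanted:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented two-sentence snippet (an assumption, not taken from any corpus).
SNIPPET = "Nie mogę złapać tchu. Zaparło mi dech w piersiach."
DECH_FORMS = ["dech", "tchu"]

if __name__ == "__main__":
    for left, key, right in kwic(SNIPPET, DECH_FORMS, width=2):
        print(f"{left:>15} | {key:^6} | {right}")
```

With a window of two words the context is already very thin, which mirrors the limitation noted above for concordances built from short search-engine extracts.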
Conclusion

At present there are three static large corpora available for Polish, which all exhibit certain similar features. They are of similar size, around 70-90 million running words. They are marked up, and this also includes morphological annotation, although with varying precision. They seem to follow traditional solutions, similar to those used in hand-compiled collections of text extracts for dictionary compilation in Poland. The word ‘traditional’ refers to the typical preoccupation with the language of (great) writers, rather than with the language of the person in the street. These corpora include a very high number of literary texts, up to 25%, which makes it debatable to what extent a translator of non-fiction can rely on their data. There is also a traditional understanding of the term contemporary: all three corpora include pre-20th-century texts, written by ‘great’ writers, presumably because the language of an average Pole is influenced, through education, by these texts. The share of these literary texts ranges from 3 to 10% (there are no figures for the Łódź corpus), and the translator who would like to use the corpora has to be aware of this fact. The three corpora can be accessed through web interfaces, although in various scaled-down versions, either because of the performance of the search engine, or because a commercial company aims to make money on it. There is only one corpus that can be used offline, with a sophisticated search engine, and if a translator buys a (printed) dictionary of Polish then another offline corpus can be obtained. The two corpora then would amount to either 23 million words (in the faster version) or 80 million words (in the full version). That is a considerable amount of text. Working with language that is always changing, one can also use the web as a corpus. There is at least one search engine that allows the user to search for lemmas rather than just occurrences (Szperacz), and one concordancer that produces concordances of web texts, using a morphological analyser. These are mere beginnings, it would seem, and it is most likely that in the near future the range will be considerably extended.

Notes
1. Retrieved April 21, 2005; the address is no longer valid.
2. This description was valid in early 2005, when the text was written. At present the corpus is called the Reference Corpus of Polish, and it has a component called the Conversational Corpus. The reference corpus on September 24, 2006, had 93,129,588 tokens, while the conversational one had 667,776 tokens.
3. The page displays just the following information: ‘Web Culler demo has been replaced by a new Culler application-Corpus Culler. Here is the new Corpus Culler demo [http://www.nla.se/culler]. Corpus Culler has similar interface as Web Culler but it has superior functionalities, e.g. freqency calculations are available for almost every type of selection. If you wish to test the old Web Culler application please contact us at [email protected]’
4. It should be pointed out here that the English version of the page http://en.pwn.pl/dictionaries.php contains outdated information.
References
Dubisz, S. (ed.) (2003) Uniwersalny słownik języka polskiego. Luxury edition. Vol. I-VI. Warszawa: Wydawnictwo Naukowe PWN.
Dura, E. (2004) Concordances of snippets. Coling Workshop on Using and Enhancing Electronic Dictionaries. Geneva. On WWW at http://82.182.103.45/Lexware/English/publications/coling04.pdf. [retrieved September 24, 2006]
Piotrowski, T. (2003a) Językoznawstwo korpusowe: wprowadzenie do problematyki. In S. Gajda (ed.) Językoznawstwo polskie. Stan i perspektywy (pp. 143-154). Warszawa, Opole: PAN, Uniwersytet Opolski.
Piotrowski, T. (2003b) Internacjonalizm sorki jako element polskiego systemu językowego. In Z. Tichá and A. Rangelova (eds) Internacionalizmy V Nové Slovní Zásobě. Sborník příspěvků z konference Praha, 16-18. června 2003 (pp. 182-193). Praha: Ústav pro jazyk český AV ČR, Lexikograficko-terminologické oddělení. On WWW at http://www.tadeuszpiotrowski.neostrada.pl/sorki.pdf. [retrieved September 24, 2006]
Piotrowski, T. (2005) Komputerowe korpusy tekstowe polszczyzny. In M. Czermińska, editor in chief, S. Gajda, K. Kłosiński, A. Legeżyńska, A. Z. Makowiecki, R. Nycz (eds) Polonistyka w przebudowie. Literaturoznawstwo - wiedza o języku - wiedza o kulturze - edukacja (Vol. II) (pp. 726-735). Kraków: Universitas. On WWW at: http://www.tadeuszpiotrowski.neostrada.pl/krak2004.pdf. [retrieved September 24, 2006]
Przepiórkowski, A. (2004) Korpus IPI PAN. Wersja wstępna/The IPI PAN Corpus: Preliminary version. Warszawa: IPI PAN. On WWW at http://dach.ipipan.waw.pl/~adamp/Papers/2004-corpus/book_pl.pdf. [retrieved September 24, 2006]
Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Chapter 8
The Existential There-construction in Czech Translation
JIŘÍ RAMBOUSEK and JANA CHAMONIKOLASOVÁ
Introduction

Existential sentences, that is, sentences announcing the existence or non-existence, or occurrence or non-occurrence, of a particular phenomenon, have different forms in English and Czech. An English existential sentence (see Note 1) most typically consists of the existential particle there, followed by the verb to be, and an indefinite noun phrase functioning as the notional subject (for example, there was silence). In addition, the existential construction may contain an adverbial of place or time (there was silence in the house). The existential there has the status of a dummy subject fulfilling the grammatical but not the semantic function of the subject. Czech existential sentences, by contrast, contain only one subject, that is, the notional subject. The Czech language does not possess any formal means comparable to the existential particle there. Instead, existence or occurrence is suggested by the intransitive character of the verb and the final position of the notional subject, for example: bylo ticho (‘was silence’), v domě bylo ticho (‘in the house was silence’). The present study examines existential sentences from the point of view of translation practice. Using a corpus of parallel texts, it examines how Czech translators deal with English there-constructions, and what syntactic and semantic patterns they employ to achieve functional equivalence in Czech.
Existential Sentences in English

English existential sentences have attracted the attention of numerous scholars and have been extensively described. Of particular interest has been the use of the existential particle there, the grammatical or dummy subject of an existential sentence. The existential there is also often discussed in comparative language studies. Haiman (1974: 90) claims that the use of dummy subjects has developed only in Germanic languages, French and Romansch, while in the remaining Indo-European languages and most non-Indo-European languages, this category is not found. According to Breivik (1983: 358-403), all languages that make use of dummy subjects have or have had the verb-second constraint.
In creating the framework for our analysis we have relied on the interpretations of English existential sentences presented in the Comprehensive Grammar of the English Language (Quirk et al., 1985), the Longman Grammar of Spoken and Written English (Biber et al., 1999), Mluvnice současné angličtiny na pozadí češtiny [A grammar of contemporary English against the background of Czech] (Dušková, 1988) and in the monograph Existential THERE (Breivik, 1983). According to these sources, the main discourse function of existential sentences is to introduce a new phenomenon into the discourse, expressed by a noun phrase (NP) in post-verbal position. This arrangement is in agreement with the general information principle, that is, the principle of end focus, operating in Indo-European languages: the most natural way of presenting ideas is arranging elements in accordance with a gradual rise in communicative importance (communicative dynamism), and mentioning rhematic, that is, context-independent and unpredictable ideas after thematic, that is, context-dependent and predictable ideas (see Note 2). Other word-order principles operating in Indo-European languages are the grammatical, the emphasis and the rhythmic principles (cf. Firbas, 1992: 117-140; Mathesius, 1975: 153-163; Vachek, 1994: 32-40). While the emphasis and the rhythmic principles are of less importance, the grammatical principle is very prominent. In English, it overrides the information principle; it enforces the sequence subject, verb, other sentence elements (namely object, adverbial and complement). The grammatical and the information principles can both be satisfied in sentences whose subjects represent known information, for example, she bought a new car. In sentences with a new subject, however, the information principle is often violated, for example, an old lady entered the shop.
The special feature of English existential there-constructions is that they satisfy both principles even though the notional subject represents new information. This is possible owing to the occurrence of the existential particle there in the position usually occupied by the subject. The historical development and grammaticalisation of the existential there is closely related to the requirements of the information principle and to major typological changes in the English language (cf. Breivik, 1983). There are various structural types of existential sentences, for example, sentences containing in addition to there + be + NP a local or temporal adverbial (Adv), for example, there were two pictures on the wall, or sentences in which the NP contains a post-modifying phrase, clause or semi-clause, for example, there are many people applying for this job. There are certain differences in the classifications of existential sentences presented in the grammars mentioned above. Biber et al. (1999) and Quirk et al. (1985) classify them primarily according to syntactic criteria and deal with semantic aspects within the syntactic subtypes. Dušková (1988: 353-356), by contrast, provides a classification primarily
based on semantic criteria, making a comparison with Czech. For the purposes of this study, we have modified Dušková’s classification but have included some features from the other two classification systems. The major semantic subcategories of existential sentences distinguished in this study are presented in Table 8.1. The differences between the individual subcategories of existential sentences and their semantic structures will be illustrated by examples from the material analysed (see the section on ‘Analysis’). Instead of the verb to be, some existential sentences contain an intransitive verb (Vi) expressing existence or occurrence, for example, hang, stand, sit; these verbs, however, are rare in existential sentences. An alternative to the existential-locative structure Adv + there + be/Vi + NP (for example, between the windows there was/hung a large picture) is a locative subject-verb inversion pattern without there: Adv + be/Vi + NP (for example, between the windows was/hung a large picture) (cf. Chamonikolasová, 2004). Sentences of this type have not been included in the present study.

Table 8.1 Different types of existential there-constructions

Existential sentence type    Semantic structure                       Syntactic structure
Existential                  Existence                                there + be + NP (+ Adv)
Existential-locative         Existence in a location                  there + be + NP + Adv
                                                                      Adv + there + be + NP
                                                                      there + be + Adv + NP
Action/perception            Existence of action or perception        there + be + NP
Modal                        Existence of possibility or necessity    there + be + NP
Existential Sentences in Czech

Czech existential sentences do not attract as much attention as their English counterparts because they do not represent a distinctive structural type. Czech belongs to a group of languages that make no use of dummy subjects and have no formal means comparable to the existential particle. Breivik (1983: 358-403) suggests that the distinction between languages with and without dummy subjects is closely related to the distinction between subject-prominent languages (for example, English) and topic-prominent languages (for example, Czech). In grammars of the Czech language, existential sentences are mentioned or exemplified in chapters dealing with word order and information structure but are usually not considered a special sentence pattern. The term ‘existential sentence’ is not very common in Czech. Daneš et al. (1987: 607) use the expression ‘scenic sentence’, but most other grammars do not use a specific term. Czech word order is much more flexible than word order in English. In Czech, the overriding word-order principle is the information principle (cf. Daneš et al., 1987: 549-614; Karlík et al., 1995: 633-651); the remaining word-order principles mentioned above are less powerful. In unmarked Czech sentences, rhematic elements (elements constituting the focus) always come after thematic elements (elements constituting the topic). As a result, rhematic subjects conveying new information most naturally occur in post-verbal position (cf. bylo ticho (‘was silence’), v domě bylo ticho (‘in the house was silence’)), while the pre-verbal position may be left unoccupied.
Materials Used in the Present Study

There are several parallel English-Czech corpora available for linguistic and translation research:
(1) George Orwell’s novel Nineteen Eighty-Four, part of the multilingual parallel corpus created within the EU Multext-East Project; the Czech version of the novel was processed by the Institute of the Czech National Corpus (CNC) (see Note 3),
(2) Dominik Lukeš’s aligned corpus of about 100,000 words accessible at www.bohemica.com (see Note 4), and
(3) a parallel subcorpus of about 1.9 million words, included in the Prague Dependency Treebank (PDT) (see Note 5).
The present contribution is based on the analysis of four pairs of parallel English and Czech texts selected from another parallel corpus, Kačenka, which was compiled by the Department of English and American Studies at Masaryk University’s Faculty of Arts in Brno. Its first version, containing over 3.2 million words, includes ten novels (complete texts) and two sets of non-fiction texts (a collection of stock exchange reports and a computer manual), and their translations into Czech. The texts are presented in several formats and are paragraph-aligned. A second version (2005), under the title ‘K2’, contains another 14 novels, that is, an additional 3 million words. This version has been prepared in co-operation with the Department of Information Technologies, Faculty of Informatics, Masaryk University (see Note 6). Looking for texts that are relatively recent and that represent different styles of writing, we selected the novels Lucky Jim by Kingsley Amis, Love
Medicine by Louise Erdrich, A Confederacy of Dunces by J.K. Toole and Catch-22 by Joseph Heller, and their translations into Czech. We determined the total number of existential there-constructions in the entire texts but did not analyse all of them. From each text, we used a sample consisting of the first 100 occurrences of existential there-constructions. Table 8.2 shows how these samples related to the entire texts. Locating there-constructions and corresponding equivalents in Czech was relatively easy owing to the aligned format of the electronic corpus. However, the actual classification of different types of existential sentences in accordance with semantic criteria had to be done without the aid of computer software, limiting the size of the samples available for analysis.

Table 8.2 The structure of the material analysed

           Total no. of words    Total no. of           Sample of there-
           (English texts)       there-constructions    constructions analysed
Amis       91,089                206                    100
Erdrich    90,160                209                    100
Toole      126,787               182                    100
Heller     174,050               423                    100
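The retrieval step described above (locating candidate there-constructions in a paragraph-aligned corpus and keeping the first 100) can be sketched in Python. This is not the procedure or software actually used in the study: the regular expression, the function `first_hits` and the data layout are assumptions. The pattern deliberately over-generates, since it cannot tell existential there from locative there, and it misses constructions with verbs other than to be, which is consistent with the need for manual classification.

```python
import re
from typing import List, Tuple

# 'there' followed, within two intervening words, by a form of 'to be' (or "there's").
THERE_BE = re.compile(
    r"\bthere(?:['’]s|\s+(?:\w+\s+){0,2}?"
    r"(?:(?:is|are|was|were)(?:n['’]t)?|ain['’]t|be|been))\b",
    re.IGNORECASE,
)


def first_hits(pairs: List[Tuple[str, str]], limit: int = 100) -> List[Tuple[int, str, str]]:
    """Collect up to `limit` candidates as (paragraph index, match, Czech paragraph)."""
    hits = []
    for idx, (english, czech) in enumerate(pairs):
        for match in THERE_BE.finditer(english):
            hits.append((idx, match.group(0), czech))
            if len(hits) == limit:
                return hits
    return hits


# Toy aligned data: English sentences from this chapter; the filler entry is invented.
PAIRS = [
    ("There was no television. There were no complaints.",
     "Žádná televize. Žádné stížnosti."),
    ("An old lady entered the shop.", "(filler paragraph, no Czech text)"),
    ("I know there's a war on.", "Vím, že je válka."),
]

if __name__ == "__main__":
    for idx, found, czech in first_hits(PAIRS):
        print(idx, repr(found), "|", czech)
```

Each hit carries its aligned Czech paragraph, so the translation equivalent can be inspected next to the English candidate before it is classified by hand.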
Analysis

The aim of the study is to map strategies applied in translating English existential there-constructions into Czech. Differences between English and Czech in expressing existence or appearance have been observed both at the syntactic and the semantic levels (cf. for example, Dušková, 1988: 353-356). While the distinctions at the syntactic level are relatively clear and easy to describe, as we have seen, a comparison at the semantic level is much more complex. The focus of our attention is therefore semantic differences between English and Czech existential constructions, and lexico-semantic shifts in translations into Czech.

Verb phrase variation

English and Czech existential sentences in the material examined display striking differences in the distribution of different verbs. While the most common verb occurring in the sample of English existential sentences is the verb to be, their Czech translations contain a variety of other verbs as well as the verb být (‘to be’). These alternative verbs are usually intransitive with presentational meaning, that is, their function is to introduce a new element into the discourse. In a number of cases, however, an English existential construction is translated by a non-existential sentence containing a transitive, non-presentational verb. Below are examples of verb phrase variation in the chosen texts. The Czech sentences are followed by an English gloss illustrating their syntactic and morphological structure; two or more English words are connected if they correspond to a single Czech word, for example, ‘I_know’ corresponds to ‘vím’.

Translations preserving the existential (presentational) meaning
English to be → Czech ellipted verb
[1] (Toole, chap. 4, section 3, par. 1)
There was no television. There were no complaints.
Žádná televize. Žádné stížnosti.
[No television. No complaints.]

English to be → Czech být (‘to be’)
[2] (Heller, chap. 4, par. 4)
I know there’s a war on.
Vím, že je válka.
[I_know, that_is war.]

English to be → Czech presentational verb other than být (‘to be’)
[3] (Erdrich, story Crossing the water, section 4, par. 54)
There was a moment when the car and road stood still . . .
Nastal okamžik, kdy vůz i silnice zůstaly bez hnutí, . . .
[Occurred a_moment, when the_car . . .]

English presentational verb other than to be → Czech presentational verb other than být (‘to be’)
[4] (Heller, chap. 3, par. 42)
There then followed a hectic jurisdictional dispute between these overlords . . .
Mezi oběma mocipány došlo proto k vášnivému právně kompetenčnímu sporu . . .
[Between both overlords came therefore to a_hectic . . .]

A number of Czech translations preserving the existential meaning contain intransitive verbs like nastat (‘to occur’) and dojít k (‘to come to something’) (cf. examples [3] and [4]). Our conception of presentational verbs is wide: it includes all verbs that have a prevailingly presentational meaning in particular instances even if they can otherwise be used in non-presentational predications. In the elliptical translations, we distinguished between verbless sentences consisting only of the notional subject of the original existential sentence (cf. example [1]), and translations taking the form of adverbials (cf. example [6]), in which the existential meaning is lost.

Borderline (transitional) cases
English to be Czech mı´t (‘to have’) [5] (Amis, chap. 24, par. 28) But there were two bottles, . . . Meˇla prˇece dveˇ lahvicˇky . . . [She_had after_all two bottles . . .] Sentences containing the verb mı´t (‘to have’) (cf. example [5]) represent a borderline case; the verb shares the existential feature with presentational verbs and transitivity with non-presentational verbs (cf. Biber et al., 1999: 955; Dusˇkova´, 1988: 354; Quirk et al., 1985: 1411). Sentences containing this verb are treated as a transitional type (cf. for instance, Table 8.3). Translations losing the existential (presentational) meaning
English to be → Czech non-verbal element
[6] (Heller, chap. 5, par. 43)
There’s a rule saying I have to ground anyone who’s crazy.
Podle předpisu jsem povinen každého, kdo je stižen duševní poruchou, vyřadit z letové služby.
[According to_regulations I_am obliged everyone who is struck with_mental disorder to_exclude from flight duty.]

English to be → Czech být (‘to be’) functioning as copula
[7] (Amis, chap. 24, par. 20)
I don’t think there can be any doubt of that.
To je docela jasné.
[That is quite clear.]

English to be → Czech non-presentational verb
[8] (Erdrich, story Crossing the water, section 2, par. 45)
There ain’t a prison that can hold the son of Old Man Pillager, a Nanapush man.
Takovej lapák ještě nepostavili, z kterýho by neuplách syn starýho Pillagera, muže z rodu Nanapushů.
[Such prison yet they_built_not, from which would not_escape the_son of_old . . .]
Table 8.3 The distribution of verbs in English there-constructions and their Czech translations

                                     Verb in English there-construction
Czech translation                    To be    Other verbs    Total
Czech existential sentences
  Ellipted verb                        15          0           15
  Existential být (‘be’)              107          0          107
  Other verbs                         124          6          130
  Total                                                       252 (63%)
Borderline (transitional) cases
  mít (‘have’)                         40          0           40 (10%)
Czech non-existential sentences
  Non-verbal element                    6          1            7
  Copulative být (‘be’)                21          0           21
  Other verbs                          80          0           80
  Total                                                       108 (27%)
Number of occurrences                 393          7          400
To sum up, the distribution of verbs in English existential sentences and their translations into Czech is presented in Table 8.3. English existential sentences in the corpus are constructed almost exclusively using the existential verb to be, while other presentational verbs represent a negligible minority. The situation is quite different in Czech. In the Czech translations, the existential character of the original sentence was preserved in only 63% of cases, while 27% of English there-constructions were translated using non-existential sentences. The remaining 10% of Czech translations contain the verb mít (‘to have’), which represents a transitional type of construction. In Czech existential constructions, the verb být (‘to be’) occurs less frequently than other presentational verbs. Most of the Czech non-existential sentences contain non-presentational transitive verbs.

The discrepancy between the high frequency of the verb to be in English there-constructions and the low frequency of its counterpart být (‘to be’) in the Czech translations indicates differences in the semantic structure of the sentences studied. In existential sentences containing to be or other presentational verbs, the semantic load is carried by the notional subject. Presentational verbs other than to be, however, express additional shades of meaning (for example, to come, to stand) and are therefore semantically stronger. Non-presentational verbs are stronger still (cf. examples [10], [12] and [14]) and may become carriers of the highest semantic and communicative load within the sentence (cf. examples [8] and [16]). The data in Table 8.3 indicate that in translating existential sentences, Czech translators tend to use verbs carrying a higher semantic load than the verb to be.

Semantic type variation

Although most English existential sentences have the same syntactic structure, that is, there + to be + notional subject (+ Adv), certain semantic types can be distinguished. Following Dušková’s classification mentioned earlier, we divided the English existential constructions in our sample into four semantic categories. Their distribution in the English material is presented in Table 8.4. The boundaries between the individual semantic types are, however, not clear-cut and there are borderline cases potentially allowing dual classification.

The most frequent semantic type of existential sentence in English is the purely existential type, representing about 69% of all cases. Existential-locative, modal and action/perception sentences occur with lower frequencies within the remaining 31% of sentences under examination. The different semantic types are illustrated by the examples below. Sentences [9], [11], [13] and [15] are translated into Czech using existential patterns; sentences [10], [12], [14] and [16] are translated into Czech using non-existential patterns.
Table 8.4 Distribution of different existential sentence types in the English texts

Semantic type           Number of occurrences        %
Existential                     277                 69.2%
Existential-locative             52                 13.0%
Modal                            41                 10.3%
Action/perception                30                  7.5%
Total                           400                100.0%
Existential type of source-text sentence
Purely existential sentences, referred to in Quirk et al. (1985: 1406) as ‘bare’ existential sentences, postulate the existence or appearance of a concrete or less frequently abstract entity. They do not usually contain a local or temporal adverbial. The location of existence, however, may be implied (for example, ‘here’, ‘in the Universe’, etc.); Dušková (1988: 354) includes in the purely existential type also sentences containing an optional local or temporal adverbial. This conception does not correspond exactly to Quirk et al.’s category of ‘bare’ existential sentences; we have adopted Dušková’s slightly wider conception because, semantically, sentences containing an optional adverbial resemble sentences without an adverbial in that the location is not accentuated.

[9] (Heller, chap. 42, par. 127)
There is hope, after all.
Přece jenom je nějaká naděje.
[After all is some hope.]

[10] (Amis, chap. 24, par. 8)
There was the most shattering scene.
Udělala mi příšernou scénu.
[She_made me a_shattering scene.]

Existential-locative type of source-text sentence
In existential-locative sentences, the location of existence or appearance is more prominent than in purely existential sentences. The sentences contain an obligatory local or temporal adverbial. Dušková (1988: 354) claims that existential-locative sentences differ from purely existential sentences in that they may be transformed into a locative-inversion existential pattern without there (cf. example [11]: There’s a little
junk-room at the end of the passage → At the end of the passage is a little junk-room), and into a locative non-existential pattern (The little junk-room is at the end of the passage).

[11] (Amis, chap. 6, par. 94)
There’s a little junk-room at the end of the passage . . .
Na konci chodby je komora,
[At the_end of_the_passage is a_junkroom . . .]

[12] (Erdrich, story Crossing the water, section 2, par. 97)
There was the skin of a real live alligator nailed on the closet door.
Na dveře šatny přitloukli kůži opravdového aligátora.
[On door of_the_closet they_nailed the_skin of_a_real alligator.]

Modal type of source-text sentence
The notional subject of the modal type of existential sentence contains an element that has the form of an infinitive or a gerund (not illustrated here) expressing possibility or necessity. The name of this category reflects the fact that these sentences express modality usually expressed by non-existential modal sentences.

[13] (Toole, chap. 14, section 1, par. 5)
There was only one way to save him.
Je jen jedna možnost, jak ho zachránit.
[Is only one possibility, how him to_save.]

[14] (Toole, chap. 13, section 13, par. 2)
There was a human life to consider.
Musí brát v úvahu lidský život.
[He_must take into consideration human life.]

Action/perception type of source-text sentence
The action/perception type includes existential sentences whose notional subjects denote action or phenomena involving sensory perception.

[15] (Amis, chap. 4, par. 122)
There was a pause.
Chvíli bylo ticho.
[For_a_while was silence.]

[16] (Amis, chap. 1, par. 6)
There was the most marvellous mix-up in the piece they did just before the interval.
Poslední číslo před přestávkou nádherně popletli.
[The_last piece before interval marvellously they_mucked_up.]

Focusing now on the Czech translations, we examined whether they reflected the differences between the semantic types of English existential sentence. Our analysis suggests that there are certain tendencies in translating the subtypes that indicate the validity of Dušková’s classification. These tendencies are expressed by the data presented in Table 8.5. The data in Table 8.5 suggest that Czech translators use existential patterns more often when translating existential-locative sentences (79% of cases), purely existential sentences (62%) and action/perception sentences (73%) than when translating the modal type of existential sentence (41%). So in the category of existential-locative sentences, the existential character is lost only in 13% of cases. For purely existential sentences and sentences expressing action/perception, the loss is higher (27%). The loss of existentiality is highest with existential sentences of the modal type (44%).

The different distributions of existential and non-existential patterns within the range of semantic types in the Czech translations may reflect the different degrees of ‘presentational power’ that these types display. Existential-locative sentences express the existence or appearance of a certain phenomenon in an explicit location; as this phenomenon is usually a concrete person (persons) or thing (things), whose existence or appearance is independent of any agent, in approaching this sentence type translators tend to preserve the presentational meaning and choose existential or more precisely presentational patterns. The fact that the presentational meaning is preserved less frequently in purely existential sentences and sentences expressing action/perception may be due to the higher proportion of abstract notional subjects within these two types.
Many abstract phenomena imply predication and are therefore more dynamic than concrete phenomena. For example, in there’d be no objection, the notional subject ‘objection’ implies that ‘someone objects to something’; in there was a most marvellous mix-up, the subject implies that ‘someone mixed something up’. In translation into Czech, this may lead to the use of non-presentational patterns, as in example [16] above: [poslední číslo] nádherně popletli. The proportion of non-presentational sentences in Czech is highest within the modal type, in which the implied relationship between agent and action is quite obvious, for example, the sentence there was a human life to consider conveys clearly that ‘someone has to consider a human life’.

Looking at the existential and the action/perception categories, we see that the distribution of existential versus non-existential translations is similar (62% versus 27%, and 73% versus 27% respectively). The two categories, however, are clearly distinguished by the verb occurring within them. The verb být (‘to be’) appears more frequently in translations of the purely existential type both within the existential and the non-existential translations (27% or 74/277, and 7% or 20/277 respectively) than in translations of the action/perception type, which more often contain either presentational verbs other than the existential být, or non-presentational verbs; the verb být represents only 5% (or 1/22) in existential translations, and 0% (or 0/8) in non-existential translations.

Table 8.5 The distribution of verbs within different semantic types in English and Czech

Semantic type in English source   Existential   Existential-locative   Modal   Action/perception   Total
Sample                                277               52               41            30            400
Czech existential pattern
  Ellipted verb                        14                0                0             1             15
  Exist. být (‘be’)                    74               24                8             1            107
  Other verbs                          84               17                9            20            130
  Total                               172               41               17            22            252
  %                                    62               79               41            73             63
Transitional pattern
  mít (‘have’)                         30                4                6             0             40
  %                                    11                8               15             0             10
Czech non-existential pattern
  Non-verbal element                    7                0                0             0              7
  Copul. být (‘be’)                    20                0                1             0             21
  Other verbs                          48                7               17             8             80
  Total                                75                7               18             8            108
  %                                    27               13               44            27             27
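The percentage rows of Table 8.5 follow directly from the raw counts, and recomputing them is a quick consistency check on the figures quoted in the discussion. The following is a minimal sketch in Python; the dictionary keys and the function name `shares` are illustrative choices and not part of the original study.

```python
# Counts copied from Table 8.5; the key names are illustrative assumptions.
counts = {
    "existential": {"sample": 277, "existential": 172, "transitional": 30, "non_existential": 75},
    "existential-locative": {"sample": 52, "existential": 41, "transitional": 4, "non_existential": 7},
    "modal": {"sample": 41, "existential": 17, "transitional": 6, "non_existential": 18},
    "action/perception": {"sample": 30, "existential": 22, "transitional": 0, "non_existential": 8},
}


def shares(row):
    """Return whole-number percentages of each Czech pattern for one semantic type."""
    n = row["sample"]
    return {pattern: round(100 * count / n)
            for pattern, count in row.items() if pattern != "sample"}


table = {sem_type: shares(row) for sem_type, row in counts.items()}

for sem_type, row in table.items():
    print(f"{sem_type:22s} {row}")
```

Under these assumptions the rounded values agree with the percentages reported in the table, for instance 79% existential translations for the existential-locative type and 44% non-existential translations for the modal type.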
Stylistic variation

Analysing the translations of our four novels, we have observed certain tendencies in translating English existential sentences into Czech. In the individual texts, however, these tendencies do not assert themselves with equal force because the treatment of the language material is subject to the translator’s individual style and each has their own preferences. In the Czech translations of the novels by Heller and Toole, for instance, the variation in the Czech verb phrase is much greater than in the other two translations, as shown in Table 8.6. This is manifested in the instances where the source sentence was translated with an existential sentence: the ratio of být ‘be’ to other (semantically stronger) presentational verbs in Toole and Heller is 26:39 and 22:34 respectively, while the translations of Amis and Erdrich display the ratios 27:30 and 32:27 (see Note 7).

In addition, of course, there are stylistic differences in the ways existential sentences are used in the original texts. As an example, Table 8.7 relates the total number of occurrences of there + be and there + another verb to the total number of words in each novel. Although the focus of our study is translating existential sentences from English into Czech, we also decided to include, for purposes of comparison, an additional text, but one which is a translation from Czech into English: the novel The Joke by Milan Kundera. We expected that in this novel, the frequency of there-constructions would be lower, as there is no formal signal in Czech that would suggest to the translator the use of an existential construction. To our surprise, however, the frequency was higher than in all of the original English texts under examination.
The existential construction is most frequent in the Kundera translation, and variation in the use of verbs other than to be is also highest there (3.6% or 11/302), followed by Toole (2.7% or 5/182), Erdrich (1% or 2/209), Heller (0.7% or 3/423) and Amis (0.5% or 1/206). The following verbs were found to have been used once (unless otherwise stated) in the individual English texts in combination with existential there (see Note 8): Amis: appear; Erdrich: come (2 occurrences); Toole: appear, issue, come, begin, lurk; Kundera/Hamblyn/Stallybrass: lurk, stand (2 occurrences), rise, exist (2 occurrences), come (2 occurrences), remain, float, emerge.

Table 8.6 The distribution of verbs within the Czech translations by source text

Author (English source sentences)   Erdrich   Toole   Heller   Amis   Total
Sample                                100      100     100     100     400
Czech existential pattern
  Ellipted verb                         0        7       1       7      15
  Existential být (‘be’)               32       26      22      27     107
  Other verbs                          27       39      34      30     130
  Total                                                                252
Transitional pattern
  mít (‘have’)                         12        9      10       9      40
Czech non-existential pattern
  Non-verbal element                    5        0       1       1       7
  Copul. být (‘be’)                     4        3       8       6      21
  Other verbs                          20       16      24      20      80
  Total                                                                108

Table 8.7 The frequencies of there-constructions in original and translated texts

English text             there + be   there + other verbs   Total number of there-constructions   Total number of words   Words per 1 there-construction
Kundera (translation)       291              11                         302                             113,323                      375
Heller (original)           420               3                         423                             174,050                      411
Erdrich (original)          207               2                         209                              90,160                      431
Amis (original)             205               1                         206                              91,089                      442
Toole (original)            177               5                         182                             126,787                      697

The range of verbs that appear in the Czech translations is much wider. There are verbs that could be called ‘presentational verbs proper’, for example, existovat (‘exist’), nastat (‘occur’), konat se (‘take place’), objevit se (‘appear’), stát (‘stand’), viset (‘hang’), ležet (‘lie’), nacházet se (‘be found’), zůstat (‘stay’), zbýt (‘remain’) and many others. In addition, there are intransitive presentational verbs whose use is motivated by the meaning of the notional subject of the original English sentence or its post-modification, for example: there was no grass → nerostla tam tráva [grew_not there grass]. As translators also use non-presentational constructions containing transitive lexical verbs, the number of verbs that can appear in translations of there-constructions is almost unlimited, cf. examples [8], [10], [12] and [16]. The dynamic semantic properties of different verbs and their applicability in presentational and non-presentational sentences are discussed in detail in Firbas (1992).

Although the focus of our analysis has been prose fiction, we also considered the use of existential there-constructions in two non-fiction texts that are part of the K2 corpus. Although these were not examined in detail, they were checked for the frequency of the there-construction. The texts were (1) four chapters from The Elegant Universe by Brian Greene (32,115 words, approximately one quarter of the whole book), and (2) a set of stock exchange reports from a newsletter for investors (49,649 words). The number of words per one there-construction (corresponding to the last column in Table 8.7) is 554 and 1551, respectively. These figures, in particular the latter, support the data presented in Biber et al.
(1999: 945), which show that there-constructions are less frequent in non-fiction than in fiction; the figures also show that there is a high fluctuation within individual texts and text types.

In the present study, it was not our aim to draw conclusions about the styles of the authors or translators of the texts. Such a study would have to be based on several different texts by the same author or translator and on the analysis of a much wider variety of language phenomena. It would have to take into consideration the cultural backgrounds and the norms of the two languages involved. Stylistic factors are sufficiently complex not to be examined in detail within the scope of this study.
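The ‘words per 1 there-construction’ measure used in Table 8.7 is simply the total word count of a text divided by its number of there-constructions. As a rough illustration, the sketch below recomputes that column from the counts given in the table; the variable and function names are hypothetical and chosen only for this example.

```python
# (there + be, there + other verbs, total number of words), copied from Table 8.7.
rows = {
    "Kundera (translation)": (291, 11, 113_323),
    "Heller (original)": (420, 3, 174_050),
    "Erdrich (original)": (207, 2, 90_160),
    "Amis (original)": (205, 1, 91_089),
    "Toole (original)": (177, 5, 126_787),
}


def words_per_construction(there_be, there_other, total_words):
    """Average number of running words per there-construction, rounded to a whole number."""
    return round(total_words / (there_be + there_other))


density = {text: words_per_construction(*values) for text, values in rows.items()}

# Lower values mean that there-constructions are more frequent in the text.
for text, value in sorted(density.items(), key=lambda item: item[1]):
    print(f"{text:24s} {value}")
```

A lower figure indicates a denser use of the construction, which is why the Kundera translation, with the lowest value, ranks as the text with the most frequent there-constructions.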
Conclusion

The present analysis of four novels written in English and their Czech translations supported by other data outlines certain tendencies in the translation of English existential there-constructions into Czech. In spite of the variations in style between the authors of the novels, as well as
between the individual translators, it is possible to draw the following conclusions.

When translating existential there-constructions, Czech translators did not always preserve the presentational character of the original sentence. In 27% of cases, the existential construction was replaced by a non-presentational pattern in Czech. In addition, 10% of cases were translated using the verb mít (‘to have’), which is a transitional type. The tendency to use non-existential (non-presentational) constructions was strongest with sentences whose notional subjects expressed necessity or possibility, that is, the modal type of existential sentence (44%). It was less strong with sentences postulating existence, that is, purely existential sentences (27%), and with sentences presenting some action or perception, that is, the action/perception type (27%). This tendency was weakest with sentences focusing on the presentation of phenomena in a particular location, that is, the existential-locative type (13%).

In the Czech translations, there was a much higher degree of variation in the verb phrase than in the original English existential sentences, in which the percentage of verbs other than to be was negligible. Most of the non-presentational equivalents in Czech contain lexical non-presentational verbs. The variation in the verb phrase is greater also among sentences preserving the presentational character; in this category, the verb být (‘to be’) occurs in only 107 out of 252 cases (42%); the remaining 130 are verbs of presentation other than být, and 15 are cases of ellipsis. These tendencies reflect the relative semantic weakness of the verb být and the demands of Czech stylistic norms.

As this study is based mainly on the analysis of texts from a single genre, namely prose fiction, it is not possible to draw more general conclusions about the use of the existential there in English, nor about existential sentences in Czech.
Instead the study provides an overview of Czech equivalents that are used and the approaches employed by Czech translators to deal with this systemic difference between the two languages.

Notes

1. In the present paper, the term sentence will be used as a general term denoting simple sentences (clauses consisting of non-clausal sentence elements) as well as complex sentences (sentence structures containing simple sentence elements and subordinate clauses); cf. Quirk et al. (1985: 40).
2. Existential sentences are discussed within various theories of information structure. The terms theme, rheme and context-dependence are used within the theory of functional sentence perspective (see Firbas, 1992; Svoboda, 1981). Closely related to this theory is the framework of topic/focus articulation, in which the terms topic, focus, contextual boundness and presupposition are applied (see Hajičová et al., 1998; Sgall et al., 1986). As the focus of the present study is the lexicosemantic structure of existential sentences, we will not discuss information structure in detail.
3. CNC is the main corpus of the Czech language, a non-commercial academic project run by Charles University’s Faculty of Arts in Prague (http://ucnk.ff.cuni.cz/english/). CNC is a set of subcorpora focussing on different aspects of the Czech language. Recently the Institute of the Czech National Corpus, together with other departments, launched a new project, Intercorp, aiming at compiling parallel corpora of languages studied at Charles University in combination with Czech.
4. Click on Linguistics, then Parallel Corpus. The corpus consists of 24 samples of parallel fiction and non-fiction texts in English and Czech. The corpus is paragraph-aligned and sentences are marked; the files have been tested to work with ParaConc (http://www.athel.com/para.html).
5. PDT (http://quest.ms.mff.cuni.cz/pdt/) is a project of Charles University’s Faculty of Mathematics and Physics. It offers a collection of parallel journalistic texts, consisting of over 50,000 parallel sentences selected from the Reader’s Digest and its Czech version Výběr. The corpus is morphologically tagged and includes notations of syntactic dependencies.
6. For more details about Kačenka and K2 see http://www.phil.muni.cz/angl/k2.
7. It would be rather problematic to consider the ratio of být ‘be’ to other verbs in the category of Czech non-existential sentences, because here the verb být acts as a copula closely tied to a complement whose semantic load brings the whole structure close to other (lexical) verbs.
8. The numbers in Table 8.7 and the lists of English verbs are taken from the complete texts of the books, rather than from the samples used in the analysis elsewhere in the paper; all forms of the verb to be were included in the category there + be, including could be, seemed to be, ain’t etc.
References

Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman Grammar of Spoken and Written English. London: Longman.
Breivik, L.E. (1983) Existential THERE. Bergen: University of Bergen.
Chamonikolasová, J. (2004) Gradation of meaning in an English sentence. Anglica Wratislaviensia 42 (pp. 111–118). Wroclaw: Wroclaw University.
Daneš, F., Grepl, M. and Hlavsa, Z. (1987) Mluvnice češtiny [A Grammar of Czech]. Prague: Academia.
Dušková, L. (1988) Mluvnice současné angličtiny na pozadí češtiny [A Grammar of Contemporary English Against the Background of Czech]. Prague: Academia.
Firbas, J. (1992) Functional Sentence Perspective in Written and Spoken Communication. Cambridge: Cambridge University Press.
Haiman, J. (1974) Targets and Syntactic Change: Janua Linguarum, Series Minor 186. The Hague: Mouton.
Hajičová, E., Partee, B. and Sgall, P. (1998) Topic-Focus Articulation, Tripartite Structures, and Semantic Content. Dordrecht: Kluwer Academic Publishers.
Karlík, P., Nekula, M. and Rusínová, Z. (eds) (1995) Příruční mluvnice češtiny [A Handbook of Czech Grammar]. Prague: Nakladatelství Lidové noviny.
Mathesius, V. (1975) A Functional Analysis of Present-day English on a General Linguistic Basis (L. Dušková, trans., J. Vachek, ed.). Prague: Academia.
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. (1985) A Comprehensive Grammar of the English Language. London: Longman.
Sgall, P., Hajičová, E. and Panevová, J. (1986) The Meaning of the Sentence in its Semantic and Pragmatic Aspects. Dordrecht: Reidel/Prague: Academia.
Svoboda, A. (1981) Diatheme. Brno: Masaryk University.
Vachek, J. (1994) A Functional Syntax of Modern English. Brno: Masaryk University.
Texts analysed

Amis, K. (1962) Lucky Jim. Harmondsworth: Penguin Books.
Amis, K. (1959) Šťastný Jim. (Jiří Mucha, trans.) Prague: SNKLHU.
Erdrich, L. (1989) Love Medicine. New York: Bantam Books.
Erdrichová, L. (1994) Čarování s láskou. (Alena Jindrová-Špilarová, trans.) Prague: Odeon and Argo.
Greene, B. (2000) The Elegant Universe. New York: Vintage Books, Random House.
Greene, B. (2001) Elegantní vesmír. (Luboš Motl, trans.) Prague: Mladá fronta.
Heller, J. (1990) Catch-22. New York: Laurel.
Heller, J. (1985) Hlava XXII. (Miroslav Jindra, trans.) Prague: Naše vojsko.
Kundera, M. (1970) The Joke. (David Hamblyn and Oliver Stallybrass, trans.) Harmondsworth: Penguin Books.
Toole, J.K. (1981) A Confederacy of Dunces. London: Penguin Books.
Toole, J.K. (1985) Spolčení hlupců. (Jaroslav Kořán, trans.) Prague: Odeon.
Chapter 9
Corpora in Translator Training and Practice: A Slovene Perspective
ŠPELA VINTAR
Introduction

In the past, the stereotypical image of a translator would most likely be one of an overworked, slightly grey female or balding male nailed to a desk under a heap of dictionaries and encyclopaedias, leading a rather solitary life. Today, a more realistic picture of a translator at work would inevitably feature a computer with Internet Explorer minimised on the task bar and the heap of dictionaries similarly replaced by an array of desktop icons. It is a fact, yet to be acknowledged by many practising translators and translation scholars, that the digital age brought to the translation business a revolution much more profound than merely switching from paper to the computer screen.

The abundance of electronic texts on the web, available in most languages of the world and often multilingual, is just one of the reasons paper dictionaries can no longer be considered the primary source of translation-relevant information. Another reason, closely related to the main credo of corpus linguistics that the primary element of analysis in language is the text, is that translators extremely rarely translate words in isolation. Any reference work that presents words devoid of textual context is thus of limited value in a translation environment.

If in the early years of corpus linguistics electronic text collections were still considered a luxury for various reasons, among them the cost of computer storage and the complexity of processing large amounts of textual data, in the past decade the situation has changed radically. Indeed, it now seems obsolete to even compile corpora; instead of fixed collections of texts we are entering an era of tools for dynamic corpus creation in accordance with specific and individual requirements. Naturally, there are still arguments in favour of proper corpora versus ad hoc text collections created dynamically by trawling the Web.
The difference between ‘real’ language, although no corpus could ever claim to truly represent it, and the language of the Web, may best be illustrated by comparing a page of concordances obtained from a site such as WebCorp (http://www.webcorp.org.uk/index.html [24.9.06]) with one
from the Bank of English or any other large monolingual corpus. It seems that certain language varieties, for example literary language, are virtually non-existent or seriously under-represented on the Internet, while others like commerce, computers or the informal chatty style of web forums claim a substantial part of bandwidth. The availability of corpora did not go unnoticed by progressive translators. The most valuable corpus type, often designed explicitly for translators, is of course the parallel corpus, giving each language segment in two or more languages. By offering translations of segments instead of equivalents of words, a parallel corpus shifts the translator’s attention from a lexical item to an item of meaning. It is impossible to study an expression without its context, and no given translation can be ‘inadequate for the context in question’ as is often the case with dictionary-based equivalents because they are all already embedded in their proper textual contexts. Another important feature of corpora is the imperfection of language use. Any collection of real texts will contain typos, passages of bad style in the original or translation, and of course translation solutions that run the gamut from excellent to misleading or simply wrong. A corpus user must find a way to critically judge the solutions proposed by the corpus and evaluate them according to the type and contents of the corpus. This point is especially crucial in translator training. A monolingual corpus is an equally valuable resource, though usually for different purposes. As monolingual corpora are generally larger and, in some cases, may be considered representative, they are able to offer information on more or less standard language use on the basis of quantitative data. Moreover, a monolingual corpus can be an important source of translation equivalents for specific expressions, technical terms or recent borrowings, naturally requiring different search strategies. 
Unlike the dictionary, a concordance leaves it to the user to work out how an expression is used from the data. This typically calls for more in-depth processing than does consulting a dictionary, thereby increasing the probability of learning. In more general terms, by drawing attention to the different ways expressions are typically used and with what frequencies, corpora can make learners more sensitive to issues of phraseology, register and frequency, which are poorly documented by other tools (Aston, 1999).

This contribution on corpora in translation and their use in translator training in Slovenia first presents an overview of corpus resources, concluding that Slovene-English is currently the only language pair where parallel corpora can effectively be used. This is followed by an overview of mono- and multilingual corpora for Slovene. Next an account is given of the role of corpora in translator training based on the experience gained by students of the Department of Translation at the
University of Ljubljana. In conclusion, the author expresses the view that, considering the number of speakers of Slovene, one of the smallest language communities in Europe, the language is comparatively well provided for in the field of freely available bilingual language resources.
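To make the notion of a concordance concrete, the following minimal sketch in Python produces keyword-in-context (KWIC) lines from a short text. It is an illustration only: the function name `kwic`, the tokenisation rule and the sample sentence are assumptions made for this example and do not describe the interface of any of the corpora discussed here.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) triples for every hit.

    Tokenisation is deliberately naive: lower-cased runs of word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, used only to demonstrate the output format.
sample = "There was a pause. Then there was no television and there were no complaints."

for left, key, right in kwic(sample, "there", width=2):
    print(f"{left:>16} | {key} | {right}")
```

Each output line shows the search word with a fixed window of context on either side, which is the view a user scans in order to work out how an expression is actually used.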
Overview of Mono- and Multilingual Corpora for Slovene

In fairness to various teams and researchers working on Slovene corpora, it should be noted that this section attempts to include only corpora that can be accessed on the Web and that may be considered a translation resource. We therefore will not be concerned with speech databases and collections that have been assembled for the development of speech technologies, privately owned mono- and multilingual corpora that have been compiled for the purposes of developing a Machine Translation system, or any other text collections that cannot be accessed and are not distributed.

Monolingual corpora

Among the first Slovene electronic text collections was an online repository of Slovene literature compiled by Miran Hladnik (http://www.ijs.si/lit/leposl.html-l2 [24.9.06]). This digital library was founded in 1995 and is still being updated; however, the literary texts today form part of a much larger corpus project, Nova beseda (http://bos.zrc-sazu.si/a_beseda.html [24.9.06]), at the time of writing containing 162 million words. Nova beseda is being compiled at the Slovene Academy of Sciences and Arts and is freely available for online querying. The corpus is composed mostly of the Slovene daily newspaper Delo (120 million words), while the rest is taken from the above-mentioned literary collection, the computer monthly Monitor and some other minor text sources. The interface to Nova beseda is rather simplistic and does not offer many advanced options of corpus querying or processing the hits. The texts of Nova beseda have not undergone any linguistic analysis, hence only word form search is possible, with the wildcard characters * and ?. The corpus does not claim to be balanced or representative in any respect, as it contains a very narrow selection of text sources (see above).
However, the user has the possibility of restricting the search to an individually defined subcorpus through an easy-to-use bibliographic tree. The other large Slovene corpus is FIDA (http://www.fida.net [24.9.06]), a 100-million-word reference corpus of Slovene, which was compiled within a commercially funded project launched in 1998. The name of the corpus is an acronym for the project partners (Filozofska fakulteta/Faculty of Arts, Institute Jožef Stefan, DZS Publishing and Amebis Ltd.).
As the main objective of a reference corpus is to be as representative of the language in all its varieties as possible, considerable efforts were invested into building a balanced corpus out of a much larger text collection. The total number of texts collected for this project was 60,416, of which only 29,177 were chosen for inclusion in the corpus (Gorjanc & Vintar, 2000). The corpus was morphosyntactically annotated by Amebis (http://www.amebis.si [24.9.06]), a Slovene language technologies company responsible for most commercially available language tools for Slovene. The annotation includes complex morphosyntactic descriptions, i.e. not just part-of-speech tags but an array of all grammatical categories associated with the word form. Furthermore, each word form was assigned its lemma or, in the case of morphological homography, lemmas. The corpus can be accessed via the web concordance engine ASP, but only upon purchasing a licence. For demo purposes the corpus can be accessed with a limited number of concordance lines displayed. From today's perspective the main drawback of the FIDA corpus is that the project was finished in 2000 and no texts have been added to the collection since. A nationally funded project to update and expand the corpus into a FIDA+ version is already underway, and its preliminary results can be used at http://www.fidaplus.net [24.9.06].
Multilingual corpora
As a small language with close contacts with other linguistic communities, Slovene has a high level of translational activity. Accordingly, the need for and appreciation of multilingual resources have fuelled several projects compiling parallel corpora. The most interesting language pair, as well as the easiest to obtain, is Slovene-English, which is by now very well served: the total size of freely available Slovene-English parallel corpora amounts to over 12 million words.
Other languages lag far behind, with the notable exception of the MULTEXT-East project, which covers a wide array of languages, albeit without online concordancing.
MULTEXT-East
The name refers to a large initiative, within which a set of corpora and tools were built or made available, covering a large number of mainly Central and Eastern European languages (Erjavec, 2004). The most important component is the linguistically annotated corpus consisting of Orwell’s novel 1984 in the English original and translations. The resources are the results of several EU projects: MULTEXT-East (produced linked resources for Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian and English), TELRI (added resources for Lithuanian, Croatian, Serbian and Russian; first release), and CONCEDE
(validation, re-encoding; partial re-release). This dataset, unique in terms of languages and the wealth of encoding, is extensively documented (http://nl.ijs.si/ME/ [24.9.06]), and freely available for research purposes, upon signing the licence agreement.
IJS-ELAN
The IJS-ELAN Slovene-English parallel corpus includes 15 texts from various domains; the total size of the corpus is 1 million words (Erjavec, 2002). The basic idea behind this project was to build as big a parallel corpus as possible, in the quickest way possible. The already existing MULTEXT-East corpus, consisting of Orwell's 1984, was expanded through a further 14 texts, ranging from EU legislation and pharmacology to computer manuals and localisation files. As text availability was the main criterion in building this corpus, the selection is quite haphazard. An online concordancer was set up shortly after the texts had been preprocessed and the corpus has since been used for a variety of purposes, including as a translation resource (http://nl2.ijs.si/index-bi.html [24.9.06], see Figure 9.1).
Figure 9.1 Search interface of the Slovene-English bilingual corpora provided by the Jožef Stefan Institute
Trans
The Trans Slovene-English parallel corpus was compiled in a rather unspectacular manner as a student project at the Department of Translation, University of Ljubljana. It contains 1 million words and was compiled specifically for translation purposes, which meant that the number of domains covered by the texts was deliberately limited to five: medicine, geology, tourism, nuclear engineering and public administration. The corpus was made available for online search at the same address as the IJS-ELAN corpus (see previous section).
Evrokorpus and SVEZ-IJS
The largest and most challenging translation project in Slovenian history is the translation of the acquis communautaire, a prerequisite for accession to the European Union and a foundation for all legal and administrative matters concerning the EU. As the majority of translation work was performed by the Office of the Government of the Republic of Slovenia for European Affairs using Translation Memory tools, the resulting databases of bilingual segments could easily be converted into a searchable corpus. The first such collection was made available in 2002 under the name Evrokorpus (http://www.gov.si/evrokor [24.9.06]) and is being regularly updated to this day by its author Miran Željko. Roughly the same collection of translation memory segments was converted into the corpus SVEZ-IJS ACQUIS at the Department of Knowledge Technologies, Jožef Stefan Institute. The original translation memories were tokenised, tagged, lemmatised and encoded according to TEI P4 recommendations. The corpus is available online and represents the only English-Slovene corpus that can be searched using lemmas and tags. Thus, it is possible to perform, say, a search for all word forms of the base form rastlina (meaning plant) preceded by an adjective. A search of this type would produce the output shown in Figure 9.2. The fact that during the process of EU enlargement most texts produced were made publicly available as a parallel corpus is an unprecedented advantage for translators from and into Slovene. Combined with the terminology database Evroterm (http://www.gov.si/evroterm [24.9.06]), this is a unique infrastructure ensuring consistent translations in all EU-related domains.
Sublanguage corpora
For terminology work, both mono- and multilingual, another corpus type is extremely useful: the domain-specific, special language or sublanguage corpus.
For this corpus type it is important that it is representative of the domain in terms of the text types contained and the 'freshness' of the texts. Such a collection can almost never be entirely bilingual, because a special domain is best represented by a collection of crucial texts in one language. Several criteria should be considered when compiling a sublanguage corpus:
• Register. A special domain like, say, genetics, will typically be described in texts of various registers, e.g. scientific papers, college textbooks, articles in popular scientific journals etc. Register can have a considerable influence on the terminology used and the style.
• Quality. Although in itself a slippery issue, texts do differ in the amount of effort invested into all stages of their production, from authorship to typesetting and printing. Corpora generally should not impose normative restrictions; however, for domain-specific corpora certain texts might be inappropriate on the grounds of poor quality.
• Original, translated or written in a foreign language. In most scientific domains researchers publish their work in languages other than their own, either as written by themselves or translated into the target language. Such texts should by no means be considered substandard because they too constitute the language reality in a given domain. We should, however, be aware of the characteristics of such texts and possible inconsistencies resulting from them.
Figure 9.2 Search by tag and lemma (adjective and base form of rastlina [plant])
In the context of corpora and translation from the Slovenian perspective, one sublanguage corpus project should be mentioned, namely the corpus of Information Science Proceedings of the Slovene Informatics Conference (DSI, http://nl2.ijs.si/index-mono.html [24.9.06]). The domain of Informatics is a highly productive and terminologically challenging one for all non-English languages, and a monitor corpus is the best way to follow language development fuelled by technology. The DSI corpus was compiled as support for Islovar, the interactive online Slovene-English terminological dictionary of Informatics (Islovar, http://www.islovar.org [24.9.06]). To conclude this overview, corpus resources for Slovene are extensive but poor as far as the range of languages is concerned. Slovene-English is currently the only language pair for which parallel corpora can effectively be used; for other language pairs translators are left with bilingual dictionaries and self-compiled text archives.
Corpora in Translator Training
For a translator, a corpus is one of the sources of linguistic information, either on the lexical level when searching for translation equivalents or on other levels when seeking to produce a functional translation. As the primary source of lexical information, most translators still rely on dictionaries, although in special domains term banks may be used much more often than general language dictionaries. If an item is not found in the dictionary, the next stage is usually Google. Clearly the Internet is a gold mine for translators, as it contains up-to-date documents on almost all subjects in almost all the languages of the world (Fletcher, 2004). However, most documents on the Web are not bilingual, and the quest for translation equivalents requires efficient and innovative search strategies. The Web is in itself a large multilingual corpus, and there are tools available that facilitate its use as a corpus, such as KwicFinder (http://miniappolis.com/KWiCFinder/KWiCFinderHome.html [24.9.06]). We might say that between these two extremes, dictionaries as static, normalised, structured data and the Web as unstructured, chaotic, abundant data, there are corpora, some that already exist and some that we might build ourselves. In a translator training environment it is important to provide enough room for corpus-related courses and devote enough time to incorporating these resources into translation practice. The following sections describe experiences gained through several years of teaching corpus-related courses at the Department of Translation at the University of Ljubljana and other institutions of higher education. The curriculum of a university programme in translation should comprise at least the first two of the following:
• Introduction to Corpus Linguistics. There might not be enough time to include this as a separate course; however, a module of at least 10 lecture hours should be provided to cover a brief history of corpus linguistics and explain the basic terms of corpus creation, annotation and exploitation.
• Overview of existing online corpora for the students' mother tongue and the languages they study, including multilingual and special language corpora, if available. This overview should include a demonstration of the search facilities as well as practical corpus-based translation assignments.
• Demonstration of methods and tools for compiling corpora, both mono- and multilingual. By this we mean concordance software, tools for sentence alignment, tools for manual annotation and, possibly, methods or platforms needed for text manipulation like XSLT or Perl.
In Ljubljana, general issues related to corpus linguistics are partly dealt with in the Slovene course and partly in the Translation Tools course in the second year.
Teaching existing corpora
The languages taught at the Translation Department of the University of Ljubljana are Slovene as L1, English or German as L2 and English, German, Italian and French as L3. The overview of existing corpora thus covers major monolingual corpora for these languages (FIDA and Nova beseda for Slovene, BNC and Bank of English for English, IDS Korpora for German, CORIS for Italian, several non-reference corpora for French) and all Slovene-English bilingual corpora presented above. By existing corpora we refer only to corpora available online free of charge. For first-time corpus users it is important to draw attention to some key issues:
(1) Corpora are not dictionaries. Texts may contain language usage that does not correspond to what is considered standard or correct.
(2) When using non-lemmatised corpora of highly inflectional languages, the search for the base form will return only a small portion of possible hits. The linguistic pattern of the base form may differ from the patterns of inflected forms.
(3) Results from a corpus require critical interpretation. Frequency information should be interpreted according to corpus composition.
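Point (2) can be made concrete with a small sketch in Python. It is only an illustration under stated assumptions: the token list and its lemmas are invented for the purpose, the function names are hypothetical, and the wildcard matching uses Python's standard fnmatch module rather than the query engine of any corpus mentioned in this chapter.

```python
from fnmatch import fnmatchcase

# Toy data: (word form, lemma) pairs, invented for illustration only.
TOKENS = [
    ("Zelena", "zelen"), ("rastlina", "rastlina"), ("raste", "rasti"),
    ("ob", "ob"), ("drugih", "drug"), ("rastlinah", "rastlina"),
    ("brez", "brez"), ("rastlin", "rastlina"), ("rastlinjaka", "rastlinjak"),
]

def search_form(tokens, form):
    """Exact word-form search: all a non-lemmatised corpus offers by default."""
    return [w for w, _ in tokens if w.lower() == form.lower()]

def search_wildcard(tokens, pattern):
    """Word-form search with the wildcards * and ? (fnmatch semantics)."""
    return [w for w, _ in tokens if fnmatchcase(w.lower(), pattern.lower())]

def search_lemma(tokens, lemma):
    """Lemma search: only possible if the corpus has been lemmatised."""
    return [w for w, tok_lemma in tokens if tok_lemma == lemma]

if __name__ == "__main__":
    print(search_form(TOKENS, "rastlina"))      # base form only
    print(search_wildcard(TOKENS, "rastlin*"))  # inflected forms, plus noise
    print(search_lemma(TOKENS, "rastlina"))     # inflected forms only
```

On this toy data the exact search finds a single hit, the wildcard search also picks up rastlinjaka, a form of a different word, and only the lemma search returns exactly the forms of rastlina. This is the trade-off students should be made aware of.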
A monolingual corpus of the mother language will naturally be used for different purposes than a corpus of a foreign language. An interesting feature is the search for translation equivalents in a monolingual corpus. An English word like spam will first occur in Slovene as a borrowing, so the search for spam in Nova beseda returns a considerable number of hits, where, if we examine the context, several possible translation equivalents occur near the borrowed word, e.g. nezaželena pošta, elektronske smeti, nenaročena oglasna pošta etc. In order to make corpus exercises as close to translation reality as possible, it is a good idea to work with texts instead of isolated examples. Thus, even a short domain-specific text like the one in Figure 9.3 may help illustrate the usefulness of bilingual corpora, especially if the domain in question is well represented in the corpus. The highlighted words cannot be resolved with bilingual dictionaries but are best handled with the use of mono- and bilingual corpora.
Stopping Spam at the Gateway September 17, 2003 By Steven J. Vaughan-Nichols I hate spam. You hate spam. We all hate spam. But, none of us hate spam as much as ISP and business network administrators hate spam. Alexis Rosen, president and co-owner of Public Access Networks, which runs Panix, one of the oldest ISPs concedes that spam's "not as bad as Adolph Hitler," but "it is morally evil." Well, that's clear enough. Why such strong feelings? Rosen explains spam "chews up a lot of bandwidth and disk space." And the non-stop disk I/O sucks down system resources and significantly stresses the mail server. And why is this so annoying? Because it directly interferes with their ability to perform as an ISP and that, in turn, is slapping down the bottom line. This isn't just Panix's problem. All ISPs and corporate networks face it. So what can you do about it?
Figure 9.3 Example text for corpus-based translation
Another exercise that may illustrate the potential usefulness of corpora compared to dictionaries is an exploration of polysemy and the lexical patterns associated with a word. The advantages of corpora with respect to phraseology and collocation patterns are indisputable. An example of an interesting word is bug in nominal and verbal senses, across different domains and registers. A simple word stem search (*bug*) in the IJS-ELAN corpus returns as many as 16 derived word forms, among them bug, debug, bugger, bug-free, debugging, buggy, and a look at the concordance quickly reveals at least four meanings: a fault in a computer program, an insect, a hidden microphone and, in verbal use only, to annoy or pester. Using a parallel corpus, the students may be encouraged to identify various meanings of bug first in a monolingual context, explore the collocational patterns of each sense, compare the list of corpus-based senses with a monolingual dictionary, and finally search for translation equivalents of each sense in the target language. Although we normally think of corpora as synchronic resources portraying language at a certain point in time, some interesting studies into term formation have been made for Slovene and English, for example for the field of mobile communications (Glavan, 2004). The corpus Nova beseda was used for diachronic research by building several subcorpora according to the year of publication. In this way the frequency of terms like mobitel, WAP, wapanje etc. could be explored year-wise and the tendencies of terminological development quantified.
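The year-wise exploration of term frequency described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure used in the cited study: the function name, the whitespace tokenisation and the sample sentences are all assumptions introduced here. Frequencies are normalised per million tokens so that subcorpora of different sizes remain comparable, in line with the earlier advice to interpret frequency according to corpus composition.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def frequency_per_million(docs, term):
    """docs: iterable of (year, text) pairs; returns {year: hits per million tokens}."""
    term = term.lower()
    hits, totals = Counter(), Counter()
    for year, text in docs:
        # Naive tokenisation: split on whitespace, strip surrounding punctuation.
        tokens = [t.strip(PUNCTUATION) for t in text.lower().split()]
        tokens = [t for t in tokens if t]
        totals[year] += len(tokens)
        hits[year] += tokens.count(term)
    return {year: 1_000_000 * hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}

# Invented sample sentences, one "subcorpus" per year of publication.
DOCS = [
    (1999, "Novi mobitel je drag."),
    (2000, "Mobitel ima WAP, mobitel je povsod."),
]

if __name__ == "__main__":
    for year, freq in frequency_per_million(DOCS, "mobitel").items():
        print(year, round(freq))
```

Because the search is for a single word form, inflected forms of the term would be missed; on a lemmatised corpus the same counting could be done over lemmas instead.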
Teaching corpus building
For many languages, special domains or language pairs, there are no available corpora. On the other hand, the Internet is an infinite source of documents and texts on all possible subjects, some available in two or more languages. In addition, most people are in the habit of storing their translation projects on hard disk, and if there were a systematic way of searching through all these files, the process of retrieving previously used items of information might be much faster and easier. Arguments in favour of compiling one's own corpora are many, although to most people the effort seems too strenuous considering the potential benefits. Especially in view of translation memories and the idea of reusability behind them, it seems that the people who opt against Trados will also be sceptical about corpora. Of course, the purpose of these two types of resources differs to a great extent. While translation memories provide reusability only at the most rigid level of sentence similarity, (bilingual) corpora provide insight into language or translation solutions on almost any imaginable level. At the Department of Translation in Ljubljana we have undertaken several student projects of compiling bilingual corpora. Such corpus projects have certain limitations compared to corpora compiled within research projects:
• All tools and methods demonstrated should be available to students inside as well as outside the classroom. The experiment should be completely replicable in any other out-of-the-classroom setting.
• All tools should be free and, if possible, run under Windows.
• Translation students generally cannot program, and all data manipulation must be performed using standard text processing software and non-exotic file formats.
The following sections briefly describe the stages involved in building a corpus and the tools available.
Collecting and pre-processing texts
According to the definition given by Sinclair (1991), a corpus is 'A collection of naturally occurring language text, chosen to characterize a state or variety of a language.' The choice of texts should therefore be concerned with the representativeness of a corpus, even if only a small domain is to be represented (cf. K. Aijmer, this volume). Of course, in bilingual corpora it is even more difficult to maintain this criterion; nevertheless, the composition of the corpus should at least be thoroughly discussed. The purpose of this discussion is to clarify issues of corpus size, number of domains included, text types, language(s) and/or language of the original, possible text sources, copyright etc. Once the project has a set of clearly defined objectives in terms of text collection, some technical questions also need to be resolved. Which file formats can be successfully handled? If the main source of texts will be the Internet, HTML will need to be handled; if on the other hand we expect text donors from translation agencies or private entities, MS Word is likely to be the most common format. Which character encoding should be used for diacritics? Probably Unicode, typically in its UTF-8 form, though older tools might have problems displaying such characters. Which encoding should be chosen for the entire corpus? If we are building a resource that should be used and distributed as widely as possible, we should probably choose the TEI encoding (Sperberg-McQueen & Burnard, 2002); however, without appropriate computational knowledge this standard is not trivial to implement.
Alignment
If we are building a parallel corpus, the texts will need to be sentence aligned. If we can get our hands on a licensed copy of Trados WinAlign, version 6.5 or higher, alignment is an easy task. A sentence alignment utility is offered by several other translation memory packages (such as ATRIL's DejaVu), as well as by the parallel concordance tool ParaConc (http://www.athel.com/para.html [24.9.06]), obtainable for a relatively modest fee. Sentence alignment is usually a semi-automatic procedure, where the tool proposes sentence pairs, which must be manually corrected in the event of errors. The DejaVu alignment utility can handle various file formats, including HTML, Word or XML files.
Conversion into XML
When a pair of sentence-aligned files is exported from DejaVu in plain text format, each sentence pair is displayed on one line, with the two sentences separated by a tab character. With a set of simple transformations performed within MS Word it is possible to convert this format into a simple XML, which is better suited for potential online applications or further processing. An input of the form
'Source' [tab] 'Target'
can easily be converted into XML with the elements <tu> for translation unit and <original> and <translation> for the two languages:
<tu> <original>Source</original> <translation>Target</translation> </tu>
Offline concordancing
There are a number of tools available for concordancing at modest prices. A widely known toolkit for monolingual text analyses is WordSmith Tools by Mike Scott (http://www.lexically.net [24.9.06]). While perfectly adequate even for advanced corpus linguists working with monolingual corpora, it is of very limited use for querying bilingual corpora. For the latter, the above-mentioned ParaConc is a good option (Figure 9.4). The DejaVu demo version offers somewhat more limited search options but is still functional: it displays the corpus as a two-column table and, when used with the filter function, displays concordances for the search string. According to the experience gained, building bilingual corpora in the classroom is not only a useful exercise and a corpus-awareness-raising activity, but also an undertaking that produces extremely valuable resources for the entire translation community.
Figure 9.4 ParaConc, the parallel concordancer
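For readers who are comfortable with a scripting language, the tab-separated-to-XML conversion described under Conversion into XML, together with a very basic parallel search, can be sketched in Python as follows. This is an illustrative sketch and not the classroom procedure, which deliberately relies on MS Word: the wrapper element corpus, the function names and the sample sentence pairs are assumptions introduced here, while the elements tu, original and translation follow the description above.

```python
import xml.etree.ElementTree as ET

def tsv_to_xml(lines):
    """Turn 'source<TAB>target' lines into <tu> elements under a wrapper element."""
    root = ET.Element("corpus")  # hypothetical wrapper; not prescribed by the text
    for line in lines:
        source, tab, target = line.rstrip("\n").partition("\t")
        if not tab or not source.strip():
            continue  # skip blank lines and lines without a tab separator
        tu = ET.SubElement(root, "tu")
        ET.SubElement(tu, "original").text = source.strip()
        ET.SubElement(tu, "translation").text = target.strip()
    return root

def parallel_search(root, query):
    """Return (original, translation) pairs whose original contains the query."""
    query = query.lower()
    return [(tu.findtext("original"), tu.findtext("translation"))
            for tu in root.iter("tu")
            if query in (tu.findtext("original") or "").lower()]

# Invented sentence pairs standing in for an exported alignment file.
ALIGNED = [
    "The plant is green.\tRastlina je zelena.\n",
    "Report the bug.\tPrijavite napako.\n",
]

if __name__ == "__main__":
    corpus = tsv_to_xml(ALIGNED)
    print(ET.tostring(corpus, encoding="unicode"))
    print(parallel_search(corpus, "bug"))
```

One practical advantage of building the XML through a library rather than by search-and-replace is that characters such as & and < in the sentences are escaped automatically, so the result stays well-formed.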
Conclusions
A few years ago corpora were an unexplored terrain for many practising translators and translation tutors alike. This situation seems to be changing both because translators are required to produce high-quality translations in a shorter time than before and because electronic language resources are more accessible than before. The aim of this contribution has been to present the situation in Slovenia and for the Slovene language, which with its just under 2 million speakers counts among the smallest language communities in Europe. Nevertheless, in the field of bilingual freely available language resources, Slovene is comparatively well provided for. Not many languages can boast an online parallel corpus of over 12 million words, and corpus-related activities in the context of translator training by now have the status of a well established tradition. Still, there are a number of items on the wish list, many of which will hopefully materialise within the next few years. Among them are the updated version of the Slovene reference corpus FIDA+, the creation of bilingual corpora for the language pairs Slovene-German and Slovene-Croatian as a start, and the provision of essential language resources, such as tokenisers and morphosyntactic taggers, for free online use and for all the major languages with which we work.
References
Aston, G. (1999) Corpus use and learning to translate. Textus 12, 289–314.
Erjavec, T. (2002) The IJS-ELAN Slovene-English Parallel Corpus. International Journal of Corpus Linguistics 7 (1), 1–20.
Erjavec, T. (2004) MULTEXT-East Version 3: multilingual morphosyntactic specification, lexicons and corpora. In: M.T. Lino and M.F. Xavier (eds) Fourth International Conference on Language Resources and Evaluation, Lisbon, Portugal, 26th, 27th & 28th May 2004. LREC 2004: held in memory of Antonio Zampolli: proceedings (pp. 1535–1538). Paris: European Language Resources Association.
Fletcher, W.H. (2004) Facilitating the compilation and dissemination of ad-hoc web corpora. In: G. Aston, S. Bernardini and D. Stewart (eds) Papers from the Fifth International Conference on Teaching and Language Corpora. Amsterdam: Benjamins. On WWW at http://kwicfinder.com/Facilitating_Compilation_and_Dissemination_of_Ad-Hoc_Web_Corpora.pdf. Accessed 24.9.06.
Glavan, S. (2004) Terminotvorje v slovenščini na primeru izrazja mobilne telefonije. Diploma Thesis, Faculty of Arts, University of Ljubljana.
Gorjanc, V. and Vintar, Š. (2000) Iskanja po korpusu slovenskega jezika FIDA. In: Proceedings of the Conference Language Technologies, ISJT 2000, 20–27.
Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sperberg-McQueen, C.M. and Burnard, L. (eds) (2002) TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML Version. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium.
Chapter 10
NP Modification Structures in Parallel Corpora¹
TAMÁS VÁRADI
Introduction
Parallel corpora of source texts and their translations are widely held to be rich sources of data to be exploited both for language technologies and translation studies (Melamed, 2001; Olohan, 2004), their utility being limited only by their relative scarcity. The present contribution uses a richly annotated parallel corpus consisting of a 20th-century English novel and its translation into Hungarian to explore NP modification structures in English and Hungarian such as the water from the tap ('a csapból folyó víz') or a fekete hajú nő ('the black-haired woman'). The aim is to use state-of-the-art language technology and resources to explore linguistic structures that are brought into correspondence as translation equivalents. In addition to the methodological aspects discussed, we will present findings from analyses carried out at the lexical and syntactic level.
The scope of the analysis
Originally, our intention was to focus on adjectives, but it soon emerged that an analysis of textual equivalents must transgress the boundaries defined by part-of-speech category membership. While a traditional contrastive analysis of adjectives will face the problem that adjectives are often defined in terms of function (modification), contrasting languages at the level of textual equivalence further exposes the untenability of such neat parallels. Although the category of adjectives exists in both languages, this is no guarantee that in translation adjectives will appear in strict conformity. Indeed, this observation was corroborated by our findings, in that in many cases Hungarian A+N structures corresponded to N+PP structures in English. The approach advocated here does not confine itself to a study of textual equivalence in a given set of translation pairs. Any translation necessarily involves a choice from a list of equally appropriate candidates. Our aim has been to establish corresponding structures across languages whether or not they were exemplified in the actual translation pair. In a similar vein, the analysis will ignore the direction of the translation. NP modification structures will be brought into a relationship of mutual textual equivalence, without regard to the question of which one is the original and which is its translated version. This simplifying assumption is made because our concern here is not confined to the analysis of the actual translation of any particular text; rather, we want to map all potential translation equivalents that can be established between the two languages for the specific linguistic phenomena in question.
Source data
As source data for the analysis we used Orwell's 1984 and its Hungarian translation by László Szíjgyártó. This corpus was prepared at the Linguistics Institute of the Hungarian Academy of Sciences as part of the MULTEXT-East project in the mid-1990s (Dimitrova et al., 1998). The project produced a richly annotated, carefully aligned parallel corpus available in SGML and more recently in XML encoding. The structure of the text is captured in the annotation down to the level of the sentence in terms of parts and chapters, both tagged as division (div) elements, paragraphs (p) and translation units (tu), the latter containing the corresponding sentence or sentences in each language wrapped up in a segment (seg) element. Accordingly, each tu contains exactly two seg elements, one for English and one for Hungarian, each of which consists of one or more sentences in the respective language. Sentences (s) are made up of a stream of word (w) and punctuation (c) units. Each word carries its morphosyntactic analysis and its dictionary form, tagged as msd and lemma attributes respectively. The corpus can be queried online at http://corpus.nytud.hu/orwell. The corpus query tool made available through this website is suitable mostly for looking up lexical equivalents, as the search expression covers surface lexical forms only.
However, for the purposes of the present investigation, we needed to have recourse not only to morphosyntactic tagging but also to syntactic information. To carry out the analysis, the CLaRK tool (Simov et al., 2004) was used. The CLaRK system is an XMLbased linguistic development environment that integrates a finite state grammar engine with XML technologies. Its current version has been equipped with the functionality to link and process parallel corpora, making it especially suited to our purposes. The screenshot in Figure 10.1 illustrates the coding of the data as displayed in the CLaRK system. Syntactic analysis The primary purpose of our study was to identify and relate NP structures in both languages. This task does not call for in-depth syntactic analysis of each sentence in the corpus, which would not have been feasible to carry out in an automated manner. Instead, a partial syntactic
Figure 10.1 The encoding of the data in the CLaRK tool
170
Incorporating Corpora
NP Modification Structures in Parallel Corpora
171
AdjP (,(|)+)+, coord Figure 10.2 A sample regular expression grammar for co-ordinated adjective phrases (AdjPs)
analysis yielding top-level NP structures at the sentence level would suffice. The technology of the CLaRK system provides cascaded regular grammars defined over the features of the morphosyntactic descriptions. The system is termed ‘cascaded’ because the rules are applied successively, and rules applied later may use the outcome of earlier rules. In other words, once a syntactic unit has been defined, rules applied later can refer to it as if it were a basic unit like a word. Figure 10.2 shows the regular expression of co-ordinated adjective phrases using the syntactic unit AdjP defined earlier in the sequence of grammatical rules. The comma acts as a delimiter for three units, an obligatory AdjP head in the rightmost position is preceded by one or several AdjP’s optionally followed by either one or several punctuation marks (Bc) or conjunctions (B‘‘Con#’’). The fragment used in Figure 10.2 owes its origin to an earlier attempt to automatically identify maximal extension NPs in Hungarian (Va´radi, 2003). It turns out, however, that far simpler methods may yield equally useful results. As the outcome of the NP recogniser is meant for human analysis, it did not prove necessary to deploy a fully automatic, robust and highly reliable NP grammar to collect relevant data from the corpus. A few simple tricks proved effective in reliably isolating the data with minimal loss of coverage. In particular, we relied on the fact that NPs have an N head (a simplification but a reasonable and well defined one) and, however complex a pre-modification structure in Hungarian a NP may be, it always connects to the N head either via an Adj or an Adjectival Participle. Therefore, the simple bigrams AN or MIN (MI: Hungarian abbreviation for Adjectival Participle) will capture most of the pre-modified NPs in Hungarian. To establish their left boundary would be extremely difficult for reasons inherent in the internal organisation of Hungarian NPs that will be discussed later. 
Similarly, for English, simple grammars covering AN, NVing, N ‘with’ sequences will cover the overwhelming majority of the NPs we seek to examine.
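The bigram heuristic described above can be sketched in a few lines of Python. This is only an illustration, assuming that the text has already been tokenised and tagged; the tag labels (A, MI, N, Det, Adv), the function name and the hand-tagged sample are assumptions made for the example, not the tag set or the tooling actually used with the CLaRK system.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word form, part-of-speech tag)

# Hypothetical tag labels: A = adjective, MI = adjectival participle, N = noun.
MODIFIER_TAGS = {"A", "MI"}
HEAD_TAG = "N"


def modified_noun_bigrams(tokens: Iterable[Token]) -> List[Tuple[str, str]]:
    """Return (modifier, noun) pairs where an A or MI token directly precedes an N token."""
    tokens = list(tokens)
    hits = []
    # Slide over adjacent token pairs and keep the AN and MIN bigrams.
    for (word, tag), (next_word, next_tag) in zip(tokens, tokens[1:]):
        if tag in MODIFIER_TAGS and next_tag == HEAD_TAG:
            hits.append((word, next_word))
    return hits


# Hand-tagged sample built from a phrase quoted in this chapter (tags are assumed).
sample = [
    ("a", "Det"),
    ("mögötte", "Adv"),
    ("ülő", "MI"),
    ("fekete", "A"),
    ("hajú", "A"),
    ("lány", "N"),
]

print(modified_noun_bigrams(sample))  # [('hajú', 'lány')]
```

Only the right edge of the phrase is spotted in this way; the left boundary is not attempted, which matches the procedure in which the extension of the NP is established by hand.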
Comparison of Main Linguistic Units

Table 10.1 displays the main summary figures for the two subcorpora. The first six rows of columns EN and HU contain frequency figures for the units in the leftmost column. The rightmost column shows relative
frequency figures for Hungarian in terms of the corresponding English units. Rows 7–11 display frequency of occurrence ratios for the units in the leftmost column; for example, an S/w value of 15.5 means there was one sentence for every 15.5 words, i.e. the average sentence length in English was 15.5 words. Again, the HU/EN column records the relative difference between Hungarian and English in comparable percentage terms. The parallel corpus consists of 6669 translation units. The number of sentences is only slightly higher than the number of translation units in both languages, suggesting that the majority of the translation units present one-to-one correspondences between sentences. The difference in the number of words (Hungarian: 80,694; English: 104,286) accords well with the different design of the two languages. Hungarian has a rich morphology, and uses productive compounding and derived forms; hence, what is expressed in English through paraphrasing may be rendered in Hungarian through the use of one word. Note that this observation applies mostly to the lexical elements expressing grammatical relations. When we look at the figures relating to nouns and adjectives, we find a much closer similarity in frequency distributions between the two languages: the cross-linguistic ratios are much closer to 100%, meaning equal relative occurrence in the two languages. Of particular importance in Table 10.1 are the NP figures. Only NPs having some modification structures have been included.

Table 10.1 Comparative statistics of main units

Unit    EN        HU       HU/EN
W       104286    80694    77.4%
S       6737      6768     100.5%
N       21100     19752    93.6%
A       7385      6743     91.3%
NP      5027      4923     97.9%
S/w     15.5      11.9     77.0%
N/w     20.2%     24.5%    121.0%
A/w     7.1%      8.4%     118.0%
NP/w    4.8%      6.1%     126.6%
A/N     35.0%     34.1%    97.5%

w: word; S: sentence; N: noun; A: adjective; NP: noun phrase

As we shall see, beyond AN sequences, the scope of phenomena belonging here is
far from straightforward to establish. Particularly difficult are postmodification structures involving prepositional phrases in English, such as the people in the room. Nevertheless, even the relatively clear-cut cases included in the figure in Table 10.1 show a rate of similarity second only to the number of sentences between the two languages. In other words, while there was a 23% difference in the number of words used, 98% of the NPs in the English texts were rendered with an NP in Hungarian. This reinforces our initial expectation about the similarity between the two language versions.

Lexical frequency lists

Table 10.2 displays the frequency lists of the 30 most frequent adjectives in the two subcorpora. It would be linguistically naïve to expect a rigorous correspondence between rankings and well established lexical equivalents in the list, that is, for an item to be translated in the same way throughout the text and for its translation equivalent to feature in the same position in the frequency list of the other language. Such fixed correspondence is rare, the best example in the corpus being ‘Big’ in the name Big Brother, which is consistently translated as ‘Nagy’ in Hungarian. In the more typical case of one-to-many lexical correspondence, the list could serve as a basis for compiling a form of equivalence profile for the lexical items concerned. Items with a narrow and fairly consistent set of equivalents may be distinguished from those with a more diffuse profile, that is, with a wide range of equivalents. In addition, it would be possible to establish a frequency ranking among the equivalents, information that would be most valuable for bilingual lexicographical purposes. Even if we do not engage in such a close-up view of the lexical equivalents shown in Table 10.2, the list has immediate diagnostic value. Simply looking at items showing great disparity across the languages can reveal interesting cases.
The item at the top of the Hungarian list, egész, for example, heads the list because it is homonymous with the adverb and is often incorrectly tagged as an adjective. The item heading the English list, same, highlights another common phenomenon. Its apparent lack of a suitable equivalent in the Hungarian list is due to the fact that the corresponding lexical element is a bound morpheme, ugyan. This sort of analysis could be usefully followed up with a look at similar lists of AN bigram equivalents. For example, the fairly wide discrepancy in number between occurrences of the English word human (71 occurrences) and the Hungarian word emberi (41 occurrences), which is intuitively felt to be the single obvious equivalent, is immediately noticeable. Our intuition about the straightforward equivalence between the two words is supported by lexicographic evidence.

Table 10.2 Frequency list of English and Hungarian adjectives

Rank  English      Freq.   Rank  Hungarian    Freq.
1     same         118     1     egész        114
2     old          104     2     nagy         88
3     own          83      3     kis          84
4     Big          77      4     egyetlen     74
5     possible     73      5     való         73
6     human        71      6     valamilyen   66
7     little       68      7     bizonyos     64
8     whole        63      8     fehér        56
9     necessary    60      9     fekete       52
10    small        58      10    hirtelen     50
11    different    53      11    saját        48
12    first        53      12    új           42
13    white        52      13    emberi       41
14    impossible   50      14    óriási       38
15    long         50      15    hosszú       37
16    true         48      16    nehéz        35
17    great        47      17    nagy         32
18    new          47      18    Belső        31
19    full         44      19    képes        31
20    enormous     43      20    nélküli      31
21    good         41      21    teljes       30
22    alone        40      22    különös      28
23    right        40      23    lehetséges   28
24    large        39      24    régi         28
25    real         38      25    fiatal       27
26    young        35      26    hajú         27
27    dark         32      27    jó           27
28    Newspeak     31      28    politikai    27
29    able         30      29    újabb        27
30    black        30      30    fontos       26

A medium-sized Hungarian-English dictionary of 72,000 carefully compiled lexical items,
such as Lázár and Varga (2000), lists only two senses for ‘human’, both corresponding to the Hungarian equivalent emberi. Assuming all the 41 cases of emberi are renderings of the English adjective human, the question arises as to how the other 30 uses of human are translated. This is where a simple resource like the list of the human N pattern set against the corresponding Hungarian emberi N list (shown in Table 10.3) may be very helpful. Just a quick look at the collocations entered into by human and emberi in Table 10.3 suggests that the difference must be due to the rendering of the phrase human being (34 occurrences), translated as emberi lény (its literal equivalent) in only a fraction (f = 6) of cases. If we now look at the occurrences of human being, we find that most of the instances are rendered with the single word ember, that is ‘man’. This leads to the more fundamental issue of attempting to establish cross-linguistic lexical equivalence. The problem is, as the above example has shown, that a one-to-one correspondence between the number of words in corresponding units across languages cannot be relied on. In fact, as Table 10.4 reveals, we find cases where an independent word is rendered with a bound morpheme (Row 1), a single word is rendered with two words (Row 2), a two-morpheme unit corresponds to two independent words that are not equivalents of the morphemes (Row 3) and also cases where they are (Row 4). Row 5 exemplifies a two-to-two correspondence in terms of morphemes, reduced through the rules of Hungarian orthography to a two-to-one word match.

Word-by-word alignment?

The conclusion that emerges from this exercise in lexical alignment is that word-by-word lexical matching between languages cannot be expected to work, certainly not between two languages as different from each other genealogically as English and Hungarian. Also, as a result of a variety of cultural and historical factors, the lexical stock of a language is far too idiosyncratic.
Table 10.3 List of bigrams with ‘human’ and ‘emberi’ sorted alphabetically

Bigram (English)       f     Bigram (Hungarian)      f
Human being            34    emberi agyban           1
Human brotherhood      1     emberi arcon            1
Human consciousness    2     emberi egyenlőség       5
Human drudgery         1     emberi élet             3
Human equality         5     emberi hangok           1
Human face             1     emberi jogokról         1
Human hair             1     emberi képzeletet       1
Human hand             1     emberi kéz              1
Human heritage         1     emberi közreműködés     1
Human history          1     emberi lelkeket         1
Human imagination      1     emberi lény             6
Human inequality       1     emberi lépések          1
Human intervention     1     emberi munka            1
Human labour           1     emberi szabadság        1
Human liberty          1     emberi szellemet        2
Human life             4     emberi társadalmak      1
Human lives            1     emberi teremtmény       1
Human memories         2     emberi természet        1
Human mind             2     emberi test             3
Human nature           2     emberi testvériség      1
Human societies        1     emberi tömegek          1
Human sound-track      1     emberi tudaton          1
Human spirit           1

Extending the scope of lexical alignment to include multiword units does seem to increase coverage but is equally fraught with difficulties on two counts. Firstly, the stock of multiword lexical units in any language is less clearly defined than the inventory of single
words, which is, after all, the province of a well established lexicographic tradition for most European languages. Secondly, as the bigram analysis shows, two-word units are just as prone to show a lack of direct (i.e. two-to-two) correspondence as are single words.

Table 10.4 Patterns of lexical correspondence

English          Hungarian
Fabulous         mesébe illő, ‘fitting for a tale’
Illegal          törvénybe ütköző, ‘law breaking’
Eyeless          szem nélküli, ‘without eyes’
Chicken house    csirkeól, ‘chicken shack’

The examples in Table 10.4
seem to suggest that a more relevant, although certainly more cumbersome, process would be to align them morpheme by morpheme.
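The equivalence profile proposed earlier in this section can be sketched as a simple tally over manually aligned unit pairs. Because the units are handled as plain strings, the same bookkeeping would apply whether they are words, multiword units or morphemes. The function names are illustrative, and the toy alignments and their counts are invented for the example; they are not figures from the corpus.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def equivalence_profiles(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count, for each source unit, how often each target rendering occurs."""
    profiles: Dict[str, Counter] = defaultdict(Counter)
    for source, target in pairs:
        profiles[source][target] += 1
    return dict(profiles)


def ranked_equivalents(profile: Counter) -> List[Tuple[str, int]]:
    """Rank renderings by descending frequency, breaking ties alphabetically."""
    return sorted(profile.items(), key=lambda item: (-item[1], item[0]))


# Invented toy alignments; the counts are NOT corpus figures.
toy_pairs = (
    [("human being", "ember")] * 3
    + [("human being", "emberi lény")]
    + [("Big", "Nagy")] * 2
)

profiles = equivalence_profiles(toy_pairs)
for unit, profile in profiles.items():
    # A profile with one rendering is narrow; many renderings make it diffuse.
    print(unit, ranked_equivalents(profile), "distinct renderings:", len(profile))
```

A unit with a single rendering, like Big in the toy data, has a narrow profile, while a unit with several competing renderings has a more diffuse one.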
Structural Equivalence

General characteristics

Hungarian and English show a large difference in the internal structure of NPs. While Hungarian uses left-branching in the internal patterning of NPs, English uses mostly right branching (see Figure 10.3). As a result, Hungarian pre-modification can be very long and complex, while English is far more limited in the length and depth of the premodification structure allowed in front of the NP head. Hungarian does not have prepositions; hence, attaching PPs to NPs as post-modifiers does not constitute an option. So-called adjectival participles (MI, ‘melléknévi igenevek’) can make Hungarian NPs complex to a virtually unlimited extent. In this respect they are similar to the string of post-modifier PPs that NPs can have in English. Nevertheless, Hungarian NPs are even more open-ended and productive, owing to the fact that the adjectival participles often inherit the argument structure of the verbs from which they derive, although this is difficult to quantify. Moreover, the arguments can have their own pre-modifiers, including adjectival participles, and so the recursion may continue ad infinitum in a way that would be inconceivable in a single clause in English, even if implemented with post-modifying PPs.

Hungarian: gyerekekkel teli mentőcsónak (‘children-with full lifeboat’)
English: a lifeboat full of children
Figure 10.3 Branching patterns in Hungarian and English

Basic modification patterns

The partial syntactic analysis outlined above allows us to extract modification structures from the parallel corpus. As the texts are not aligned below the sentence level, the target-language phrases corresponding to the strings in the source can only be identified manually as hits. An illustrative sample of the most common patterns is presented in Table 10.5. A comprehensive treatment of all the structural patterns found in the corpus, similar to the treatment found in Salkoff (1999), is beyond the scope of the present paper.

Table 10.5 Basic modification patterns and their structural equivalents

No.  English pattern  Hungarian pattern     English example                  Hungarian example
1    N of N           A N                   Freedom of assembly              gyülekezési szabadság, ‘gathering freedom’
2    A N ed N         A N ú/ű N             the sandy-haired woman           seszinű hajú lány, ‘no-colour haired girl’
3    N with A N       A N ú/ű N             the girl with dark hair          a fekete hajú lány, ‘the black haired girl’
4    N (Adv) Adv      A N                   the house immediately opposite   a szemközti ház, ‘the across house’
5    N (Adv) Adv      Adv AdjParticiple N   the house immediately opposite   a szemközt álló ház, ‘the across standing house’
6    N PP             N AdjParticiple N     the voice from the telescreen    a teleképből áradó hang, ‘the from telescreen flowing voice’

Table 10.5 presents a sample of our findings from the corpus, omitting the AN pattern discussed above. It can be seen that there is a whole
array of patterns displayed by modification structures when compared cross-linguistically. It is interesting to note that although all the Hungarian expressions include an adjective or an adjectival participle (in the present discussion grouped together with adjectives), all of them are derived adjectives. It seems, then, that the structural complexity of the English expressions is mirrored in Hungarian by morphological complexity. Also noticeable is the fact that these patterns are productive and the correspondence is paratactic in the sense that the content words are lexical equivalents (‘girl’-lány, ‘black’-fekete, ‘hair’-haj) although they are systematically rearranged to yield well formed phrases in the given language.
N PP patterns

As noted above, one of the most frequent post-modification structures of English NPs is the NP head followed by a prepositional phrase; as Hungarian lacks prepositions, structural parallels are ruled out. Moreover, the symmetry shown in Figure 10.3 cannot always be implemented: mapping the Hungarian equivalent of the prepositional phrase immediately in front of the noun head yields an ungrammatical phrase in Figure 10.4:
Figure 10.4 Post vs. pre-modification structures between English and Hungarian
As noted by Klaudy (1997: 331–332), in such cases Hungarian needs a ‘support verb’ to ‘glue’ the two constituent phrases together, and as main verbs do not occur in NPs, they appear in their adjectival form, that is, as adjectival participles. A similar structure is also possible in English, for example, ‘the cat sitting on the mat’ rather than ‘the cat on the mat’, but whereas in English the insertion of the -ing verb form is optional in most cases, in Hungarian it is an expediency of the pre-modification structure. The choice of the support verb, however, is not a straightforward matter and seems to be subject to a variety of linguistic factors. Consider (a)–(d) in Table 10.6. The semantic relation between the head and the postmodifier can be characterised as stative existential locative; hence the least marked rendering of this relationship in Hungarian would be realised by the copula van, the appropriate adjectival participle form of which is levő (‘being’). In principle, then, all four examples could have been rendered through the Hungarian word levő. The fact that none of them has actually been translated in this way must have been a conscious choice by the translator in which relevant aspects of the context, such as the most likely position of the dark-haired girl in the particular situation, would have been considered. Where posture is not known or is irrelevant (as in d), the support verb is chosen so as to be appropriately general, i.e. tartózkodó (‘staying’). Examples (e)–(f) in Table 10.6 show how the choice of support verb is sensitive to more subtle inherent features of the situation. The feature in question is, informally, DYNAMIC, which is triggered by the preposition from.
Table 10.6 N PP structures and their Hungarian equivalents

      English                               Hungarian
(a)   the doubtful date on the page         papíron álló kétes dátum, ‘on paper standing doubtful date’
(b)   the page in front of him              az előtte fekvő könyvoldal, ‘the in-front-of-him lying bookpage’
(c)   the dark-haired girl behind him       a mögötte ülő fekete hajú lány, ‘the behind-him sitting black haired girl’
(d)   the people in the room                a teremben tartózkodó emberek, ‘the in-room staying people’
(e)   the voice from the telescreen         a teleképből áradó hang, ‘the from-telescreen flowing voice’
(f)   the cold water from the tap           a csapból folyó hideg víz, ‘the from-tap flowing cold water’
(g)   the neighbour on the same floor       az emeleten lakó szomszéd, ‘the on-floor living neighbour’
(h)   our boys on the Malabar front         a malabari fronton harcoló fiaink, ‘the malabar on-front fighting our boys’

Again, it would have been adequate to render this feature with a general-purpose support verb like jövő (‘coming’) in the sense that a teleképből jövő hang (‘the voice coming from the telescreen’) and a csapból jövő hideg víz (‘the cold water coming from the tap’) are both equally good
translation equivalents of the original. Any further detail in the description would be a matter of explicitation decided by the translator. Example (g) illustrates how the selection of the right support verb is influenced by the inherent time relations, that is, GENERIC versus SPECIFIC time. In order to decide on the appropriate verb, it must be considered whether the location of the neighbour is meant generically or specifically on the occasion described. Thus, as Hungarian distinguishes between a neighbour who lives on that floor and one who is there but does not live there, az emeleten lakó szomszéd (‘the neighbour living on the same floor’) may have a different reference than az emeleten levő szomszéd (‘the neighbour on the same floor’). Note how in the latter case there is no support verb in English, which is generally the case where a stative existential locative relationship is concerned. The translation in example (h) goes perhaps the furthest in terms of explicitation. The verb levő (‘being’) may have been adequate but semantically not the most relevant support verb to use. The case is similar to (g) except here the existential be would not have been misleading as fighting on the Malabar front presupposes being there, while being on the same floor certainly does not presuppose living there.
Modification of event expressions

Event expressions constitute a special case in NP modification in that both Hungarian and English allow for an almost open-ended string of modifiers, Hungarian to the left, English to the right of the head of the NP. In English the head is expressed by a noun, like display in example (a) in Table 10.7, a gerund or participle (examples (b), (c), (d) and (f)) or an infinitive as in example (g). In Hungarian the corresponding heads are expressed with deverbal, i.e. derived, nouns. In examples (b)–(g) pre-modification structures are attached to the head through the semantically empty word való, the adjectival participle form derived from the copula van. It thus plays the same role as the form levő, that is, it turns the expression to its left into an adjectival phrase and connects it to the head of the NP.

Table 10.7 Modification of event expressions

      English                                                Hungarian
(a)   a poster too large for indoor display                  épületen belüli elhelyezés céljára túlságosan is nagyméretű plakát, ‘in-building inside placement for-purpose overly large-sized poster’
(b)   dealing on the free market                             szabadpiacon való üzletelés, ‘on-freemarket being dealing’
(c)   writing by hand                                        kézzel való írás, ‘with-hand being writing’
(d)   drilling with dummy rifles                             játékpuskákkal való gyakorlatok, ‘with-toygun being exercises’
(e)   raising the chocolate ration to twenty grams a week    a csokoládé fejadag heti húsz grammra való emeléséért, ‘of the chocolate ration weekly twenty to-grams being raise’
(f)   (began) copying a passage into the diary               az egyik fejezetnek a naplójába való másolása, ‘the one chapter’s the into-diary being copy’
(g)   to return to work                                      a munkához való visszatérés, ‘the to-work being return’

Derived adjectives in Hungarian pre-modification structures

A systematic comparison of English and Hungarian pre-modification structures involving adjectives draws attention to an interesting set of phenomena in Hungarian. Examples (a) and (b) in Table 10.8 are offered as illustrations of a very productive pattern in Hungarian.

Table 10.8 Pre-modification structures involving Hungarian derived adjectives

      English                            Hungarian
(a)   an oblong wire cage                tégla alakú drótkalika, ‘rectangular shaped wire cage’
(b)   a dark-haired girl                 egy fekete hajú lány, ‘a black haired girl’
(c)   a quick-witted woman               jó felfogású nő, ‘a good perception woman’
(d)   the leather-seated model           A bőr üléses model, ‘the leather seated model’
(e)   the man with the overcoat          A nagykabátos férfi, ‘the bigcoated man’
(f)   a deep, slow, rhythmical chant     mély hangú, lassú ütemű, ritmikus ének, ‘deep voiced, slow paced, rhythmical chant’

It appears that in Hungarian it is simply not possible to attach certain adjectives directly to
the head. This category includes adjectives expressing shape, which are instead attached to the head through the use of the Hungarian equivalent of ‘shaped’, denoting the way in which the head has the required property. There are some apparent exceptions like körszínház (‘round theatre’), körtelefon (‘a round of phone calls’), etc., but in the overwhelming majority of cases geometrical shape is expressed through the mediation of the word alakú (‘shaped’) as in example (a). In expressions signalling some physical or mental property of human beings, as shown in examples (b) and (c), such pre-modification structure seems to be extremely productive. The pattern can be generalised as X has y of property z. In Hungarian this is expressed with the help of the derivational suffix -ú when it expresses some personal trait, or -os in cases when a more general possession or part-whole relationship is involved. Example (f) serves to illustrate a case where the y element (‘voice’, ‘pace’), denoting a property of X (‘chant’), is optional, definitely in the case of lassú (‘slow’), perhaps less so in the case of mély (‘deep’). On the other hand, the corresponding words can also be optionally expressed in English, as shown by the glosses. Note that none of the derived adjectives may occur alone without the adjectives, as shown by the ungrammaticality of the phrases *alakú drótkalika, *hajú lány or, for that matter, *haired girl. This fact testifies to the special syntactic/semantic role of these derived adjectives in Hungarian.

Explicitation in NP modification

The question may arise to what extent the structures studied in the preceding sections may justifiably be viewed as instances of explicitation, the operation through which the translated text is rendered more specific and explicit than the original (see Blum-Kulka, 1986; Klaudy, 1998).
Explicitation often, although not necessarily always, involves the addition of text to the translated version; hence we should examine all those cases in which the Hungarian version displays additional words with no counterpart in the original. As noted in the literature (Klaudy, 1997) and amply borne out by our examples, most of the insertions are due to grammatical necessity, the inserted form being needed in order to provide a corresponding grammatical expression. In my opinion, there is little justification for regarding such cases as examples of explicitation. On the other hand, considering more closely the syntactic and semantic properties of elements which have been added, it becomes possible to make finer distinctions and, in fact, distinguish explicitation from syntactic expedience. Accordingly, we should consider separately instances where the additional element is compulsory and lacks an alternative, as in the case of pre-modification with event expressions as head (see Table 10.7), and those cases in which the translator has exercised a choice that is unrelated to syntactic requirements. The Hungarian word való, for instance, is a purely grammatical device required to attach the premodifiers to the head; it carries no discernible semantic value of its own. The notion of explicitation in the sense of an optional addition cannot arise in such cases. More equivocal are the patterns where the inserted word we find in the translation does have alternatives. Table 10.6 contains the relevant data. As we noted there, for each of the different types of expression (existential locative, dynamic etc.), one unmarked verb was found to be appropriate to the situation, for example, levő, jövő, szóló (‘being’, ‘coming’ and ‘sounding’). As the use of a support verb is required in the examples listed, use of one of these three ‘default’ verbs would not have been regarded as a means of explicitation.
However, the choice of any other verb that not only serves the syntactic purpose of gluing the pre-modifier and the head together but also adds semantic content to the expression, e.g. sitting versus being, should indeed be considered an instance of explicitation. Accordingly, if (c) in Table 10.6 is rendered with the default stative support verb levő ‘being’ (a mögötte levő lány, ‘the behind him being girl’), it should not be considered an instance of explicitation. The actual form, however, is seen as explicitation because it inserts the optional additional information that the girl was sitting. This is especially the case if it is not motivated by a stylistic imperative to avoid a repetition of the same item in the text. This ‘decision tree’ gives fairly clear guidance on how to distinguish genuine cases of explicitation from grammatical necessity.
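The ‘decision tree’ can be written out as a short classification routine. This is one possible reading of the criteria discussed above, not a formalisation proposed in the chapter: the field names, the category labels and the treatment of stylistically motivated choices as a ‘weaker case’ are assumptions made for the sketch.

```python
from dataclasses import dataclass

# The three unmarked ('default') support verbs named in the text.
DEFAULT_SUPPORT_VERBS = {"levő", "jövő", "szóló"}


@dataclass
class InsertedElement:
    form: str                        # element added in the Hungarian version
    has_alternative: bool            # could a different form fill the same slot?
    avoids_repetition: bool = False  # chosen only to avoid repeating an item?


def classify(element: InsertedElement) -> str:
    """Walk the decision tree; the returned labels are illustrative only."""
    if not element.has_alternative:
        # Compulsory element with no competitor, e.g. való with event nouns.
        return "grammatical necessity"
    if element.form in DEFAULT_SUPPORT_VERBS:
        # A support verb is required and the unmarked one was chosen.
        return "grammatical necessity"
    if element.avoids_repetition:
        # Extra content is added, but the motive may be purely stylistic.
        return "weaker case for explicitation"
    return "explicitation"


for element in (
    InsertedElement("való", has_alternative=False),
    InsertedElement("levő", has_alternative=True),
    InsertedElement("ülő", has_alternative=True),
):
    print(element.form, "->", classify(element))
```

On this reading, only the last item, corresponding to the actual rendering of example (c) in Table 10.6, is classed as explicitation.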
Summary

In this chapter, an attempt has been made to examine how a parallel corpus can be used to explore differences in NP modification structures between English and Hungarian. After presenting the tools and the methodology for analysing the corpus, the data were examined at the level of lexical and syntactic equivalence. The investigation has been confined to NPs as they typically serve to express the main participant roles in an event or situation described in a sentence. Hence, it is reasonable to expect it to be possible for them to be rendered in either language. The working hypothesis adopted here was that translations should be expected to follow the original unless there are compelling linguistic, stylistic or other reasons for departure from the source text. Obviously, the largely idiosyncratic lexicalisation patterns and the more or less divergent syntactic structures of languages rule out any narrow interpretation of this hypothesis. Nevertheless, it has been useful to assume this linguistically naïve view of translation, serving as an underlying assumption in evaluating comparative statistical frequency data. The highly convergent summary figures relating to the number of sentences and, more importantly, the number of modified NPs in the parallel corpus provide corroborative evidence that even in a genre as far from factual informative writing as Orwell’s novel, the structuring of content was done in a broadly consistent manner. To view the issue from a different perspective, we can conclude that the top-level NP seems to be a unit with which an overall, close correlation might be expected to be found in translation. There seems to be a scale of diminishing correlation as we go down the list of the structural units of the texts: paragraphs, sentences, top-level phrases, embedded phrases, words and morphemes. The higher the level, the closer the alignment between the corresponding units in the two languages.
Our findings suggest that top-level NPs seem to be units that are comparable, if not in their internal structure (as they certainly are not between English and Hungarian) then at least in number. From a communicative perspective, their use seems to be a natural way of packaging the basic information units of sentences in terms of their main components, that is, who did what, to whom, where and so on. On the other hand, word alignment, although a far more popular pursuit in corpora studies, proved precarious. The very simple techniques of identifying unigram and bigram data highlighted patterns of difference in lexicalisation and syntactic structure. One conclusion that suggested itself from such crude statistical comparisons was that crosslinguistic analysis of the lexical stock is, in the case of Hungarian, more relevant at the morpheme level. Word-by-word matching of vocabulary
items can be misleading either because elements in an equivalence pair may be bound morphemes or because they constitute more than one word. The present analysis could not attempt to provide an exhaustive list of patterns of NP modifications in Hungarian and English but it is hoped that it has succeeded in covering the main types. The method used here could, in principle, allow a more precise specification of the notion of main types in terms of frequency of the NP units in both languages. However, to carry out a precise and comprehensive automated matching of top-level NPs between the two languages would have run into difficulties in establishing the boundaries of the maximal extension of N heads. Nevertheless, as was noted above, the semi-automatic procedure whereby the modified head was spotted automatically and its extension established manually proved to be an effective approach, curtailed only by the amount of manual effort required. A closer look at the internal structure of NPs has quickly led to the finding of contrasting rules in the two languages, but more importantly, to the problem of rendering English N PP sequences in Hungarian. Our preliminary findings suggest a set of subtle linguistic features at work in the selection of the compulsory adjectival participle that is used to attach the modifying phrase to the left of the head in Hungarian. Closely related are the Hungarian pre-modification patterns involving event expressions, where the argument structure of the event as well as free adjuncts to the event structure are all packaged by means of adjectival participles into a single pre-modification structure. All these cases, together with some phrases involving derived adjectives, are rendered in Hungarian through the insertion of some additional element, which is commonly held to be a manifestation of the process of explicitation.
In our view, however, the insertion of additional material does not in itself amount to explicitation: there are solid linguistic criteria that can be applied to decide when an inserted element serves the purpose of making the translation more explicit. It is hoped that the present discussion has succeeded in pointing out the usefulness of applying data from parallel corpora to investigating patterns of textual equivalence as well as translation strategies. Obviously, it could do little more than merely open up some avenues for future research.

Note

1. This is a revised version of a paper of the same title that appeared in Krisztina Károly and Ágota Foris (eds) New Trends in Translation Studies. In Honour of Kinga Klaudy (pp. 193–206). Budapest: Akadémiai Kiadó.
References

Blum-Kulka, S. (1986) Shifts of cohesion and coherence in translation. In J. House and S. Blum-Kulka (eds) Interlingual and Intercultural Communication (pp. 17–35). Tübingen: Günter Narr.
Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H.J., Petkevic, V. and Tufis, D. (1998) MULTEXT-East: Parallel and comparable corpora and lexicons for six Central and Eastern European languages. In COLING-ACL ’98 (pp. 315–319). Montréal, Québec, Canada.
Klaudy, K. (1997) A fordítás elmélete és gyakorlata. Budapest: Scholastica.
Klaudy, K. (1998) Explicitation. In M. Baker (ed.) Routledge Encyclopedia of Translation Studies (pp. 80–84). London: Routledge.
Melamed, D. (2001) Empirical Methods for Exploiting Parallel Texts. Cambridge, MA: MIT Press.
Olohan, M. (2004) Introducing Corpora in Translation Studies. London/New York: Routledge.
Salkoff, M. (1999) A French-English Grammar. A Contrastive Grammar on Translational Principles. Amsterdam/Philadelphia: John Benjamins.
Simov, K., Simov, A., Ganev, H., Ivanova, K. and Grigorov, I. (2004) The CLaRK system: XML-based corpora development system for rapid prototyping. In M.T. Lino, M.F. Xavier, F. Ferreira, R. Costa and R. Silva (eds) Proceedings of Language Resources and Evaluation Conference 2004 (pp. 235–238). Lisbon, Portugal.
Váradi, T. (2003) Shallow parsing of Hungarian business news. In D. Archer, P. Rayson, A. Wilson and A. McEnery (eds) Proceedings of Corpus Linguistics 2003, Lancaster University, England, pp. 845–851.
Chapter 11
A Study of the Mandative Subjunctive in French and its Translations in English: A Corpus-Based Contrastive Analysis

NOËLLE SERPOLLET
Introduction

One of the aims of this chapter is to study the interaction between corpus linguistics and Translation Studies and, more specifically, to explore the impact of the former on the latter. I will deal here with a contrastive study of one type of construction in French and its translations in English, using computer-based work on a bilingual corpus. Corpus-based contrastive analyses of specific grammatical features of French and British English (BrE) are not numerous. While several contrastive linguistic studies using corpora have been developed between English and a number of other European languages, very little work using computerised corpora has been undertaken on French in relation to grammatical features. My contribution will focus on a pair of languages on which little contrastive analysis using computerised corpora has been carried out, and on an area of grammar that has not been investigated bilingually, as previous research has tended to focus on English.

Citing Malmkjær (1999: 7), my ultimate aim is to ‘step towards developing a sound theoretical basis and a methodology for carrying out research on the borderline between descriptive linguistics and Translation Studies’. Therefore, I would like to show that a linguistic approach, using a corpus linguistics methodology within a contrastive framework, can provide insights into the practice of translation. Thus, my objective here will be to undertake an analysis of particular grammatical features, namely mandative constructions in French and English, in two genres: Press and Learned Prose.

Example (1) illustrates a French mandative subjunctive and its translation into English by a mandative subjunctive. Governing expressions that trigger the subjunctive or, in English, the respective translation, are marked in italics in all examples.
(1) (a) L’état-major du SPD s’y refusait, en réclamant que le problème de l’immigration soit abordé dans son intégralité et à l’échelle européenne. (Le Monde, 1992)
(b) The SPD executive refused and demanded that the question of immigration be considered in its overall European context. (The Guardian Weekly, 1992)

Example (2) provides a mandative construction with should and a subjunctive as its French equivalent:

(2) (a) In general, it is recommended that circuits operated at a particular modulation rate should not be routed over nominally lower rate VTF channels, whenever this can be avoided. (International Telecommunications Union)
(b) En général, il est recommandé que les circuits exploités à une rapidité de modulation déterminée ne soient pas acheminés sur des voies de télégraphie harmonique d’une rapidité nominale inférieure, chaque fois que cela peut être évité. (ITU)

Through the study of a problematic area of grammar, here on the one hand the translation(s) of the French mandative subjunctive in English, and on the other hand the translation(s) of the occurrences of mandative constructions in French, I will examine how a contrastive analysis of two linguistic systems can make a significant and positive contribution to the practice of translation. I intend to show how an aligned parallel corpus can be used to study how mandative constructions are translated from one language to another and to investigate the consistency or lack of translation equivalence across languages.
Corpus Linguistics, Contrastive Analysis and Translation Studies

Translation Studies is nowadays an expanding multidiscipline (or ‘field of study’; Baker, 1993: 234). Baker borrows the term from Eco (1976: 7), who distinguishes between a discipline, which has ‘its own method and a precise object’, and a field of study with ‘a repertoire of interests that is not as yet completely unified’ including both pure and applied translation studies. Corpus linguistics is another discipline that has been growing fast in recent years and is seen as a powerful methodology, a tool for research, rather than a subject in its own right. It can be used to examine and verify in a ‘real context’ the validity of theoretical linguistic hypotheses. My object here is not to deal in detail with Translation Studies, but to illustrate the contribution that a corpus linguistic analysis of a grammatical category can make to this field of study. My work is limited to the field of contrastive analysis and to the use of original texts and their translations. Figure 11.1 summarises my approach to translation in general.

[Figure 11.1 Corpus studies as a ‘link’. Adapted from Holmes (1988: 71). Diagram labels: Translation Studies; PURE Translation Studies, comprising Descriptive Translation Studies (describes phenomena of translating and translation(s)) and Translation theory (establishes general principles in order to explain and predict these phenomena); APPLIED Translation Studies (teaching of translation); Corpus Studies/Corpus Linguistics.]

As reflected in the diagram, corpus linguistics is increasingly being employed in different aspects of Translation Studies, as it is able not only to link translation and linguistics, presenting very interesting research opportunities, but also to bridge the gap between different aspects of Translation Studies. Hence, Figure 11.1 is an attempt to illustrate the existing links between a discipline and a methodology. According to Leech (1992: 106): ‘corpus linguistics refers not to a domain of study, but rather to a methodological basis for pursuing linguistic research.’ Leech adds that in fact corpus linguistics combines with other branches of linguistics by means of corpora. Hence, there is nothing to prevent translators from ‘combining techniques of corpus linguistics with the subject-matter’ of Translation Studies (Leech, 1992: 106–107). Furthermore, ‘[c]orpus Translation Studies is central to the way Translation Studies as a discipline will remain vital and move forward’, according to Tymoczko (1998: 652), who adds that ‘corpus Translation Studies change in a qualitative as well as quantitative way both the content and the methods of the discipline of Translation Studies, in a way that fits with the modes of the information age’.

Thus, it has now become common practice to use bilingual parallel corpora or translation corpora as practical tools in order to throw light on particular translation problems and to train translators. These corpora are defined in Baker (1995: 230) as being composed of ‘original source
language texts in language A and their translated version in language B’. They can be employed

• to study the translation equivalents used in parallel corpora;
• to offer an insight into the linguistic systems of two languages, or as Barlow (1995) puts it ‘the result of a search can be examined in an attempt to find out how the second language expresses the notion captured by the search term in the first language’;
• to examine and verify the validity of theoretical linguistic claims; and
• to train translators.
Further testimony to the usefulness of bilingual data and of the important role that corpora may play in descriptive cross-linguistic research can be found in Johansson & Hofland (1994: 36) and in Johansson (1998: 22) (see amongst others Baker, 1993; Johansson, 1998; Teubert, 1996):

The importance of computer corpora in research on individual languages is now firmly established. If properly compiled and used bilingual and multilingual corpora will similarly enrich the comparative study of languages.

More and more studies are being carried out in contrastive linguistics and Translation Studies and this has led to a growing recognition that cross-linguistic research must be based on naturally occurring language data derived from multilingual corpora.

Therefore, according to Altenberg and Aijmer (2000: 15–16), parallel corpora ‘have come to be recognised as indispensable resources for cross-linguistic research at all levels of linguistic description, for theoretical as well as practical purposes’.

Translation units such as mandative constructions do not stand in a stable one-to-one correspondence with each other across languages (French/English). In this contribution, I intend to explore the degree of correspondence between French and English selected mandative constructions emerging in translations between the two languages, bearing in mind that correspondence is used not only to describe the relation between element X in one language and element Y in the other language, but also as a term for the elements that enter into this relation. The study of the mutual correspondence of translations between two languages may thus help us gain new insights into the languages compared:

A study of mutual correspondences between categories and items in source texts and target texts does not only reveal language-specific properties of the categories compared, but provides fruitful insights into the larger systems of which these categories are part and how
these systems interact with each other and with other related systems. (Altenberg, 1999: 266)¹

Returning to the links between translation and corpus linguistics, I agree with Ulrych (1997: 425) that ‘[c]omputerized corpus studies are therefore not only having a considerable impact on both descriptive and applied Translation Studies but also have the potential of merging the two branches if appropriate corpora are used’. One important aspect of contrastive analysis is that ‘it can be used as a method, or discovery procedure in linguistic description over and above its usefulness for applied linguistics’ (Johansson, 1996: 127). In what follows, a corpus linguistic method is used to analyse contrastive data and hence to describe mandative constructions in both French and English.
Motivation and Objectives

How it all started

The data used in the contrastive analysis undertaken in the present study have been extracted from the parallel corpus International Sample of English Contrastive Texts Corpus (INTERSECT) (Salkie, 1995; 2000). Two subparts of the corpus have been analysed, consisting of English translations of French press articles (‘Press’) and French translations of English ‘learned articles’ (see later section on Data Investigated for more details). These parts of the corpus under investigation can be used as a test bed for Translation Studies, in that they ‘can be used to call up sets of words or grammatical features in one language for their examination, and/or for the call up of the foreign language equivalents in the parallel aligned segments’ (McEnery & Oakes, 1996: 212). According to Holmes (1988: 100), the notion of ‘equivalence’ between source and target texts is a goal that translators wish to reach whereas what they actually achieve is a ‘network of correspondences, or matchings with a varying closeness of fit’. As can be seen in Examples (3) and (4), a French subjunctive may correspond in English to a subjunctive form or to the ‘should + infinitive’ construction.

(3) (a) Puisque la bataille référendaire, pour être gagnée, implique qu’il soit critiqué aussi par les partisans du ‘oui’, il accepte, même s’il trouve que certains de ses amis en font un peu trop. (Le Monde, 1992)
(b) Since winning the referendum also requires that it be criticised by the supporters of the ‘yes’ vote as well, Delors accepts it, even if he finds that some of his allies are going a little too far in that direction. (The Guardian Weekly, 1992)
(4) (a) M. Pinto de Andrate croit ‘indispensable qu’un autre parti ou une alliance puisse recueillir quelque 30% des suffrages et jouer le rôle d’une minorité de blocage’. (Le Monde, 1992)
(b) He feels it is vital that a third party or alliance should be able to muster about 30 per cent of the votes and act as a blocking minority. (The Guardian Weekly, 1992)

During the analysis of the data, it emerged that there was no one-to-one translation between the French mandative subjunctive and English mandative constructions. The different possible translations of the French mandative subjunctive obtained in the specialised part of the Press corpus include: a mandative subjunctive, the modal should in its mandative use, an indicative, a modal, an infinitive, a past participle and other constructions. The subject of equivalent constructions will be examined in more detail in the discussion on bilingual analysis.
Objectives

The analysis of selected texts extracted from the French/English parallel INTERSECT corpus (Salkie, 1995) was carried out using ParaConc, a bilingual parallel text concordancer (Barlow, 1995). With this program, ‘the result of performing a search can be examined in an attempt to find out how the second language expresses the notion captured by the search term in the first language’ (Barlow, 1995: 14). The French section of the corpus was annotated with Cordial 6 Universités (a software package from Synapse Développement which enabled me to tag the occurrences of the subjunctive in French).

I will start by working from French into English, analysing the Press category of the INTERSECT corpus (Le Monde [1992–93] as translated in The Guardian Weekly), and reporting on the English equivalents of the French subjunctive (as represented in Figure 11.2). I will then return to French, this time studying the Learned Prose category of INTERSECT, and describing the mandative constructions in English and their translations in French (also represented in Figure 11.2). In both cases, I will examine the types of governing expressions triggering the constructions in order to ascertain what differences exist between source and target texts.

Finally, these results (from the English section of INTERSECT) will be compared with the findings obtained through a comparative analysis of extracts, equivalent in size, date and categories, from the one-million-word corpora of BrE, the Lancaster-Oslo/Bergen corpus (LOB) and the Freiburg-LOB corpus (FLOB), as represented in Figure 11.3.
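The kind of search quoted from Barlow can be pictured with a minimal sketch in Python. It is not ParaConc and makes no claim about how that program works internally: the function name parallel_concordance, the search pattern and the third sentence pair are assumptions introduced purely for illustration, while the first two pairs are taken from Examples (1) and (16) of this chapter.

```python
import re

def parallel_concordance(pairs, pattern):
    """Return the aligned (source, target) pairs whose source side matches pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if rx.search(src)]

# Sentence-aligned pairs: one French sentence with its English translation.
pairs = [
    ("L'état-major du SPD s'y refusait, en réclamant que le problème de "
     "l'immigration soit abordé dans son intégralité et à l'échelle européenne.",
     "The SPD executive refused and demanded that the question of immigration "
     "be considered in its overall European context."),
    ("ce qui importe c'est que la Pologne soit catholique.",
     "The important thing is that Poland should be Catholic."),
    # Invented filler pair (not corpus data), included so that one pair fails to match.
    ("Le comité se réunit demain.", "The committee meets tomorrow."),
]

# Hypothetical search: "que" followed later in the sentence by the subjunctive "soit".
hits = parallel_concordance(pairs, r"\bque\b.*\bsoit\b")

for french, english in hits:
    print(french)
    print("  ->", english)
```

A pattern of this kind over-generates (it would also match concessive quel que soit, for instance), which is one reason why the hits still have to be edited manually.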
[Figure 11.2 Objectives: analysis in INTERSECT. Text categories in the corpus: Pr: Press; LP: Learned Prose.]
Identifying Relevant Forms in the Corpus

Mandative subjunctive in French

I will not describe here the subjunctive mood in detail. However, it should be noted that many occurrences of subjunctives in French are similar to the indicative forms; it is easier to recognise irregular verbs like être/soit, avoir/ait, ayons, faire/fasse, etc. as exemplified in (5) (cf. subjunctive fasse; indicative fait).

(5) Dans la même émission, un responsable communal social-démocrate du sud de la Suède, qui exige que l’accueil des réfugiés de l’ex-Yougoslavie sur le territoire de sa commune fasse l’objet d’un référendum, expliquait que ‘La Suède n’a pas les moyens’ de sa politique d’immigration. (Le Monde, 1992)
As far as what I call the ‘mandative subjunctive’ is concerned, the choice of this mood in embedded clauses introduced by que is conditioned or favoured by the proximity of a trigger verb (volition verbs such as demander, souhaiter) or by mandative contexts expressing futurity and volition (volitional element of the mandative context). Such a grammatical form is triggered by expressions equivalent to the English ones I will mention below.

[Figure 11.3 Objectives: comparison of INTERSECT and LOB/FLOB. Corpora shown: LOB (1961) and FLOB (1991). Text categories in the corpus: Pr: Press; LP: Learned Prose; F: Fiction; GP: General Prose.]
Mandative constructions in English: a definition

In this section, I will describe mandative constructions in English. I will also briefly mention the criteria used to identify the subjunctive. Etymologically, the term ‘mandative’ is derived from the verb ‘mandate’, itself coming from the Latin mandare, meaning ‘to enjoin’, ‘to command’. In a that-clause, mandative constructions follow ‘mandative expressions’, which may be verbs, nouns and adjectives that I also call ‘triggers’. These governing expressions triggering the mandative constructions express a demand, request, intention, proposal, suggestion, recommendation, etc. (see Quirk et al., 1985: 1012–1015).

The analysis described below deals with constructions such as the ones in Examples (6), (7) and (8) (in bold), triggered by three possible types of governing expression (in italics) (sub-sections of the FLOB and LOB corpora are indicated by upper case letters; see ‘Data investigated’ this chapter).

(6) [. . .] nor to obtain an order that the child be accommodated by them. [. . .] (FLOB Learned Prose, H)
mandative subjunctive triggered by a noun

(7) [. . .] but it is also very important that they should be fair. (LOB Press, B)
mandative should triggered by an adjective

(8) [. . .] usually by recommending that politicians or administrators introduce incentive [. . .]. (FLOB Learned Prose, J)
non-distinctive form triggered by a verb, see explanation below.

Different verb forms can follow the mandative expressions, as we have seen in Examples (6), (7) and (8). We may also find an indicative, as in Example (9). This last construction will not be considered further in this chapter.

(9) Conditions have dictated that operations were scaled down enabling overheads to be reduced [. . .].
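The definition given in this section, a trigger expression followed by a that-clause, lends itself to a rough retrieval heuristic. The Python sketch below is an illustration only and is not the procedure used in the study: the list of trigger stems, the limit of three intervening words and the function name find_mandative_candidate are all assumptions. It returns candidates, not confirmed mandative constructions, since a that-clause after a trigger may also contain an indicative, as in Example (9).

```python
import re

# Hypothetical, deliberately incomplete list of trigger stems (verbs, nouns, adjectives).
TRIGGER_STEMS = ["demand", "recommend", "request", "requir", "propos",
                 "insist", "order", "important", "desirable"]

PATTERN = re.compile(
    r"\b((?:" + "|".join(TRIGGER_STEMS) + r")\w*)"  # trigger stem plus any inflection
    r"(?:\s+\w+){0,3}?"                             # at most three intervening words
    r"\s+that\b\s+(.*)",                            # 'that' and the rest of the clause
    re.IGNORECASE,
)

def find_mandative_candidate(sentence):
    """Return (trigger, text after 'that') for the first candidate, or None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)

examples = [
    "The SPD executive refused and demanded that the question of immigration "
    "be considered in its overall European context.",   # Example (1b)
    "but it is also very important that they should be fair.",  # Example (7)
    "he proposes to Isabella that she join his plan to frame Mariana",  # Example (10)
    "The committee met on Tuesday.",  # invented sentence with no trigger
]

for sentence in examples:
    print(find_mandative_candidate(sentence))
```

Each candidate still has to be inspected by hand to decide whether the verb in the that-clause is a subjunctive, a should construction, a non-distinctive form or an indicative.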
In the case of the subjunctive, this is a verb form not easy to identify in English as it is identical to the base form of the verb. According to Asahara (1994: 2) ‘the present subjunctive refers to a grammatical form that takes only the base form of the verb regardless of tense contrast, person and number concord’. With a plural subject for example, there is no difference between indicative and subjunctive forms. In the following cases, however, the non-inflected or morphological subjunctive is distinguishable from the indicative through morphological criteria:

• in the third person singular present tense (no final -s) (as in Example (10));
• in past contexts (no sequence of tenses) (11);
• in finite forms of be (base form for all persons and no tense marker) (12) and (13);
• in negated clauses (no do-periphrasis and not is placed before the verb) (14).
(10) [. . .] he proposes to Isabella that she join his plan to frame Mariana [. . .]. (FLOB General Prose, G)
(11) Russia insisted that the Western powers take immediate measures to put an end to the unlawful and provocative actions of the Federal German Republic in West Berlin. (LOB Press, A)
(12) Hence it is important that the process be carried out accurately. (FLOB Learned Prose, H)
(13) Conditions have dictated that operations be scaled down enabling overheads to be reduced [. . .]. (LOB Press, A)
(14) Moreover, it requires that the concepts F(x) and G(x) not themselves contain any quantification [. . .]. (FLOB Learned Prose, J)

I included in the frequency counts, as occurrences of the mandative subjunctive forms, not only the distinctive/genuine subjunctive forms but also the non-distinctive forms, which are indistinguishable from the indicative. In such cases, we can perform a substitution test: in Example (8), for instance, substituting a third person singular subject for ‘politicians or administrators’ results in a distinctive subjunctive form. Nonetheless, Example (8) is perfectly acceptable with the indicative. This illustrates the fact that we do have a non-distinctive form in (8), as both the subjunctive and the indicative would be acceptable.
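The four morphological criteria and the notion of a non-distinctive form can be restated as a small decision procedure. The Python sketch below is a simplification under stated assumptions: the clause features are supplied by hand rather than derived by a parser or tagger, the class ThatClause, its field names and the category labels are hypothetical, and modals other than should are left out of scope.

```python
from dataclasses import dataclass

@dataclass
class ThatClause:
    verb: str               # first verbal element of the that-clause, e.g. "be", "join", "should"
    base: str                # base form of that verb, e.g. "be" for "were"
    subject_3sg: bool        # True if the subject is third person singular
    matrix_past: bool        # True if the governing expression is in a past context
    bare_not: bool = False   # True if "not" precedes the verb without do-periphrasis

def classify(clause):
    """Assign a that-clause to one of four categories (simplified sketch)."""
    if clause.verb == "should":
        return "mandative should"
    if clause.verb != clause.base:
        return "indicative"  # inflected form such as "were" or "has"
    # From here on the verb is in its base form.
    if (clause.base == "be" or clause.bare_not
            or clause.subject_3sg or clause.matrix_past):
        return "distinctive subjunctive"
    return "non-distinctive form"  # base form that an indicative would share

# Features entered by hand for examples quoted in this chapter.
ex7 = ThatClause("should", "be", subject_3sg=False, matrix_past=False)
ex8 = ThatClause("introduce", "introduce", subject_3sg=False, matrix_past=False)
ex9 = ThatClause("were", "be", subject_3sg=False, matrix_past=False)
ex10 = ThatClause("join", "join", subject_3sg=True, matrix_past=False)
ex11 = ThatClause("take", "take", subject_3sg=False, matrix_past=True)
ex12 = ThatClause("be", "be", subject_3sg=True, matrix_past=False)
ex14 = ThatClause("contain", "contain", subject_3sg=False, matrix_past=False, bare_not=True)

for name, clause in [("7", ex7), ("8", ex8), ("9", ex9), ("10", ex10),
                     ("11", ex11), ("12", ex12), ("14", ex14)]:
    print(name, classify(clause))
```

The substitution test corresponds to changing subject_3sg from False to True for Example (8): the same base form is then classified as a distinctive subjunctive.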
Data Investigated

The INTERSECT corpus contains about 1.5 million words in French as well as English and has been manually aligned at the sentence level (Salkie, 1995; 2000). I am concerned with the following text categories:

• Press: extracts from the newspaper Le Monde 1992–93, ca. 113,000 words and their translation in The Guardian Weekly, ca. 114,000 words;
• Learned Prose: EU Document (Esprit), International Labour Organisation (ILO), International Telecommunications Union (ITU Telecom), ca. 190,000 words in French and 178,000 words in BrE.
According to Salkie (2000: 156):

The INTERSECT corpus is modest in size. The texts are not annotated or SGML-tagged, their paragraph structure has been disrupted, and the quality of the translation varies. It does, however, have one important positive quality: it is easy to access and to use.

Two English corpora are used for comparative purposes. The Lancaster-Oslo/Bergen Corpus (LOB) has been compiled, computerised and word-tagged by research teams at Lancaster, Oslo and Bergen. It consists of 500 BrE texts of about 2000 words each, printed in 1961 and divided into 15 different genre categories.² It contains one million words. The Freiburg-LOB Corpus (FLOB) has been modelled on the LOB; it consists of one million words of BrE texts printed in 1991. An exhaustive comparison will be carried out of two text genres:

• Press (A, reportage; B, editorial; and C, reviews), ca. 176,000 words;
• Learned Prose (H, miscellaneous, mainly government documents; and J, learned & scientific writings), ca. 220,000 words.
Contrastive Analysis in INTERSECT: From French to English and Back Again

As indicated, the ParaConc bilingual parallel text concordance program (Barlow, 1995) has been used to analyse the corpus texts. The drawback of this parallel concordancer is alignment: each sentence has to constitute a separate paragraph, as each unit is delimited by a paragraph return. But this did not present a problem as I was using INTERSECT, which is aligned at sentence level.

The French texts were tagged with part-of-speech tags with Cordial 6 Universités (C6U). This high-performance software package is a grammar and spelling corrector and also a French language tagger/lemmatiser. It enabled me to tag the occurrences of the French subjunctive. But once the French texts were tagged they had to be realigned at sentence level following extensive editing, as I could not work on the texts as they were (one word and one tag per line). I needed to reformat the texts and check that the alignment was still correct and had been left intact by the tagging process.
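The reformatting step described above, from one word and one tag per line back to one sentence per unit, can be sketched in Python as follows. This is an illustration under explicit assumptions, not a description of the actual Cordial output or of the editing that was carried out: the tab-separated "token, tag" layout, the tag names (including VSUBJ for a subjunctive verb), the rule that a sentence ends at ".", "!" or "?" and the function names are all hypothetical.

```python
SENTENCE_END = {".", "!", "?"}   # assumed sentence-final tokens
SUBJUNCTIVE_TAG = "VSUBJ"        # hypothetical tag name for a subjunctive verb

def rebuild_sentences(tagged_lines):
    """Turn 'token<TAB>tag' lines into (sentence, contains_subjunctive) tuples."""
    sentences, tokens, has_subjunctive = [], [], False
    for line in tagged_lines:
        line = line.strip()
        if not line:
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        has_subjunctive = has_subjunctive or tag == SUBJUNCTIVE_TAG
        if token in SENTENCE_END:
            sentences.append((" ".join(tokens), has_subjunctive))
            tokens, has_subjunctive = [], False
    if tokens:  # trailing material without final punctuation
        sentences.append((" ".join(tokens), has_subjunctive))
    return sentences

def same_number_of_units(french_sentences, english_sentences):
    """Crude alignment check: both sides must have the same number of units."""
    return len(french_sentences) == len(english_sentences)

# First sentence from Example (16a); the second sentence and all tags are invented.
tagged = [
    "ce\tPRO", "qui\tPRO", "importe\tV", "c'\tPRO", "est\tV", "que\tCONJ",
    "la\tDET", "Pologne\tNOM", "soit\tVSUBJ", "catholique\tADJ", ".\tPONCT",
    "Il\tPRO", "accepte\tV", ".\tPONCT",
]
english = ["The important thing is that Poland should be Catholic.", "He accepts."]

french = rebuild_sentences(tagged)
print(french)
print(same_number_of_units(french, english))
```

Equal unit counts are only a necessary condition for intact alignment; a check of this kind would flag gross damage but not a pair of compensating errors, so manual verification remains necessary.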
From French to English

Bilingual analysis of the Press

All the occurrences of the French subjunctive in the Le Monde texts were retrieved and then manually edited to identify the relevant instances of mandative subjunctive. The translated constructions in English, equivalent to a mandative subjunctive, were then also retrieved. This last part consisted in viewing in one window of ParaConc the concordance lines containing the subjunctive in French, clicking on one of the lines to highlight it and then examining in the second (translation) window the corresponding concordance line containing the translation in English. The analysis of the French subjunctive in its mandative use showed that there was no one-to-one translation between the French and English mandative subjunctive. The different possible translations that I obtained in the specialised Press corpus are listed below and in Table 11.1:

• Mandative subjunctive (Example (15))
• Should (Example (16))
• Infinitive (Example (17))
• Modal [other than should] (Example (18))
• Other construction (Example (19))
• Indicative (Example (20))
• Past participle (Example (21))
(15) (a) Et si M. Cheney, secrétaire américain à la défense a fini par lui donner sa bénédiction, son ambassadeur à l’OTAN n’en avait pas moins exigé que la France regagne au préalable les structures de l’alliance. (Le Monde, 1992)
(b) And though US Defence Secretary Dick Cheney gave it his blessing the US ambassador to NATO demanded that France first join NATO’s integrated military command. (The Guardian Weekly, 1992)

(16) (a) ce qui importe c’est que la Pologne soit catholique. (Le Monde, 1993)
(b) The important thing is that Poland should be Catholic. (The Guardian Weekly, 1993)

(17) (a) En ce qui concerne la Transnistrie, il est important que la Moldavie renonce à ses vues irréalistes. (Le Monde, 1992)
(b) As far as the Transdnestr is concerned, it is important for Moldova to give up its unrealistic attitude. [. . .]. (The Guardian Weekly, 1993)

(18) (a) [. . .] et l’on souhaite que le prochain gouvernement de Bangkok mette de l’ordre parmi ses commandants régionaux [. . .]. (Le Monde, 1992)
(b) [. . .] and the expectation is that the next government in Bangkok will do something about its regional military commanders [. . .]. (The Guardian Weekly, 1992)

(19) (a) Eu égard à cette opposition du clergé, le président Laval demande que si possible on ne lui signifie pas de nouvelles exigences sur la question juive. (Le Monde, 1992)
(b) in view of unprecedented opposition from the church, [. . .] he would prefer, if possible, not to be subjected to further demands regarding the number of Jews to be deported. (The Guardian Weekly, 1992)

(20) (a) Mais je suis favorable à une structure élargie, quel que soit le résultat des scrutins, car nous sortons d’une situation conflictuelle grave, et il faut maintenant que chaque Angolais, au lendemain des élections, ait le sentiment d’avoir gagné. (Le Monde, 1992)
(b) But I’m in favour of a more broadly-based structure because we have just emerged from a serious civil conflict, and it is vital that every Angolan, after the vote, has the feeling of having won. (The Guardian Weekly, 1992)

(21) (a) [il] ne veut pas que soient mis en cause ses propres avantages sociaux [. . .]. (Le Monde, 1992)
(b) [he] does not want its own social advantages touched [. . .]. (The Guardian Weekly, 1992)
Table 11.1 Translation equivalents of the French mandative subjunctive in the Press subcorpus (Le Monde & The Guardian Weekly 1992, 1993)

French data                              English data
Subjunctive (total)      247
Mandative subjunctive    44 (100%)       Mandative subjunctive    6 (13.6%)
  Verbs                  35              Other                    38 (86.4%)
  Nouns                  6                 Should                 2 (4.5%)
  Adjectives             3                 Infinitive             15 (34.1%)
                                           Modal auxiliary        10 (22.8%)
                                           Other construction     4 (9.1%)
                                           Indicative             5 (11.4%)
                                           Past participle        2 (4.5%)
Table 11.1 shows that only 17.8% of the total number of French subjunctives (that is, 44/247) are in fact mandative subjunctives. In translation 13.6% of these are rendered by English mandative subjunctives, while 86.4% are rendered by other constructions. 4.5% of these translations are mandative should (that is, 2/44), 34.1% are infinitives, 22.8% are other modals, 9.1% are different constructions,³ 11.4% are indicatives and 4.5% are past participles. In other words, we have translation mismatches, that is to say a low degree of formal correspondence between translations and originals, as a mandative subjunctive is translated by a mandative construction in only 18.2% of the cases, including ‘should’ as well as the mandative subjunctive.

Regarding the different governing expressions triggering the mandative subjunctive in French, the findings show that falloir (il faut/fallait que) is the most frequent trigger (12 occurrences), followed by vouloir (six occurrences), souhaiter (five occurrences), exiger (four occurrences) and réclamer (two occurrences). The remaining verbs (demander, impliquer, importer, préconiser, proposer, supplier) occur only once. And apart from à condition, which occurs twice, all the other nouns (grand temps, objectif, souhait, voeu) and adjectives (important, indispensable, nécessaire) also only occur once.
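The percentages and trigger counts reported for Table 11.1 can be checked with a few lines of Python. The sketch below only re-derives shares from the raw counts given in the table and in the paragraph on governing expressions; the variable and function names are assumptions. Plain rounding to one decimal place yields 22.7% for modal auxiliaries where the table reports 22.8%; the reported figure may have been adjusted so that the column sums to exactly 100%, but that is a guess.

```python
def pct(part, whole):
    """Share of part in whole, as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

french_subjunctives = 247   # all subjunctives in the Le Monde texts
mandative = 44              # of which mandative subjunctives

# English translation equivalents of the 44 French mandative subjunctives.
english = {
    "Mandative subjunctive": 6,
    "Should": 2,
    "Infinitive": 15,
    "Modal auxiliary": 10,
    "Other construction": 4,
    "Indicative": 5,
    "Past participle": 2,
}

# Trigger counts: five named verbs plus six verbs occurring once each, and so on.
verb_triggers = 12 + 6 + 5 + 4 + 2 + 6 * 1
noun_triggers = 2 + 4 * 1
adjective_triggers = 3 * 1

shares = {name: pct(count, mandative) for name, count in english.items()}
mandative_share = pct(mandative, french_subjunctives)
formal_correspondence = pct(english["Mandative subjunctive"] + english["Should"], mandative)

print(mandative_share)         # share of mandative uses among all subjunctives
print(shares)                  # distribution of English equivalents
print(formal_correspondence)   # subjunctive or should, i.e. a mandative construction
```

The trigger counts add up to the 35 verbs, 6 nouns and 3 adjectives of the table, and the translation equivalents add up to 44.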
From English to French

Bilingual analysis of the Learned Prose category

I then worked in the other direction, from English to French, with English files that were not part-of-speech tagged. Within these technical limitations, the only way to retrieve the translations of mandative should, ‘genuine’ mandative subjunctives and non-distinctive forms was to search for trigger verbs, nouns and adjectives in INTERSECT using ParaConc, and then to manually edit the relevant instances of mandative constructions. I was thus able to retrieve the translated French constructions equivalent to a mandative construction in English, using the technique described above. Table 11.2 gives a list of the different possible translations (in French) of these mandative constructions in the specialised Learned Prose subcorpus.

Here, only 14.9% of the total number of occurrences of should are mandative should (that is, 124/832). In 71% of these cases, the translation is the French mandative subjunctive, and in only 29% other constructions: 16.1% of these translations are indicatives, 5.7% are infinitives, 3.2% are conditionals and nominalisations account for 1.6%. Translation by a totally different construction (zero correspondence) accounts for 2.4%. Regarding the translation of the English mandative subjunctive, the translations are divided roughly equally between the French mandative subjunctive and other constructions. The majority of the latter are infinitives, at 27.9% (that is, 12/43), while indicatives and nominalisations account for the remaining 23.3% (that is, 10/43). In three of the four occurrences reported, the non-distinctive form of the subjunctive is translated by a subjunctive; the other translation found is a present participle. Mandative subjunctive forms are translated by French subjunctives in 51% of the cases, and mandative constructions in English are translated in 65.9% of the cases by mandative constructions in French.
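The figures reported for Table 11.2 can be re-derived from its raw counts in the same way. The Python sketch below is an illustration only, with assumed variable and function names; it covers the three English source constructions and the pooled share of subjunctive forms (distinctive and non-distinctive together) rendered by a French subjunctive. Plain rounding gives 5.6% for infinitives as translations of mandative should, against the 5.7% reported in the table.

```python
def pct(part, whole):
    """Share of part in whole, as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

should_total = 832  # all occurrences of should in the Learned Prose subcorpus

# French translation equivalents, by English source construction.
mandative_should = {"Subjunctive": 88, "Indicative": 20, "Infinitive": 7,
                    "Conditional": 4, "Different construction": 3, "Nominalisation": 2}
mandative_subjunctive = {"Subjunctive": 21, "Infinitive": 12,
                         "Indicative": 8, "Nominalisation": 2}
non_distinctive = {"Subjunctive": 3, "Present participle": 1}

def subjunctive_share(equivalents):
    """Percentage of a construction's translations that are French subjunctives."""
    return pct(equivalents["Subjunctive"], sum(equivalents.values()))

should_mandative_share = pct(sum(mandative_should.values()), should_total)
remaining_share = pct(mandative_subjunctive["Indicative"]
                      + mandative_subjunctive["Nominalisation"],
                      sum(mandative_subjunctive.values()))
pooled_subjunctive_forms = pct(
    mandative_subjunctive["Subjunctive"] + non_distinctive["Subjunctive"],
    sum(mandative_subjunctive.values()) + sum(non_distinctive.values()),
)

print(should_mandative_share)                    # mandative should among all should
print(subjunctive_share(mandative_should))       # should -> French subjunctive
print(subjunctive_share(mandative_subjunctive))  # subjunctive -> French subjunctive
print(remaining_share)                           # indicatives plus nominalisations
print(pooled_subjunctive_forms)                  # distinctive and non-distinctive pooled
```

The contrast between the two source constructions is visible directly in the two subjunctive_share values: roughly seven in ten for mandative should against roughly one in two for the mandative subjunctive.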
To sum up, it seems that for the translation of mandative should, a French mandative subjunctive is the favoured option, whereas the translation of a mandative subjunctive form is equally divided between the subjunctive mood and other constructions.

Table 11.2 Translation equivalents of the mandative constructions in Learned Prose (Esprit, ILO, ITU in INTERSECT)

English data: original                   French data: transfers
SHOULD (total)           832
Mandative should         124 (100%)      Mandative subjunctive    88 (71.0%)
  Verbs                  98              Other                    36 (29%)
  Nouns                  8                 Indicative             20 (16.1%)
  Adjectives             18                Infinitive             7 (5.7%)
                                           Conditional            4 (3.2%)
                                           Others: different construction  3 (2.4%)
                                           Nominalisation         2 (1.6%)
Mandative subjunctive    43 (100%)       Subjunctive              21 (48.8%)
  Verbs                  32              Other                    22 (51.2%)
  Nouns                  2                 Infinitive             12 (27.9%)
  Adjectives             9                 Indicative             8 (18.6%)
                                           Nominalisation         2 (4.7%)
Non-distinctive form     4               Subjunctive              3
  Verbs                  4               Other                    1
  Nouns                  0                 Present participle     1
  Adjectives             0

Examples (22)–(25) provide illustrative data relating to Table 11.2. Examples (22) and (23) show two equivalents of the subjunctive form (a nominalisation in (22) and a subjunctive in (23)), whereas examples (24) and (25) show equivalents of mandative should (a subjunctive in (24) and an infinitive in (25)):

(22) (a) In addition, the Committee must note that, according to the information before it, the accusations brought against Mr. Achour, which led to the first judgement against him, related to a former period, and that the pamphlets in the possession of Mr. Ben Slimane only, according to the complainants, demanded that inquiries be carried out into the acts of violence in the university. (International Labour Organisation)
(b) Le comité ne peut également manquer d’observer que, selon des informations en sa possession, les faits reprochés à M. Achour dans sa première condamnation remontaient à une période ancienne et que les tracts en la possession de M. Ben Slimane ne faisaient, selon les plaignants, qu’exiger des enquêtes sur les violences exercées dans l’enceinte universitaire. (ILO)

(23) (a) The CNT also asks that a formal request be addressed to the Spanish Government to make available the inventory and valuation of the historical heritage confiscated from the CNT during the Spanish Civil War, as well as the inventory work already carried out on the accumulated trade union heritage. (International Labour Organisation)
(b) La CNT insiste également pour qu’il soit demandé officiellement au gouvernement de communiquer l’inventaire et le résultat de l’estimation du patrimoine historique, dont la CNT a été spoliée à l’occasion de la guerre civile espagnole, ainsi que les résultats déjà disponibles de l’inventaire du patrimoine syndical accumulé. (ILO)

(24) (a) [. . .] that it would also appear desirable that the signals transmitted by an international terminal exchange should not be affected by a higher degree of distortion than those recommended in Recommendations R.57 and R.58 [. . .]. (International Telecommunications Union)
(b) [. . .] qu’il paraît également désirable que les signaux transmis d’un bureau tête de ligne internationale ne soient pas affectés d’un degré de distorsion supérieur aux valeurs limites des Recommandations R.57 et R.58 [. . .]. (ITU)

(25) (a) It is desirable that when connection is established to the requested service the service-connected signal should be returned as quickly as possible. (International Telecommunications Union)
(b) Lorsque la connexion avec le service demandé est réalisée, il est souhaitable de retourner aussi rapidement que possible le signal de connexion au service. (ITU)
As for the type of governing expressions triggering mandative constructions in the chosen genre, there were several cases in which the same verb triggered several modals or subjunctives, so that the verb appeared only once and was not repeated in the text. Declare the view is the verb phrase that most frequently triggers a construction with should, with a total of 40 occurrences actually appearing in the text and a total of 73 instances when the cases of verbal ellipsis are added. Then we find recommend with 14/17 instances, request with 2/3 occurrences and agree, decide, intend, require, suggest with one occurrence each. Regarding the subjunctive, the most frequent trigger is request with 12 occurrences, followed by ensure with a total of five instances, urge with three occurrences, declare the view with 2/3 occurrences, ask, demand, propose, recommend, require, suggest, each with two occurrences and order with one. For the nouns and adjectives, the numbers of occurrences triggered are not as high as with the verbs: principle triggers eight constructions with the modal, desirable has eight instances, essential and necessary have three occurrences each, important has two, and advisable and preferable have one. As for constructions with the subjunctive, desirable and preferable trigger two occurrences, while essential, imperative, important, mandatory and necessary have one occurrence.

To summarise the findings of this section, the bilingual analysis has shown that in the Press texts, the French mandative subjunctive was mainly translated by a construction not containing should or the subjunctive: the translation equivalents were the subjunctive in only 13.6% of the cases, the other 86.4% were other constructions, of which the modal should only represented 4.5%. By contrast, in the Learned Prose texts, mandative should was in the majority of cases translated by a subjunctive mood (in 71% of the cases). The translations of the mandative subjunctive were divided almost equally between mandative subjunctives and other constructions such as infinitives, nominalisations and indicatives. Finally, the non-distinctive English forms were mainly translated by a French subjunctive. These results confirm Salkie’s observation (1996, 1997) that comparable linguistic categories rarely, if ever, show 100% correspondence in translations.
Hence, it appears here that the ‘fit’ between the actual translation equivalence and the presumed prototypical equivalents varies with languages and is not always what may be expected (Váradi & Kiss, 2001: 173). Similarly, according to Teubert (1996: 241), ‘total bidirectional correspondences are extremely rare phenomena’, and this is certainly confirmed by the comparison between mandative subjunctives in French and British mandative constructions.
Comparison of INTERSECT with Two Corpora of BrE: LOB and FLOB

English Mandative constructions in LOB and FLOB

Any conclusions about mandative constructions in BrE based only on translations are limited as it is unclear how far we can assume that translated texts are representative of ordinary language use, as they may differ from original texts because of the influence of the source language
(Gellerstam, 1996; Johansson, 1998). Therefore, when using translation or parallel corpora in contrastive studies, it is important to be aware of translation effects and to verify the results obtained on the basis of comparable corpora. The results obtained in LOB and FLOB will be used to check the accuracy of my findings from INTERSECT and the trends identified. The aim is to use these results as a reference point: ‘Observations based on a translation corpus need to be checked against a control corpus consisting of comparable original [Press category from FLOB] and translated texts in the same language’ (Johansson, 1998: 6, my emphasis). The original texts are taken from the Press category in FLOB, acting as the control corpus for the translated texts from the Press category of the INTERSECT corpus (see earlier section ‘From French to English Bilingual analysis of the Press’). A further comparison will be made between the original English texts in FLOB and in INTERSECT (as source texts), with respect to the Learned Prose category. The original texts in LOB (from the 1960s) will be examined in order to obtain findings that will be diachronically compared to the results from FLOB and INTERSECT (containing texts from the 1990s).

With respect to the analysis of the two corpora of BrE, LOB and FLOB, I have applied both a grammatical approach to corpus data and a corpus linguistics methodology, using computer tools and retrieval software. The analysis involved developing complex queries to retrieve only the relevant mandative instances of both the modal and the subjunctive, neither of which is part-of-speech tagged. I used Xkwic (Christ, 1994), which is part of the IMS Corpus Workbench and a Motif-based user interface to the Corpus Query Processor (CQP), on two totally comparable, grammatically tagged and computerised corpora of BrE.
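As an illustration only, the kind of trigger-plus-interval search described above can be approximated outside CQP with a regular expression. The sketch below is in Python and is not the Xkwic query used in the study: the abbreviated trigger list, the interval sizes (3 and 12 words) and the function name find_mandative_should are all assumptions made for the example, and it works on plain text rather than on grammatical tags.

```python
import re

# Hypothetical, abbreviated list of triggers (verbs and adjectives named in this
# chapter); a real query would need a fuller, manually checked list.
TRIGGERS = (
    r"(?:recommend|request|suggest|propos|demand|requir|urg|ask"
    r"|desirable|essential|necessary)\w*"
)

# A trigger, then 'that' within at most 3 words, then 'should' within at most
# 12 further words. Both intervals are assumptions, not the study's settings.
MANDATIVE_SHOULD = re.compile(
    rf"\b{TRIGGERS}\b(?:\W+\w+){{0,3}}?\W+that\b(?:\W+\w+){{0,12}}?\W+should\b",
    re.IGNORECASE,
)


def find_mandative_should(sentence: str) -> list[str]:
    """Return the matched spans running from the trigger to 'should'."""
    return [match.group(0) for match in MANDATIVE_SHOULD.finditer(sentence)]


example = (
    "In general, it is recommended that circuits operated at a particular "
    "modulation rate should not be routed over nominally lower rate VTF channels."
)
print(find_mandative_should(example))
```

A purely lexical pattern of this kind over-generates (it cannot tell mandative should from other uses), which is why manual checking of the concordance lines remains necessary.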
Therefore, I could run exactly the same retrieval queries on both corpora, providing fully comparable findings (see Serpollet (2001) for more detail about the methodology employed).

Comparison of translated texts (INTERSECT) with original English texts (LOB/FLOB)

The Press category is composed of originals (from the 1990s: FLOB and from the 1960s: LOB) and target texts in BrE (INTERSECT 1990s). Comparable corpora are defined here in Baker’s sense (1995: 234): two separate collections of texts in the same language A (BrE), one containing original texts in that language and the second containing translations from a source language B (French) into the language A. The collections each have the same length and cover the same genre(s). Hence, potential differences in the results could be due to the translation process.4 The Learned Prose category contains originals (of the same period as the source corpus: FLOB and of earlier years: LOB) and source texts in
BrE (INTERSECT). Here, any difference would be due to the texts themselves as no translation process is involved. This analysis will enable me to describe the evolution of mandative constructions from the 1960s (LOB) to the 1990s (FLOB) in two genres and to see if two corpora of modern BrE (FLOB and INTERSECT) show the same trend regarding this particular grammatical feature. I ran exactly the same retrieval queries on both BrE corpora (LOB and FLOB) and I applied (a posteriori) the same search criteria on the English texts of the INTERSECT corpus, both source and target.

Press
The Press category is composed of comparable English texts, that is originals from FLOB (and LOB) and translations from INTERSECT. As indicated, the possible differences in the results could be due to the translation process. To be able to compare the mandative constructions in FLOB Press with the ones in INTERSECT Press, I had to carry out a search on the relevant triggers in INTERSECT (because these texts were not grammatically tagged) to retrieve both mandative subjunctive and should (Table 11.3).

We note that the percentage of mandative should is greater in the original texts (LOB and FLOB) than in the translated texts, with respectively 79.5% and 70.4% compared to 44.4%. There are on the other hand more subjunctive forms in the translated texts, with 55.6%. Thus it would seem that in FLOB and in LOB (to a greater extent) the modal construction is clearly favoured whereas the mandative subjunctive is preferred in the translated texts of INTERSECT. Could the higher frequency of the subjunctive in the translations indicate some influence of French on the translations, that is translationese? It is not clear: out of 10 translated subjunctive forms in English, only five were subjunctives in the French original texts. And if we examine the eight original constructions translated by occurrences of should, we can see that in the original texts we had a construction of the type verb or noun + infinitive (ordonner de/préconise de/l’idée de) in four cases, two nominalisations and two subjunctives. We must also be cautious with any conclusions drawn, as the number of occurrences is very small. But we can tentatively note that the preference of one construction over another depends on the type of data: originals or translations, as can be seen in Figure 11.4.

In order to be able to compare more accurately the numbers of mandative constructions, I have normalised the frequencies per 100,000 words in Table 11.4 and Figure 11.5.
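The normalisation step can be made explicit with a short calculation. The following Python sketch uses the raw Press counts and the sample sizes reported in this chapter (176,000 words for LOB and FLOB Press, 114,000 words for the INTERSECT translations); the dictionary layout and the function name per_100k are illustrative choices, not part of the original analysis.

```python
# Raw counts of mandative should and of subjunctive forms (total) in the Press
# category, with the size of each sample in words, as reported in this chapter.
PRESS = {
    "LOB": {"should": 35, "subjunctive": 9, "words": 176_000},
    "FLOB": {"should": 19, "subjunctive": 8, "words": 176_000},
    "INTERSECT": {"should": 8, "subjunctive": 10, "words": 114_000},
}


def per_100k(count: int, corpus_words: int) -> float:
    """Normalise a raw count to a common base of 100,000 words."""
    return count / corpus_words * 100_000


for corpus, row in PRESS.items():
    should = per_100k(row["should"], row["words"])
    subjunctive = per_100k(row["subjunctive"], row["words"])
    print(
        f"{corpus}: should {should:.1f}, subjunctive {subjunctive:.1f}, "
        f"total {should + subjunctive:.1f} per 100,000 words"
    )
```

Run as written, the calculation gives 10.8 and 7.0 occurrences of should per 100,000 words for FLOB and INTERSECT respectively, which is why corpora of unequal size are better compared on normalised rather than raw counts.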
Table 11.3 Mandative constructions in the Press category (LOB, FLOB and INTERSECT [comparable corpora])

Mandative constructions                     LOB            FLOB           INTERSECT (translations)
                                            n      %       n      %       n      %
Should                                      35     79.5    19     70.4    8      44.4
Morphologically identifiable subjunctive    4      9.1     5      18.5    7      38.9
Non-distinctive forms                       5      11.4    3      11.1    3      16.7
Subjunctive forms (total)                   9      20.5    8      29.6    10     55.6
Total                                       44     100     27     100     18     100

Figure 11.4 Proportion of the types of English mandative constructions in the Press category in LOB, FLOB and INTERSECT [stacked bar chart; series: Should, Subjunctive, Non-distinctive]

Figure 11.5 gives a better idea of the numbers of mandative constructions relative to the respective size of each corpus. The percentage occurrences of should is greater in FLOB than in INTERSECT, but the normalised frequencies show some
similarity, with a relative frequency of 10.8 and 7 respectively. There are more instances of should in the original texts, but more subjunctive forms in the translated texts (almost twice as many in INTERSECT as in FLOB, with a relative frequency of 8.8 and 4.5 respectively). However, the total number of mandative constructions is almost identical in these two 1990s corpora, showing a decline compared to the frequency in LOB.

Other factors also need to be taken into consideration. For example, we need to be aware of the fact that the style of a particular translator could influence choices in the target texts. We do not know how many translators were involved in the translation of the texts from Le Monde. Moreover, the formal style of Le Monde could produce some general effect of translationese. According to Aarts (1998: ix) ‘[f]ull comparability can only be achieved in translation corpora’. However, he adds that this does not imply that such a corpus with perfect matches (attempted by the translators) between the source and target languages will be the perfect research tool for linguists who wish to compare languages. ‘An intrusive factor in such corpora is the translation activity itself, which may affect the texts of the target language’ (Aarts, 1998: ix). This may be due to the translation style of a particular translator or to translationese in general. One further piece of research to carry out would be to check, using comparable parts of the Guardian that are originals, if the same features and the same number of mandative constructions are found.

Table 11.4 English mandative constructions in the Press category (LOB, FLOB and INTERSECT [normalised frequencies per 100,000 words])

Mandative constructions                     LOB (176,000 words)    FLOB (176,000 words)   INTERSECT, translations (114,000 words)
                                            n      Normalised      n      Normalised      n      Normalised
Should                                      35     19.9            19     10.8            8      7.0
Morphologically identifiable subjunctive    4      2.3             5      2.8             7      6.2
Non-distinctive forms                       5      2.8             3      1.7             3      2.6
Subjunctive forms (total)                   9      5.1             8      4.5             10     8.8
Total                                       44     25.0            27     15.3            18     15.8

Figure 11.5 Number of each type of mandative construction in the Press category in LOB, FLOB and INTERSECT [stacked bar chart of normalised frequencies; series: Should, Subjunctive, Non-distinctive]

Learned Prose

In this section, INTERSECT Learned Prose (BrE originals) is compared with FLOB Learned Prose (BrE original texts) to check any differences due to the nature of the data themselves and not to the translation process. The results presented in Tables 11.5 and 11.6 differ from those in Table 11.2 because, for the purpose of comparison, some examples have
been suppressed. Table 11.2 provided an exhaustive set of results for the bilingual analysis of the Learned Prose category, both automatic and manual searches having been carried out that made it possible to retrieve all the occurrences of mandative constructions (124 occurrences of mandative should, 43 occurrences of mandative subjunctive and 4 non-distinctive forms). However, for comparison purposes between INTERSECT and LOB & FLOB, exactly the same limited search criteria (intervals and way of counting) in the three sets of data will be used in the automatic complex queries of the following type performed by the concordance programme Xkwic. Therefore, in this section, tables with restricted results will be provided, as examples falling outside the search criteria have been removed (there was only one example discarded in the Press genre, but 56 in Learned Prose).5 Thus, we will obtain only 68 occurrences of mandative should, 40 occurrences of mandative subjunctive and 4 non-distinctive forms (see Table 11.5).

Table 11.5 Mandative constructions in English Learned Prose (LOB, FLOB and INTERSECT)

Mandative constructions                     LOB            FLOB           INTERSECT (source texts)
                                            n      %       n      %       n      %
Should                                      43     87.8    28     59.6    68     60.7
Morphologically identifiable subjunctive    1      2.0     13     27.6    40     35.7
Non-distinctive forms                       5      10.2    6      12.8    4      3.6
Subjunctive forms (total)                   6      12.2    19     40.4    44     39.3
Total                                       49     100     47     100     112    100

The ‘Learned Prose’ texts consist of government documents and scientific writing, which is a very particular and specialised genre. It is notable that the percentage of mandative subjunctive increases from 12.2% in LOB to 40.4% in FLOB, but that nonetheless, FLOB still has a preference for the modal construction (59.6%). We can also note that the two 1990s corpora show a similar distribution of the types of mandative constructions, and although both corpora contain around 40% of subjunctive forms, the proportion of morphologically identifiable subjunctives is higher in INTERSECT (35.7%) than in FLOB (27.6%). This proportion is much higher in the two 1990s corpora than in LOB.

Figure 11.6 Proportion of the types of mandative constructions in the Learned Prose category in LOB, FLOB and INTERSECT [stacked bar chart; series: Should, Subjunctive, Non-distinctive]

The overall preference for mandative should is probably the result of the nature of the Learned Prose category in the bilingual corpus: half of
this section is composed of CRATER, a corpus derived from International Telecommunications Union documents and consisting of a highly specialised, very technical and legal type of language. So, for instance, if we take into account the respective size of the Learned Prose categories and normalise the number of occurrences of mandative should per million words in each of the samples, we obtain 195.5 mandative uses of the modal per million words in LOB, 127.3 in FLOB and 382 in INTERSECT, using the same limited search criteria in the three corpora (there were 640.4 mandative should per million words in INTERSECT retrieved when using an exhaustive search). Example (26) illustrates the specialised language used:

(26) (a) In general, it is recommended that circuits operated at a particular modulation rate should not be routed over nominally lower rate VTF channels, whenever this can be avoided. (International Telecommunications Union)
(b) En général, il est recommandé que les circuits exploités à une rapidité de modulation déterminée ne soient pas acheminés sur des voies de télégraphie harmonique d’une rapidité nominale inférieure, chaque fois que cela peut être évité. (ITU)

Table 11.6 shows the frequencies in Table 11.5 normalised per 100,000 words.

Table 11.6 Mandative constructions in English Learned Prose (LOB, FLOB and INTERSECT [normalised frequencies per 100,000 words])

Mandative constructions                     LOB (220,000 words)    FLOB (220,000 words)   INTERSECT, source texts (178,000 words)
                                            n      Normalised      n      Normalised      n      Normalised
Should                                      43     19.5            28     12.7            68     38.2
Morphologically identifiable subjunctive    1      0.4             13     5.9             40     22.5
Non-distinctive forms                       5      2.3             6      2.7             4      2.2
Subjunctive forms (total)                   6      2.7             19     8.6             44     24.7
Total                                       49     22.2            47     21.3            112    62.9

Figure 11.7 Distribution of each type of mandative construction in the English Learned Prose category (LOB, FLOB and INTERSECT) [stacked bar chart of normalised frequencies; series: Should, Subjunctive, Non-distinctive]

Table 11.6 and Figure 11.7 highlight the high number of occurrences of mandative should in INTERSECT, three times the number of modals in FLOB. Regarding the subjunctives, these occur three times more frequently in FLOB (8.6 per 100,000 words) than in LOB (2.7 per 100,000 words), and then increase again almost threefold from FLOB (8.6 per 100,000 words) to INTERSECT (24.7 per 100,000 words). Although we had noticed a similarity in
the distribution of mandative constructions in the two 1990s corpora (roughly 60% should and 40% subjunctives), we note here a difference in the number of occurrences of each type of construction in the different corpora. As the corpus materials are all original texts, the difference in frequency must be due to the different characteristics of the data themselves.

Throughout the analysis above, I have shown that comparing mandative constructions in texts in two languages may provide new insights into translation and cross-linguistic relations. It can also ‘provide a new perspective on the languages concerned’ (Johansson, 1998: 15).

    Translation presupposes an interpretation of the source text, and the task of the translator is to find an adequate expression in the target language. The translation may be viewed as a mirror reflecting the source text. Studying the mirror image may lead to new discoveries. Things which were not noticed before, or which were not seen clearly, may be revealed (Johansson, 1998: 15).

Perhaps, to extend the metaphor, we may want to acknowledge that the mirror of translation might contain some distorting factors resulting from the size of the corpus and the lack of precise comparability of texts. These factors require us to be cautious and observant in interpreting the quantitative results of the analyses.
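The contrast drawn above, between a similar distribution and a different number of occurrences, can be checked with a few lines of arithmetic. This Python sketch uses the restricted Learned Prose counts and sample sizes reported in the chapter (220,000 words for FLOB, 178,000 for INTERSECT); the helper names share and rate_per_100k are illustrative assumptions.

```python
# Restricted Learned Prose counts (should, subjunctive forms in total) and
# sample sizes in words, as reported in this chapter.
LEARNED_PROSE = {
    "FLOB": {"should": 28, "subjunctive": 19, "words": 220_000},
    "INTERSECT": {"should": 68, "subjunctive": 44, "words": 178_000},
}


def share(part: int, whole: int) -> float:
    """Percentage of all mandative constructions accounted for by one type."""
    return part / whole * 100


def rate_per_100k(count: int, corpus_words: int) -> float:
    """Occurrences per 100,000 words of running text."""
    return count / corpus_words * 100_000


for corpus, row in LEARNED_PROSE.items():
    total = row["should"] + row["subjunctive"]
    print(
        f"{corpus}: should is {share(row['should'], total):.1f}% of mandative "
        f"constructions, {rate_per_100k(row['should'], row['words']):.1f} per 100,000 words"
    )

# How much more frequent is mandative should in INTERSECT than in FLOB?
ratio = rate_per_100k(68, 178_000) / rate_per_100k(28, 220_000)
print(f"INTERSECT/FLOB rate ratio for should: {ratio:.1f}")
```

The shares come out almost identical (about 60% in both corpora), while the rate in INTERSECT is roughly three times that in FLOB: the two measures answer different questions and should be reported side by side.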
Conclusion

I hope to have shown that, in spite of the limitations of this exploratory analysis, a corpus linguistics methodology is useful, and dare I say, necessary to contrastive analysis. I also think that this chapter has presented an interesting approach to grammar, using corpus data, and has provided translators with a grammatical approach to the English and French languages. I would add that, not only can Translation Studies and Contrastive Analysis benefit from corpus linguistics, but linguistics itself may also benefit from the field of Translation Studies. Possible applications of this corpus-based contrastive analysis include:

• to provide translators with a set of translation units or equivalent text units in different languages and hence with a new awareness of recurrent problematic translations of the mandative constructions within two particular genres; and
• as a result, to enable translators to improve their final product, that is, the target text, by providing them with a clearer notion (based on this corpus study) of how a particular grammatical concept may be translated in a specific target language.
Acknowledgements

The research reported here was financially supported by an award from the Economic and Social Research Council (UK). I would like to express my thanks to Raphael Salkie, who kindly provided me with the INTERSECT corpus.

Notes

1. Emphasis added is mine.
2. A reportage, B editorial, C reviews [88 texts; 176,000 words]; K general fiction, L mystery & detective fiction, M science fiction, N adventure & western fiction, P romance & love story, R humour [126 texts; 252,000 words]; D religion, E skills, trades & hobbies, F popular lore, G Belles Lettres, biography, essays [176 texts; 352,000 words]; H miscellaneous, mainly government documents, J learned & scientific writings [110 texts; 220,000 words].
3. These are ‘zero correspondence’, that is cases in which the English target text does not contain any form that can specifically be related to the French subjunctive mood.
4. It should also be admitted that the small size of the corpora could lead to random effects due to sampling.
5. When a triggering expression was followed by more than one that-clause, only the first clause was included in the final results, as the Xkwic search stops at the first clause and does not account for any following that-clause triggered by the same expression.
References

Aarts, J. (1998) Introduction. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-Linguistic Research: Theory, Method, and Case Studies (pp. ix-xiv). Amsterdam & Atlanta, GA: Rodopi.
Altenberg, B. (1999) Adverbial connectors in English and Swedish: Semantic and lexical correspondences. In H. Hasselgård and S. Oksefjell (eds) Out of Corpora: Studies in Honour of Stig Johansson (pp. 249-268). Amsterdam & Atlanta, GA: Rodopi.
Altenberg, B. and Aijmer, K. (2000) The English-Swedish Parallel Corpus: A resource for contrastive research and translation studies. In C. Mair and M. Hundt (eds) Corpus Linguistics and Linguistic Theory (ICAME 20) (pp. 15-33). Amsterdam & Atlanta, GA: Rodopi.
Asahara, K. (1994) English present subjunctive in subordinate that-clauses. Kasumigaoka Review 1, 1-30.
Baker, M. (1993) Corpus linguistics and translation studies: Implications and applications. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology: In Honour of John Sinclair (pp. 233-250). Amsterdam & Philadelphia: John Benjamins.
Baker, M. (1995) Corpora in translation studies: An overview and some suggestions for future research. Target 7 (2), 223-243.
Barlow, M. (1995) ParaConc: A concordancer for parallel texts. Computer and Text 10, 14-16.
Christ, O. (1994) A modular and flexible architecture for an integrated corpus query system. In Proceedings of COMPLEX ’94: 3rd Conference on Computational Lexicography and Text Research (Budapest, July 7-10 1994) (pp. 23-32). Budapest, Hungary.
Eco, U. (1976) A Theory of Semiotics. Bloomington & London: Indiana University Press.
Gellerstam, M. (1996) Translation as a source for cross-linguistic studies. In K. Aijmer, B. Altenberg and M. Johansson (eds) Languages in Contrast: Papers from a Symposium on Text-based Cross-linguistic Studies (pp. 53-62). Lund: Lund University Press.
Holmes, J.S. (1988) Translated! Papers on Literary Translation and Translation Studies. Amsterdam: Rodopi.
Johansson, M. (1996) Contrastive data as a resource in the study of English clefts. In K. Aijmer, B. Altenberg and M. Johansson (eds) Languages in Contrast: Papers from a Symposium on Text-based Cross-linguistic Studies (pp. 127-150). Lund: Lund University Press.
Johansson, S. (1998) On the role of corpora in cross-linguistic research. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-Linguistic Research: Theory, Method, and Case Studies (pp. 3-24). Amsterdam & Atlanta, GA: Rodopi.
Johansson, S. and Hofland, K. (1994) Towards an English-Norwegian parallel corpus. In U. Fries, G. Tottie and P. Schneider (eds) Creating and Using English Language Corpora: Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zurich 1993 (pp. 25-37). Amsterdam & Atlanta, GA: Rodopi.
Leech, G.N. (1992) Corpora and theories of linguistic performance. In J. Svartvik (ed.) Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82 (pp. 105-122). Berlin & New York: Mouton.
Malmkjær, K. (1999) Contrastive Linguistics and Translation Studies: Interface and Differences. Utrecht: Platform Vertalen & Vertaalwetenschap.
McEnery, T. and Oakes, M. (1996) Sentence and word alignment in the CRATER Project. In J. Thomas and M. Short (eds) Using Corpora for Language Research (pp. 211-231). London & New York: Longman.
Quirk, R., Greenbaum, S., Leech, G.N. and Svartvik, J. (1985) A Comprehensive Grammar of the English Language. London: Longman.
Salkie, R. (1995) INTERSECT: A parallel corpus project at Brighton University. Computer and Texts 9, 4-5.
Salkie, R. (1996) Modality in English and French: A corpus-based approach. Language Sciences 18 (1-2), 381-392.
Salkie, R. (1997) Naturalness and contrastive linguistics. In B. Lewandowska-Tomaszczyk and P.J. Melia (eds) PALC’97: Practical Applications in Language Corpora (pp. 297-312). Łódź: Łódź University Press.
Salkie, R. (2000) Unlocking the power of SMEMUC. In S.P. Botley, A.M. McEnery and A. Wilson (eds) Multilingual Corpora in Teaching and Research (pp. 148-156). Amsterdam & Atlanta, GA: Rodopi.
Serpollet, N. (2001) The mandative subjunctive in British English seems to be alive and kicking ... Is this due to the influence of American English? In P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds) Proceedings of the Corpus Linguistics 2001 Conference (Lancaster University, 30 March-2 April 2001) (pp. 531-542). Lancaster: UCREL (Unit for Computer Research on the English Language: technical papers, Volume 13, Special issue).
Serpollet, N. (2003) Should and the subjunctive: A corpus-based approach of mandative constructions in English and in French. PhD thesis, Lancaster University.
Teubert, W. (1996) Comparable or parallel corpora? International Journal of Lexicography 9 (3), 238-264.
Tymoczko, M. (1998) Computerized corpora and the future of translation studies. Meta 43 (4), 652-660.
Ulrych, M. (1997) The impact of multilingual parallel concordancing on translation. In B. Lewandowska-Tomaszczyk and P.J. Melia (eds) PALC’97: Practical Applications in Language Corpora (pp. 421-435). Łódź: Łódź University Press.
Váradi, T. and Kiss, G. (2001) Equivalence and non-equivalence in parallel corpora. International Journal of Corpus Linguistics 6 (special issue), 167-177.
Chapter 12
Perfect Mismatches: ‘Result’ in English and Portuguese
DIANA SANTOS
Introduction

In this chapter I describe a corpus-based study designed to test the claim that the meaning of ‘result’ associated with the English present perfect (according to some authors, the main meaning) is not operational in Portuguese. In order to do this, I start by briefly summarising some of the claims about the present perfect in English and several plausible Portuguese counterparts, proceeding then to describe the empirical study conducted and the conclusions reached.

The study is informed by at least three general assumptions about language, previously presented in Santos (1996, 1997, 2000, 2004). The first assumption is that contrastive studies are the best way to access semantic data in an objective way. The second is that a relativist approach is the only methodologically sound way to perform contrastive studies, that is: one must describe each language system in its own right, before comparing the two in order to illuminate both the translation task and the semantics of each particular language. This standpoint implies that linguistic categories, realised by grammatical (morphosyntactic) categories, are language specific, and that although related languages (such as English and Portuguese) are bound to reflect rather similar concepts, the degree of similarity or difference has to be established empirically rather than postulated a priori.

Another important assumption behind the present study is that categories contain several ‘components’1 of meaning. There is thus no point in arguing for a unique meaning, or primary role, given that grammatical categories have to serve many purposes. This is related to the view in Santos (1997) that vagueness is an essential property of language. Furthermore, owing to the dynamic nature of language, categories take on new roles and drop old ones. Given that a linguistic device such as a tense form applies to a wide spectrum of different verbs or verb phrases with different properties, behaviour and therefore requirements in terms of subcategorisation and so on, it will, with time, develop different uses for different verbs, as is often the case concerning the distinction between events and states.
The source of the empirical data discussed in this chapter is COMPARA (Frankenberg-Garcia & Santos, 2003), a sizeable manually edited parallel corpus of English-Portuguese and Portuguese-English translations of published fiction, publicly available at http://www.linguateca.pt/COMPARA/. Table 12.1 gives a quantitative snapshot of the version of COMPARA used, containing 38,099 alignment units in the Portuguese-English direction, and 36,937 in the English-Portuguese direction. All examples in this chapter are thus authentic.
A Look at the Present Perfect

The English present perfect is well known as a difficult subject to teach Portuguese students (Casanova, 1985). However, it is a feature of English grammar that has been extensively studied and which is often used to demonstrate the import of result in tense and aspect (see, for example, Comrie, 1976; Dahl, 1985; McCoard, 1978; Moens, 1987; Sandström, 1993). A previous corpus-based study of the translation to and from Portuguese of this tense form was marred by too few examples (see Santos, 1996: 449-471; 2004: 119-123). Nevertheless, the study revealed one of the cases where the two aspectual systems differ most. After all, there is no aspectual class distinction in Portuguese that hinges on a difference of interpretation between perfect and non-perfect, one of Vendler’s (1967) criterial tests for English. Figure 12.1 illustrates the schemas assigned to the two languages as far as the conceptualisation of events is concerned. The schemas in Figure 12.1 give rise to what I have called here the ‘perfect mismatch’: as there is no grammatical device that conveys result in Portuguese, there should be no trace left in the translation when the English present perfect performs this function.

Following the works mentioned above, especially Comrie (1976) and McCoard (1978), it is clear that there are other ‘components’ of meaning attributed to the English perfect. Although these are generally presented with different emphases in order to better suit competing theories, I prefer to consider them as together making up the meaning of a single linguistic category, as they can all occur together, and are not logically incompatible, as will be described below. The examples in Table 12.2 illustrate respectively: the purely temporal concept of ‘extended now’, present relevance, the existential meaning (or experiential perfect) and ‘hot news’ (or recency).
Table 12.1 COMPARA contents v.6.1*

              Source texts   Translations   Words in source texts   Words in translations   Words in source texts & translations
Portuguese    33             34             584,121                 599,845                 1,183,966
English       22             24             525,466                 571,181                 1,096,647
Totals        55             58             1,109,587               1,171,026               2,280,613

*34 translations of the 33 Portuguese originals, and 24 translations of the 22 English originals

Figure 12.1 Aspectual classes covering events in English and Portuguese [diagram; node labels under ENGLISH and PORTUGUESE: events, result, non-result, extended, punctual (= change), accomplishment, achievements, activities, obras, mudança]

It is notable that, although apparently acknowledging roughly the same set of uses and even citing the same examples, each scholar groups the meanings in a different way.2 A more objective way of clustering uses of the perfect, such as can be done by using translation evidence, will be attempted here. Of course, care must be exercised to take into account all
sorts of contrastive factors, and the fact that translation is a way of looking at a source text through target-language eyes, meaning that the translator recognises (and conveys) most distinctly the particular meaning(s) s/he is familiar with. This requires a Portuguese-grounded view of the English perfect meanings to be presented, so that a reader may assess what can be recognised or shown by translation into Portuguese.
Table 12.2 Examples of English present perfect meanings

| Source text | Target text |
|---|---|
| ‘Well anyway, how have you been?’ | ‘Bem, tirando isso, como tens passado?’ |
| This profession in America has constantly been held in honour, and more successfully than elsewhere has put forward a claim to the epithet of ‘liberal’ | Na América, esta profissão é tida em grande conta e, com mais êxito do que em qualquer outro país, tem reivindicado o epíteto de ‘liberal’. |
| ‘That’s right,’ said the Queen, patting her on the head, which Alice didn’t like at all, ‘though, when you say “garden” I’ve seen gardens, compared with which this would be a wilderness’ | ‘Muito bem’, disse a Rainha, dando-lhe uma palmadinha na cabeça, de que Alice não gostou nada, ‘se bem que jardim, como tu dizes... eu tenho visto jardins e este comparado com esses não passa de um deserto’. |
| The senior conductor has just announced that we are approaching Rugby. | O motorista do comboio acabou de anunciar que estamos a chegar a Rugby. |
The ‘extended now’, that is, a period that encompasses past and present, has two ways of being realised in Portuguese, namely different tenses: Presente and the Portuguese Pretérito Perfeito Composto. Furthermore, there are several other ways to convey extended periods that share some resemblances with the English perfect, as we shall see. As to the ‘relevance’ and ‘hot news’ ‘components’ of meaning, however, these are, from the point of view of Portuguese, hard to differentiate from an emphasis on result; present relevance just seems a way to convey the idea that some result is still valid, even when ‘result’ is pragmatically inferred from the context. Similarly, the inference of ‘hot news’ can be explained as a pragmatic device associated with the result meaning: by grammatically marking the fact that some new information produces a relevant result now, the meaning conveyed is that of ‘hot news’. Conversely, the more recent an event, the more probable it is that its result is relevant now.

The existential meaning, on the other hand, appears at first to be altogether different and related to quantification. It is employed in those cases where asserting mere existence is meaningful; it is sometimes emphatically uttered. However, its use can again be easily subsumed under the relevance interpretation: the present perfect is used to convey mere existence in some indefinite time because this existence is relevant to some discourse purpose. A different way of relating result to relevance is to argue that by adding a result to an event description, one is conveying that it is relevant for the present. Others claim that this is exactly what happens when the event is placed in an ‘extended now’. In this connection, it is interesting to note that two similar uses, that is, marking relevance to the present and conveying an existential assertion, have been attributed in Portuguese to the aspectual marker já associated with an event description in the Portuguese Perfeito.
This is, however, at odds with the available translation evidence, as will be shown. Elsness (1991) explains the opposition of the past simple and the present perfect with the fact that the former is the narrative tense in English, also stressing its deictic character. However, the Perfeito in Portuguese can have both deictic and existential uses, and this may be another explanation of why the two past tenses in English both translate into the Perfeito. A more abstract cross-linguistic notion of tense, mood and aspect (TMA) categories is presented by Dahl (1985). His study assigns considerable weight to the (cross-linguistic) perfect category, distinguishing between two perfect uses, ‘perfect of result’ and ‘perfect of persistent situation’ (exemplified in English by the present perfect and the present perfect progressive respectively), and ‘perfect related categories’ such as the experiential, the pluperfect and the quotative. The results of the study indicate that Portuguese appears to lack any perfect in Dahl’s sense. However, there are several problems with Dahl’s account of Portuguese (Santos, 1996): on the one hand, he groups present perfect and pluperfect together; on the other hand, he does not study the já marker.
Perfect Meanings Through Portuguese Eyes

In the previous section it was claimed that the result/relevance/hot news meaning cluster of the English perfect is alien to Portuguese. Portuguese, not having a grammatically realised concept of result, would not mark such concepts. On the other hand, we will see that the ‘extended now’ concept is very relevant in Portuguese, but the tense in question has different requirements and consequences: in particular, it is deeply connected to indefinite repetition, something also expressed by several other TMA devices in Portuguese. Furthermore, the stative realm or the way states are conceptualised in Portuguese is much more fine-grained than in English, which is shown in the multitude of state-like competing grammatical devices which is presented in Figure 12.2. I have argued elsewhere, in connection, for example, with the Portuguese Imperfeito and the ser/estar distinction, that a fundamental distinction lying at the heart of Portuguese grammar (and embodied world view) is the separation between permanent (essential) and temporary (accidental) states. There are also several aspectualisers that convey a whole shade of situations between these two kinds of states. Figure 12.2 reflects this proposal about states in Portuguese. A description of relevant or related grammatical devices in Portuguese is in order, before we proceed.
Figure 12.2 Several different categories of states in Portuguese [figure not reproduced; categories and the devices listed under each:]
atemporal states: Presente, Imperfeito
dispositional states: Presente, Imperfeito, costumar
habitual states: Presente, Imperfeito, costumar, andar
punctual temporary states: estar
repeated temporary states: PPC prog ir gerúndio, andar
extended temporary states: PPC
The Portuguese Pretérito Perfeito Composto (PPC)

This is a tense that has received much attention from Portuguese scholars, because it displays a tendency contrary to other Romance languages in that it is never interchangeable with the Perfeito. What is notable about it in connection with the present study is that the PPC is temporally related to a period covering past and present, as is also claimed for the English present perfect, an interval which has no other equivalent in time if we try to change the co-ordinates (that is, there is no past or future version of PPC). In addition, when used with event predicates, it always signals (not simply entails) indefinite repetition (iteration):

What makes the PPC especially expressive and gives it a unique and unmistakable place in the Romance language landscape is [. . .] its ability to express duration and iteration without adverbial expression (although this can also be added). (Boléo, 1936: 8, my translation)

In this, it is very different from English, where the present perfect may encourage such an inference (or may coerce the interpretation), but does not express it (cf. for example, McCoard, 1978). With an eventive predicate the PPC always expresses repetition (an indefinite number of times). Campos (1987) suggests that the PPC is a tense even more present than the simple present (which is often habitual or even dispositional). Taking a different stance, Peres (1993) stresses the meaning of temporal anteriority associated with the past participle, assigning an avowedly ad hoc iterative role to the ter auxiliary in order to provide an analysis of the PPC.

Other ways to refer to an extended now

In Portuguese there are other conventional ways to present an ‘extended now’, using adverbials such as desde (standard translation: ‘since’) or há (either translated by ‘for’ or by ‘ago’) which require the Portuguese Presente (the simple present tense), as illustrated in Table 12.3.
These two different ways to convey states in Portuguese in an extended now period support the claim regarding the more fine-grained conceptualisation of states in Portuguese, which can be illustrated by the minimal pair Estou doente desde quinta (Presente) and Tenho estado doente desde quinta (PPC), both translated by I’ve been ill since Thursday, but with the second sentence expressing a temporariness that is absent from the first.

Portuguese aspectualisers

There are also several auxiliaries in Portuguese whose meaning is imprecise repetition, covering a diffusely extended period. When used in the present tense, they could be said to convey an extended now, which, however, situates ‘now’ in the middle and not at the end of the interval.
Table 12.3 Ways of presenting an ‘extended now’ in Portuguese with Presente

| Source text | Target text |
|---|---|
| Ela está morando aqui . . . desde que o senhor foi embora. | ‘She’s living here. . . since you left’ |
| ‘Mesmo comigo, que não penso em outras mulheres desde que conheci a filha do comerciante’. | ‘Even me I haven’t thought of other women since I met the merchant’s daughter.’ |
| Teu irmão Manuel, desde que fugiu para Espanha, absorve-me todas as economias. | Your brother Manuel, since he ran off to Spain, has been using up all my savings. |
| Era a carta, a carta costumada, a carta necessária, a carta que desde a velha Antiguidade a mulher sempre escreve; começava por ‘Meu idolatrado’ ( . . . ) | It was the usual letter, the inevitable letter, the letter that women have written ever since Antiquity: it began ‘My idol’ ( . . . ) |
| Não do tio Maximino de agora (há mais de vinte anos a gente não se vê), (. . .) | Not Uncle Maximino as he was now (we haven’t seen one another in more than twenty years), |
These aspectualisers work in a way similar to (but are considerably more fine-grained than) the English progressive, although they obey different constraints and are used with different lexical items. In other words, while the superordinate meaning can be expressed in a similar way, the aspectualisers apply to different cases, crucially connected with a lack of difference between accomplishments and activities in Portuguese. Examples are andar, ir and estar, literally ‘walk’, ‘go’ and ‘be’, all presupposing a larger interval where some (repetitive or durative) behaviour can be asserted, as shown in Table 12.4. Finally, acabar is a very interesting aspectualiser for several reasons. On the one hand, it is standardly translated by English ‘just’; on the other hand, it is vague between two different meanings, best explained by the two different English translations of a sentence such as Ele acabou de pôr a mesa: ‘he has finished laying the table’, ‘he has just laid the table’. Acabar is often mentioned as one of the renderings of the English present perfect with ‘just’. However, it is fully operational in all tenses in Portuguese, as ‘just’ is in English.
Table 12.4 Portuguese aspectualisers conveying an ‘extended now’

| Source text | Target text |
|---|---|
| Esse deve pensar que o dinheiro lhe dá o direito de conservar a mulher, apesar de ela andar a dar com a cabeça nas paredes. | He must suppose that money gives him the right to keep his wife in spite of the fact that she’s screwing everybody in sight. |
| Na certa, a vizinhança anda falando de nós. | The neighbours must be talking. |
| Você anda fazendo alguma bobagem? | Have you been up to something dumb? |
| Barulhosos, os colonos foram chegando. | The settlers began their noisy arrival. |
| Frase a frase, quilómetro a quilómetro, o cilindro da máquina de escrever vai rolando pacientemente, pedalada após pedalada, por caminhos, e estradas, curvas de nível, charcos. | Sentence by sentence, kilometre by kilometre, the typewriter ploughed patiently in the wake of revolving bicycle-wheels, on the road, by-roads, uphill, downhill, through the puddles. |
| E assim foi nascendo no meu coração, pudicamente, uma paixão pela Vicência. | And thus there grew up in my heart a chaste passion for Vicência. |
| Depois do almoço, que acabou às duas horas, estive arranjando uns papéis. | ‘After lunch, which lasted until two o’clock, I was working on some papers’. |
| Também me está a apetecer uma coisa assim. | ‘Frankly I wouldn’t mind doing the same myself.’ |

The já marker

Já is a very frequent adverbial in Portuguese which can be described as a recency marker or a temporal closeness marker, as it can be used in both directions3 (future and past) and/or as a relational adverb (proximity) between two event descriptions (a discourse connector4). Other meanings associated with já are mere existence (an existential claim), the implication that something was planned and the marking of perfectivity (not to be confused with the perfect aspect), something seen as completed, as shown in the three examples in Table 12.5. It should be noted that já is somehow related to English ‘already’, which has also been reported to co-occur often with the perfect. We will see below, however, that já is far from being directly translatable by ‘already’. In fact, it seems likely that the two uses of ‘already’ in Table 12.5 are examples of translationese. Interestingly, já may also co-occur (though very seldom) with the Portuguese PPC, adding the connotation of rarity: Já tenho feito, já tem acontecido (‘there have been times I did this’, ‘it has happened that . . .’). According to Campos (1984), já neutralises some of the distinction between Perfeito and PPC, and simply marks the fact that the situation is not current, that is, it does not extend until now. Campos (1985: 127) also notes that já and acabar have a complementary distribution.

Table 12.5 Uses of já in Portuguese

| Source text | Target text |
|---|---|
| Quem já teve um namoro, por menos sério que seja, e que levou um logro destes; (. . . ) | Anyone who has ever been in love, no matter how slightly, and has had to undergo such a disappointment; ( . . . ) |
| Tem de ser hoje, já recebi ordem, disse o carcereiro quando lhe contou o plano para a fuga, ( . . . ) | It must be today, I have already received orders, said the gaoler, when he told him the plan for his escape (. . . ) |
| E, quanto à minha avó, que já morreu, essa mudou-se para a nossa quinta, ( . . . ) | And as far as my grandmother, who is already dead, is concerned she moved to our country ( . . . ) |

Summing up

We have reviewed a number of descriptions of disparate grammatical devices in the two languages, which are of relevance to the present study of ‘result’, mainly based on monolingual studies and on the intuition of the scholars concerned. In the rest of the current chapter, two things will be attempted:

• to explore what contrastive (and authentic) data can add to the understanding of these features; and
• to test the hypothesis that result is not grammatically relevant in Portuguese, based on a number of predictions tested against the available corpus evidence.
Two kinds of predictions can be made: those related to translation behaviour and those related to monolingual properties. Starting with translation, the hypothesis will be confirmed if the following situations emerge, given sufficient data:

(1) almost no distinction in translation into Portuguese between other present perfect uses and the past simple (except possibly when adverbial markers enforce different behaviour) will emerge, although a good overlap of the Portuguese PPC and the English present perfect of the ‘extended now’ kind should occur in both translation directions when dealing with stative predicates;
(2) an English-only rationale in the choice of present perfect or past simple when translating from Portuguese, because English needs to attend to distinctions that Portuguese does not specify;
(3) systematic loss or addition in translation, and/or compensation through the use of different aspectual classes, as Portuguese has no similar category to the English present perfect.

The second kind of prediction, relating to monolingual properties, will concern the lexical field of cause and result in the two languages:

(4) English will be superior in lexical richness, that is, in the number of distinct items which are related to and/or convey result;
(5) English will also display a significantly higher frequency of lexical items occurring in running text.
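Prediction 5 amounts to comparing a normalised frequency across two sets of texts. A minimal sketch of such a comparison in Python follows. The word list, the two sample sentences and the function name `relative_frequency` are illustrative assumptions only; none of them comes from COMPARA or from the study itself.

```python
import re
from collections import Counter


def relative_frequency(text, lexical_items, per=10_000):
    """Return occurrences of the listed items per `per` running words."""
    # \w+ is Unicode-aware for str patterns, so accented words stay whole.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[item] for item in lexical_items)
    return hits * per / len(tokens)


# Hypothetical result-related word list and invented toy sentences.
result_words = {"resultado", "consequência", "efeito"}
original_pt = "O resultado foi bom e o efeito durou pouco."
translated_pt = "O resultado da decisão teve como consequência um efeito imediato."

freq_original = relative_frequency(original_pt, result_words)
freq_translated = relative_frequency(translated_pt, result_words)
```

Normalising by the number of running words is what makes sub-corpora of different sizes comparable; any real test would also need a principled word list, which is exactly the difficulty raised for Prediction 4.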
Given time and space limitations, for this second kind of prediction I will only use those parts of COMPARA which can be considered directly comparable between English and Portuguese, that is, I will be comparing original Portuguese and translated Portuguese (and only for Prediction 5).5 To investigate Prediction 4, I would need a language-independent criterion to yield comparable and complete lists of lexical items relating to result, or at least two independently built lists of result words for each language. Let me note that it is probable that, as a compensatory mechanism for its lack of focus on result, Portuguese is more concerned with the description of fairly complex temporal patterns that bring forth notions of disposition, fate, probability, or state of affairs, as opposed to individual actions shaping concrete events. If so, this would imply that the grammatical markers of such notions, in opposition to the result-related words, will be more frequent in original Portuguese than in Portuguese translations, but this goes beyond the scope of the present study.

The Empirical Study

In this section, the two sets of predictions outlined above are explored using the COMPARA corpus.

Looking at translation patterns involving the TMA devices

Study 1
Here we concentrate on the 170 instances of PPC identified in COMPARA: Table 12.6 provides a quantitative overview of how the PPC is rendered in English (f = 78), and which English tenses give rise to PPC in Portuguese translation (f = 92). Table 12.6 seems to indicate rough comparability in the distribution of tense use as the English present perfect is the overwhelmingly preferred translation of PPC in 62% of cases (48/78), and also the most common origin of the PPC in translation (in 57% of cases, 52/92).

Table 12.6 PPC translation patterns from and into English

| English tense correspondence | Translation from Portuguese into English (from PPC) | Translation into Portuguese from English (into PPC) |
|---|---|---|
| Present perfect | 48 | 52 |
| Present perfect progressive | 9 | 10 |
| Present perfect progressive PPC prog | 3 | 17 |
| Present perfect PPC prog | 0 | 1 |
| Progressive | 1 | 8 |
| Present | 4 | 2 |
| Past simple | 4 | 0 |
| Pluperfect progressive | 0 | 1 |
| Pluperfect | 1 | 1 |
| Rewriting | 3 | 0 |
| Total | 78 | 92 |

However, a closer look at the data demonstrates that we must distinguish here between stative and eventive predicates. In fact, of all the 52 translations of the present perfect using the PPC in Portuguese, 31 are statives.6 Of the remaining instances, 11 verbs have arguments with a plural number and 5 mass ones (which is also a kind of plurality), therefore allowing for a plausible iterative reading. In three additional cases, the repetition meaning was made clear by other devices, as illustrated in Table 12.7.

Table 12.7 Translations of present perfect with adverbials

| Source text | Target text |
|---|---|
| He gives me the burdened sigh I’ve heard all my life meaning that I’m too obstinate for my own good. | Ele solta o mesmo suspiro sofrido que tenho ouvido toda a minha vida, querendo significar que para meu mal sou demasiado teimoso. |
| ‘I’ll come up every hour,’ Robert said, ‘as I’ve done since he got ill.’ | Passo por cá de hora a hora disse Robert, como tenho feito desde que ele adoeceu. |
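The percentages quoted for Table 12.6, and the share of statives among the 52 present perfects rendered by the PPC, can be rechecked from the counts given in the text. The short Python sketch below does only that arithmetic; the function and variable names are illustrative.

```python
def share(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts as reported in Table 12.6 and in the accompanying text.
from_ppc_share = share(48, 78)   # PPC rendered by the present perfect -> 62
into_ppc_share = share(52, 92)   # PPC originating in a present perfect -> 57
stative_share = share(31, 52)    # statives among those 52 cases -> 60
```

Keeping the raw counts next to each percentage makes it easy to see that the two directions rest on different totals (78 and 92), so the percentages are only roughly comparable.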
Table 12.8 Vagueness in the English source and possible translation errors

| Source text | Target text |
|---|---|
| You have not realized how I have developed. | Não compreendeu como eu me tenho desenvolvido. |
| Fergus has made Gina feel a freak. | O Fergus tem andado a fazer a Gina sentir-se um monstro. |
| ‘I hope everyone is coming to the party,’ says Polly, who has sat up half the night cutting out paper hearts for decorations. | Espero que venham todos à festa diz a Polly, que tem passado metade das noites acordada, a recortar corações de papel para a decoração. |

Differences conveyed by the texts in the two languages are highlighted in Table 12.8. Of special interest are the possible translation weaknesses resulting from the apparent similarity of the present perfect and the PPC and the vagueness of the English source. While English as source asserts one result, the Portuguese translation expresses a repetition of small events of ‘developing’ and ‘making’, which are not stated in the first two original sentences. Likewise, the third example illustrates the choice made by the translator between several nights and a single last night, something not clearly specified in the English original.

When considering translations in the opposite direction, that is, from Portuguese into English, translating PPC into the present perfect, it is interesting to note that a mere five cases are statives in Portuguese, which shows that PPC is mainly put to use to describe (repetitive) actions. When additional temporal information is included in Portuguese, such as an adverbial phrase that makes the iteration explicit, as in Table 12.9, the English translation has no loss of information. But given the very specific import of this Portuguese tense, and the more general import (and set of uses) of the present perfect, it is not surprising to find instances where matters explicitly expressed in Portuguese remain vague, or at most implicit, in English, as in Table 12.10.

Table 12.9 Presentation of the meaning of PPC with co-occurring adverbs

| Source text | Target text |
|---|---|
| Tenho ouvido dizer muitas vezes o que é, senhor Simão. . . | ‘I have heard many times what it is, Simão’. |
| Tenho pensado em você todos os dias. | ‘I’ve thought about you every single day.’ |

Table 12.10 Explicitly repeated events in Portuguese with unspecified repetition in translation

| Source text | Target text |
|---|---|
| É aí que tenho feito muitos conhecimentos. | I have made many acquaintances there. |
| Eu também me tenho sacrificado, Pôncio . . . | ‘I’ve made sacrifices too, Pontius . . . ’ |
| Tenho visto seu nome nos jornais. | I’ve seen your name in the papers. |
| Não lhe tenho escrito por certos motivos particulares, etc. | I have not written you because of some very special reasons, etc. |

The first example in Table 12.10 refers clearly in Portuguese to different occasions, whereas the translation could describe a case where all the many acquaintances had been made on the same occasion. In the next case, and owing to a possible existential interpretation of the English sentence, there is nothing that anchors the statement in an ‘extended now’, up to the present, as is conveyed in Portuguese. Finally, where the third and fourth Portuguese sentences clearly convey a repeated experience of seeing, and a repeated case of ‘not writing’, the English translations could refer to a single occasion of seeing, as well as to the accumulated absence of writing respectively.

On the other hand, a situation was identified in the corpus data in which the Portuguese PPC and English present perfect seemed to be fairly equivalent, namely in relative clauses whose head is a plural count or a mass noun (10 occurrences), as shown in the first three examples in Table 12.11. The fourth example indirectly shows, however, by the addition of the progressive, that if the head of the relative clause is a singular countable noun, a repeated meaning can no longer be inferred, and a bare present perfect English translation would fail to convey the original meaning.

Study 2
Having looked at translations of the Portuguese PPC into English, as well as at translations from English using the PPC, we now move on to the present perfect in English. Given that the potential number of occurrences in COMPARA was too large to inspect manually, a random sample of 500 potential occurrences of the present perfect was selected by identifying instances of has or have followed by a past participle. After eliminating infinitives or cases preceded by modals, 214 occurrences were left: 136 occurrences in English source texts; 78 occurrences in English target texts. As 78 seemed too few, all occurrences from Portuguese to English (N = 316) were then investigated. Table 12.12 displays the translation patterns observed.7
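The retrieval step described above (has or have followed by a past participle, minus infinitives and modal contexts) can be approximated on plain text as follows. This is only a sketch under stated assumptions: the actual query run on COMPARA is not given here, and the participle test (a tiny irregular list plus the endings -ed and -en), together with every name in the code, is a hypothetical simplification that a tagged corpus would make unnecessary.

```python
import re

MODALS = {"will", "would", "can", "could", "may", "might",
          "must", "shall", "should"}
# Tiny illustrative list; a real study would rely on part-of-speech tags.
IRREGULAR = {"been", "seen", "done", "made", "heard", "put", "sat", "thought"}


def looks_like_participle(word):
    """Crude heuristic: irregular list, or ends in -ed / -en."""
    return word in IRREGULAR or word.endswith("ed") or word.endswith("en")


def present_perfect_candidates(text):
    """Return (auxiliary, participle) pairs for has/have + participle."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i in range(len(tokens) - 1):
        if tokens[i] not in {"has", "have"}:
            continue
        if not looks_like_participle(tokens[i + 1]):
            continue
        previous = tokens[i - 1] if i > 0 else ""
        if previous == "to" or previous in MODALS:
            continue  # infinitive ("to have heard") or modal ("must have seen")
        hits.append((tokens[i], tokens[i + 1]))
    return hits


sample = ("Fergus has made Gina feel a freak. He must have seen it. "
          "She seems to have heard. You have not realized how I have developed.")
candidates = present_perfect_candidates(sample)
# → [('has', 'made'), ('have', 'developed')]
```

The heuristic both over- and under-generates: it misses contracted forms and cases with an intervening adverb (has just announced), and it would accept a noun such as children after have, which is one reason the candidates are described as potential occurrences that still need manual inspection.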
Table 12.11 Influence of the mass/plural character of the head of relative clauses

| Source text | Target text |
|---|---|
| Aqui, se alguém deve chorar sou eu; mas lágrimas dignas de mim, lágrimas de gratidão aos favores que tenho recebido de si e de seu pai. | ‘If anyone should cry here, it should be me; but tears worthy of myself, tears of gratitude for the favours I’ve received from you and your father’. |
| Os sacrifícios que Mariana tem feito e quer fazer por mim só podiam ter uma paga, embora mos não faça esperando recompensa. | ‘The sacrifices you’ve made for me and that you want to make for me could only have one legitimate form of payment, even if you don’t make them expecting recompense.’ |
| É inacreditável o que se tem construído de então para cá. | ‘You wouldn’t believe the construction that’s gone on since then.’ |
| ‘Tome, aqui estão suas pedras preciosas e fique tranqüilo que não direi para a polícia qual o método que vocês têm usado para contrabandeá-las para o exterior’. | ‘Here, take your precious stones and rest assured that I’ll say nothing to the police about the method you’ve been using to smuggle them abroad.’ |

Table 12.12 Present perfect translation patterns from and into Portuguese

| Portuguese tense correspondence | Translation into Portuguese of English present perfect | Translation from Portuguese into English present perfect |
|---|---|---|
| Perfeito | 82 (7 já) | 184 (7 já) |
| PPC | 14 (4 prog) | 14 (1 prog) |
| Presente | 11 (3 prog) | 55 (3 prog) |
| Presente Progressivo | | 4 (4 prog) |
| Infinitivo Composto | 8 | 7 |
| Mais-que-Perfeito | 4 (1 já) | 4 (1 já) |
| Presente Conjuntivo | 3 | 8 |
| Futuro Conjuntivo | 3 | 1 |
| Imperfeito Conjuntivo | 1 | 3 |
| acabar de (Perfeito) | 3 | 1 |
| Particípio Passado | 3 | 10 |
| Imperfeito | 2 | 2 |
| Rephrasing | 2 | 14 |
| Total | 136 | 316 |

‘Prog’ indicates present perfect progressive. Figures in parentheses are not additional to those shown, i.e. of 14 translations of Perfeito into the English present perfect, 1 is present perfect progressive.

Table 12.12 shows clearly that it is not the Portuguese PPC, but the Perfeito which is the default translation equivalent for the English present perfect, even though by far the most frequent translation equivalent of the Perfeito into English is the past simple (cf. Santos, 2004). So which particular Portuguese structures are translated by the present perfect? Why does the present perfect appear at all in the English translation, if it is meant to convey a result not present in the Portuguese? First, I looked for how many of these present perfect instances applied to stative verbs in English, but only 22 cases were found. One thing that became clear was that several Portuguese verbs in the Perfeito were Mudanças (see Figure 12.1), with a clear meaning that a change has occurred, while many of the English verbs employed as translations were accomplishments or acquisitions; the fact that the event has been completed must be expressed in English with the perfect. This appears to cover most of the cases of the translation of Perfeito into the present perfect, as the first three examples in Table 12.13 show. The last two cases, featuring Obras instead (something that happens in an extended interval), are particularly interesting because they are somehow the converse of the cases discussed above: they both describe a number of situations fully occurring in the past (thus Perfeito) but seen as ‘summed up’ (thus present perfect). Finally, the second verb in the last example, habituou, a Mudança, is particularly interesting because it is very hard to translate into English, as it depicts a change into one of several kinds of Portuguese states. Its translation usually requires a gradual accomplishment (cf. Santos 1996: 229–230 for further discussion).

Table 12.13 Perfeito translated into present perfect

| Source text | Target text |
|---|---|
| Tirou os óculos ray-ban para mostrar que tem olhos azuis, e é um cafajeste. | He has taken off his Ray-Ban glasses to show he has blue eyes, and he is a scoundrel. |
| A pintura escureceu muito, mas ainda dá idéia de ambos. | The paint has darkened with time, but it still gives a good idea of both. |
| Herdou o orgulho do pai! murmurou Estácio. | ‘She has inherited her father’s pride!’ murmured Estacio. |
| Por esta mesma via escrevemos ao Sr. Manuel Pedro da Silva, a quem novamente prestamos contas das despesas que fizemos com o sobrinho. | At this time we are also writing to Mr. Manuel Pedro da Silva, to whom we once again submit our accounts of the expenses we have incurred on behalf of his nephew. |
| Tantas vezes Ariela conduziu um cliente masculino pelo corredor, tantas portas de apartamentos abriu para dar passagem a um homem, e nem assim se habituou a um protocolo que considera humilhante. | So many times Ariela has led a male client down the hallway, so many doors to apartments opened to make way for a man, and even so she has never become accustomed to a protocol she considers humiliating. |

The data also reveal that a good many cases of the Portuguese Presente (52 simple, 3 progressive) were rendered by the present perfect, only 8 of which included há (‘for’, ‘ago’) and even fewer (3) desde (‘since’). The most notable pattern seems to be a timeless generalisation (or ‘truth’) in Portuguese being turned into a summing up of previous events in English, as in Table 12.14. A number of cases of aspectual class change in Portuguese–English translation could also be identified. The first three examples in Table 12.15 describe a state that can be related to a previous event, and how such an event is expressed by the English sentence in translation. The last three examples illustrate cases where the lexically corresponding translation has different aspectual properties (Aquisições, vague between inception and state in Portuguese; accomplishments in English). Translation errors are a wonderful window on language differences, as the next example shows: while in the Portuguese the adverb hoje (which can mean ‘nowadays’, or ‘today’) reflects a permanent or essential property, the translator has read it as referring to a particular day. See Table 12.16. Finally, the seven cases with já did not reveal any clear pattern.
Table 12.14 Translation of timeless truth Presente into present perfect

| Source text | Target text |
| --- | --- |
| Os poetas exageram muito! | The poets have exaggerated greatly. |
| A história registra inúmeros casos de roubos de caravanas inteiras de peregrinos, e de crimes horríveis cometidos contra os viajantes solitários. | History has recorded innumerable cases of robbery of entire caravans of pilgrims and of horrible crimes committed against lone travelers. |
Table 12.15 Translation of states by the event that caused them

| Source text | Target text |
| --- | --- |
| A indústria não gostaria com certeza, estão ali investidos milhões, | Why, Industry wouldn’t like it, millions have been invested in the project, |
| Digo-te que tens uma raiz de má erva no coração; esta é a cruel verdade. | I tell you that a poisonous herb has taken root in your heart. This is the cruel truth. |
| Acabei agora mesmo de saber que a polícia tem informação de dois casos de cegueira súbita, | I’ve just been told that the police have been informed of two cases of sudden blindness |
| E começo a observar que, nas suas frases de quando em quando interrompidas, aparecem agora também (...) palavras incoerentes | I have started to notice that his frequently broken sentences are interspersed now and then with incoherent, disconnected words, |
| Preserva Mara uma vivacidade juvenil que ainda me espanta, ao fim de todos estes anos. | Mara has maintained a youthful vivacity that still amazes me. |
| L. (...) embarcou num dos destroços da arca de Noé, a que chamamos carruagem; | L. (...) embarked in one of those refugees from Noah’s Ark that we have called carriages |
Moving on to the translation of the present perfect in English source texts, only 11 statives among the present perfects could be found, 4 of which were rendered as the PPC. So, here we would expect either a loss of meaning of the English original, or some kind of compensation using Mudança verbs in the translation and hence an aspectual class shift. In fact, in the majority of cases, the result or resulting state is implicit and is not expressed in any way in the Portuguese translation. Let us look at a few cases where the verbs in the Portuguese translation are Obras (taking time, no result) as shown in Table 12.17.
Table 12.16 Timeless truth turned into a single event

| Source text | Target text |
| --- | --- |
| assisti pasmado à aurora daquela inteligência que os senhores vêem hoje tão desenvolvida e lúcida. | In amazement I assisted at the dawn of that intelligence which you gentlemen have seen today so well developed and lucid. |
Table 12.17 Present perfect rendered by Obras in Perfeito

| Source text | Target text |
| --- | --- |
| What legitimate rights had been recognized, according to the ‘standards of Western civilization’ our white governments have declared themselves dedicated to preserve and perpetuate? | Que direitos legítimos tinham sido reconhecidos, de acordo com os ‘padrões da civilização ocidental’ que os nossos governos brancos declararam comprometer-se a conservar e perpetuar? |
| The wisest doctors of our Church for many centuries have examined every verse of the Holy Scripture just as Herod’s soldiers searched for innocent children, and they have found no chapter, no line, no phrase wherein there is mention of the woodworm. | Os mais sábios doutores da nossa Igreja examinaram durante muitos séculos todos os versículos da Sagrada Escritura, com a mesma atenção com que os soldados de Herodes procuraram os inocentes, e não encontraram nenhum capítulo, nenhuma linha, nenhuma frase que fizesse menção do caruncho. |
| In explanation of some portions of this narrative, wherein I have spoken of the stowage of the brig, (...) | Para mais fácil compreensão dalguns pontos deste relato, no qual tanto falei da arrumação da carga do brigue, (...) |
In the first example, one may detect in the English source a continuing obligation that is lost in the Portuguese translation; in the second, if the present perfect carries the implication that the doctors are still examining the scriptures, referring to an ‘extended now’, the translation does not reflect this, although the English could be interpreted as a present relevance case. The third example is vague between expressing total amount and simply present relevance. Total amount is conveyed explicitly in the Portuguese translation by the addition of the indefinite pronoun tanto (‘so much’). In fact, close inspection of every present perfect occurrence in the English source text (Table 12.12) leads to the conclusion that the factor ‘plurality’ alone (as in the relative clause examples in Table 12.11 above) does not entail repetition of events or states, and while plurality calls for the present perfect in English, this meaning alone may not license the PPC in Portuguese, as the following example in Table 12.18 shows. Further evidence for the claim that the present perfect is vague between (i) expressing a combination of several factors (a total amount) and (ii) a temporal sequence (a possibly unfinished repetition stretching into the present) is the fact that one may consider the translation of the following sentences in Table 12.19 as correct or incorrect respectively.
Table 12.18 Plural present perfect whose translation into Portuguese cannot employ the PPC

| Source text | Target text |
| --- | --- |
| While Morris Zapp is working on this problem, we shall take time out to explain something of the circumstances that have brought him and Philip Swallow into the polar skies at the same indeterminate (for everybody’s watch is wrong by now) hour. | Enquanto Morris Zapp labora no problema, aproveitamos para explicar um pouco as circunstâncias que o colocaram, a ele e a Philip Swallow, nos céus polares à mesma (visto que neste momento não há ninguém com o relógio certo) hora indeterminada. |
Table 12.19 Present perfect sentences showing either a ‘total amount’ or a ‘temporal sequence’ interpretation

| Source text | Target text |
| --- | --- |
| Their bleedings and leakages, their lumps and growths, (...) the mere mention of such things makes him wince and cringe, and lately the menopause has added new items to the repertory: the hot flush, flooding, and something sinister called a bloat. | As hemorragias e corrimentos, inchaços e caroços, (...) a simples menção de tais coisas fá-lo estremecer e encolher-se, e ultimamente a menopausa acrescentou novas rubricas ao reportório: afrontamentos, hemorragias, e uma coisa sinistra chamada intumescimento. |
| Receiverships and closures have ravaged the area in recent years, giving a desolate look to its streets. | Falências e encerramentos devastaram a zona em anos recentes, dando um ar desolado as ruas. |
Finally, the translation of the present perfect as applying to a state that was not rendered by the PPC in Portuguese allows an interesting generalisation: when the situation displayed in English cannot apply to the present moment, Portuguese uses (marked) Perfeito with states, as in the examples in Table 12.20. The cases where the present perfect is translated by já, on the other hand, do not display a regular pattern, as noted earlier for the cases where já is translated by the English present perfect. Some three cases could be interpreted as rendering an existential meaning, two an ‘according to plan’ inference, that is, já marking the new event as expected. The single most general meaning, ‘near the present’ (but not in the sense of hot news), seems to be operational in other cases, as shown in Table 12.21. It appears that já is an idiomatic addition here, not determined by the use of the perfect in the English original; it is a discourse marker that
Table 12.20 Present perfect translated by Perfeito applied to states

| Source text | Target text |
| --- | --- |
| I want you - you who have heard his last words - to know I have been worthy of him... | Quero que o senhor - o senhor, que ouviu as suas últimas palavras - saiba que fui digna dele... |
| ‘It means that every other child Sam has been in contact with ’ | Isto significa que todas as outras crianças com quem Sam esteve em contacto ... |
| (...) she recognised the greeting of someone who has been away and signals his return. | (...) ela reconheceu a saudação de alguém que esteve fora e assinala o seu regresso. |
Table 12.21 Translation of present perfect by já in the sense of ‘near the present of the utterance’

| Source text | Target text |
| --- | --- |
| There are many moralities here in Spain, you have seen that with your own eyes, you have seen that the old conservative view and the new liberal view do not live easily together, that there is still so much (...) | Aqui em Espanha a moral é muito rígida, já viu isso com os seus próprios olhos, já percebeu que as velhas ideias conservadoras e as novas ideias liberais não coexistem com facilidade, que ainda há muitas coisas (...) |
| I see I have missed my train. | Vejo que já perdi o comboio. |
| It was to celebrate this latter event that Mrs. Almond gave the little party I have mentioned. | Foi para celebrar este último acontecimento que Mrs. Almond deu a festazinha que já referi. |
provides an explicit link to the speech event (or the previous event mentioned). Without já, the link would be lost in Portuguese.

Study 3
Analysing separately the 70 occurrences of já followed by Perfeito in COMPARA (34 in original Portuguese texts and 36 in translated texts), a new use was identified: já is often employed to mark a final (or summation) point of a gradual (or additive) situation, as shown in Table 12.22. Furthermore, the use of já with verbs of reporting is strikingly frequent: 26 occurrences of 70, relating what is being said to what had been previously said.
Table 12.22 Já with Perfeito conveying a final point (or summation)

| Source text | Target text |
| --- | --- |
| Qual ... pois se eu também já cantei tudo que sabia. | Aww... I might’ve already sung everything I know. |
| Você, que já comeu cinco mil mulheres, podia me esclarecer se isso é verdade. | Come to think of it, you’re the one who’s had five thousand women, you should be able to set me straight on that. |
| Olhe que já bebeu bastante, papá! | You’ve had enough already, Daddy. |
Table 12.23 Translation between Portuguese já and English already

| Status of text | já to already | já | already to já | already |
| --- | --- | --- | --- | --- |
| Source text | 369 | 1054 | 153 | 202 |
| Translated text | 156 | 840 | 363 | 479 |
| Total | 525 | 1894 | 516 | 681 |
Finally, Table 12.23 displays the overall correspondence between já and already in translation, showing that while 76% (516/681) of the occurrences of already correspond to já in translation into Portuguese, only 28% (525/1894) of the occurrences of já are rendered in English by already.

Study 4
Here we examine the claim that aspectualisers cover some of the same ground as the PPC in Portuguese. Table 12.24 shows a first quantitative overview, which apparently does not support my claim that temporal specification is more fine-grained in Portuguese. In fact, and except for the ir graduality marker and ficar (Santos, 1996: 231-232), all other aspectualisers are more frequent in translation from English than in the original Portuguese. A possible explanation here is that English is more concerned with temporary states and progressive aspect than Portuguese, a language that gives at least as much attention to permanent states. As this is not related to the question of result, it must remain a question for future study. Acabar de, the standard translation of just, is interesting because of its other meaning of finishing something, which leads to vague utterances where both meanings can be conveyed. Its translation pattern into English is displayed in Table 12.25. A careful analysis shows that, in translation from English, acabar is overwhelmingly used in the ‘just’ or recency sense, while in original Portuguese acabar is very often vague
Table 12.24 Aspectualisers in COMPARA in original and translated Portuguese*

| Aspectualiser | Original Portuguese | Translated Portuguese |
| --- | --- | --- |
| andar a infinitivo | 15 | 66 |
| andar gerúndio | 18 | 1 |
| ir gerúndio | 280 | 134 |
| estar a infinitivo | 95 | 750 |
| estar gerúndio | 561 | 257 |
| acabar de infinitivo | 96 | 110 |
| acabar gerúndio | 41 | 10 |
| ficar gerúndio | 148 | 14 |
| ficar a infinitivo | 31 | 51 |

*The absolute frequencies are comparable because the number of words in both subcorpora is almost the same, cf. Table 12.1
Table 12.25 Correspondence of acabar de

| English correspondence | Translation into Portuguese from English (into acabar de) | Translation from Portuguese into English (from acabar de) |
| --- | --- | --- |
| just | 62 | 41 |
| recency adverbs | 20 | 5 |
| finish and other verbs | 4 | 21 |
| Ø | 24 | 24 |
| one other (unrelated) construction | 0 | 5 |
| Total | 110 | 96 |
between the two senses (I counted 14 cases), is more frequently used in the ‘finish’ sense (21 cases), and is quite often mistranslated.8

Study 5
Finally, we turn to lexical features, contrasting original with translated Portuguese in Table 12.26, which shows the frequency of result-related words or constructions in Portuguese, selected on the basis of introspection and dictionary inspection. It is striking that every one of these words
Table 12.26 Result words in COMPARA in original and translated Portuguese, choosing only the result-related senses

| Portuguese word or phrase | Original text | Translated text |
| --- | --- | --- |
| resultado | 30 | 44 |
| consequência | 11 | 36 |
| resultar | 19 | 22 |
| levar a* | 15 | 42 |
| dar origem a | 3 | 77 |
| dar em* | 4 | 8 |
| redundar | 1 | 2 |
| culminar | 0 | 2 |
| derivar* | 0 | 3 |
| fazer com que | 19 | 35 |
| resultante | 3 | 6 |
| sair* | 5 | 1 |
| Total | 105 | 278 |

*All senses of these verbs (as well as related idioms and connotations) that are not connected with result have been omitted
(except for sair ‘end up’) is more frequent in the translations into Portuguese than in the original Portuguese, lending some weight to the claim that English is a more result-oriented language, assuming that the Portuguese translations are influenced by the English source.
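The comparison behind this claim can be checked mechanically. The following is a minimal sketch in Python, not part of the original study: it copies the counts from Table 12.26 into a dictionary (the name `RESULT_WORDS` and the helper function are hypothetical, and the pairing of the displaced label fazer com que with the counts 19 and 35 is an assumption read off the table layout) and lists the items that are not more frequent in translated Portuguese.

```python
# Frequencies copied from Table 12.26 as
# (original Portuguese, translated Portuguese).
# The row for "fazer com que" is an assumed reading of the table layout.
RESULT_WORDS = {
    "resultado": (30, 44),
    "consequência": (11, 36),
    "resultar": (19, 22),
    "levar a": (15, 42),
    "dar origem a": (3, 77),
    "dar em": (4, 8),
    "redundar": (1, 2),
    "culminar": (0, 2),
    "derivar": (0, 3),
    "fazer com que": (19, 35),
    "resultante": (3, 6),
    "sair": (5, 1),
}


def not_more_frequent_in_translation(table):
    """Return the items whose translated count does not exceed the original count."""
    return [word for word, (orig, trans) in table.items() if trans <= orig]


exceptions = not_more_frequent_in_translation(RESULT_WORDS)
print(exceptions)
```

Under these assumptions the only item returned is sair, which matches the exception noted in the text.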
Concluding Remarks

It is notoriously difficult to obtain objective semantic data. One way of approaching this problem is to test predictions derived from hypotheses using translations in a parallel corpus of texts, here in Portuguese and English. The present contribution aimed at corroborating the hypothesis that the Portuguese language gives little attention to result compared to English, while on the other hand encoding more complex distinctions or temporal patterns in the stative realm. The corpus analysis presented here provides contrastive and monolingual frequency data on such tense and aspect devices as já and the PPC in Portuguese, using a publicly available resource, COMPARA, so that further analyses by other scholars are encouraged, and our studies may be replicated and either validated or reinterpreted (Santos & Oksefjell, 1999).
It also provides an illustration of the methodology for contrastive semantics put forward in Santos (2004), by taking into account contrastive vagueness, as well as the existence of different aspectual classes in different languages, and by using translation mistakes and translationese to pinpoint complex differences. In this chapter, we have helped to disentangle the intricate relations between ‘components’ of meaning such as the ‘extended now’, result, relationship with the present time, repetition, plural participants in events, relevance, and kinds of states, in Portuguese and English, with the help of a corpus of human translations.

Acknowledgements

I am deeply grateful to Lauri Carlson for having once challenged my translation intuitions with real data and in this way shaping my future research. I acknowledge gratefully the many comments provided by the editors of this volume, which significantly improved the text. Finally, I would like to thank Fundação para a Ciência e Tecnologia for financing Linguateca (and therefore COMPARA) through grant POSI/PLP/43931/2001, co-financed by POSI.

Notes

1. Throughout the chapter I use the term ‘components’ of meaning, instead of ‘shades’, to indicate that I believe in distinct units adding up to a (molecular) meaning. This can be compared to what Lyons (1977: 317) calls ‘sense-components’ in his description of the componential analysis of the lexicon.
2. Moens (1987: 73-76) considers, following Comrie, universal perfect, existential perfect, stative perfect (or perfect of result) and hot news perfect, while perfect progressive is given another analysis. Sandström (1993: 120-124) reduces perfect to an ambiguous category with two meanings (one for states, one for events). And so on.
3. A very common interpretation of já with Presente is the one that implies immediacy, like in the standard Já vou! (I’m coming!). Já is also known to destroy any habitual Presente or Imperfeito connotations, when co-occurring with these tenses.
4. Campos (1984: 541-542) notes that, in sentences like O João já tinha lanchado quando o Rui chegou (‘J. had already eaten when R. arrived’), já marks unambiguously that the event in Mais que Perfeito precedes the one in Perfeito. Otherwise, a simultaneous interpretation of the events eating and arriving could not be excluded.
5. As Johansson (1998) points out, carefully designed bilingual corpora allow several kinds of contrastive studies.
6. Notwithstanding statements like Moens’ (1987) that the perfect is rarely applicable to stative predicates.
7. The mismatch between the numbers in Table 12.6 and Table 12.12 is due to different ways of identifying the present perfects. In Table 12.6 this was done manually, while for Table 12.12 it was done semi-automatically.
8. Incidentally, the tense distributions in the two translation directions are also significantly different: for example, the English-Portuguese direction presents 25 cases translated using Mais que Perfeito and only 9 using Imperfeito, while in the other direction the opposite trend can be observed.
References

Boléo, M. de P. (1936) O Perfeito e o Pretérito em português em confronto com as outras línguas românicas. Separata de Cursos e Conferências da Biblioteca da Universidade de Coimbra, vol. VI.
Campos, H.C. (1984) Le marqueur "já": étude d’un phenomene aspectuel. Boletim de Filologia XXIX, CLUL, Lisboa, 539-53.
Campos, H.C. (1985) Ambiguidade lexical e representação metalinguística. Boletim de Filologia XXX, CLUL, Lisboa, 113-131.
Campos, H.C. (1987) O pretérito perfeito composto: um tempo presente? Actas do 3.º Encontro da Associação Portuguesa de Linguística (pp. 75-85). Lisboa: APL.
Casanova, M.I.P.G de S. (1985) O aspecto verbal: Um estudo contrastivo de inglês-português. MSc dissertation, Faculdade de Letras da Universidade de Lisboa.
Comrie, B. (1976) Aspect: An Introduction to the Study of Verbal Aspect and Related Problems. Cambridge: Cambridge University Press.
Dahl, Ö. (1985) Tense and Aspect Systems. Oxford: Blackwell.
Elsness, J. (1991) The perfect and the preterite: The expression of past time in contemporary and earlier English. PhD Dissertation, University of Oslo.
Frankenberg-Garcia, A. & Santos, D. (2003) Introducing COMPARA, the Portuguese-English parallel translation corpus. In F. Zanettin, S. Bernardini and D. Stewart (eds) Corpora in Translation Education (pp. 71-78). Manchester: St. Jerome Publishing.
Johansson, S. (1998) On the role of corpora in cross-linguistic research. In S. Johansson and S. Oksefjell (eds) Corpora and Cross-linguistic Research: Theory, Method, and Case Studies (pp. 3-24). Amsterdam: Rodopi.
Lyons, J. (1977) Semantics, 2 vols. Cambridge: Cambridge University Press.
McCoard, R.W. (1978) The English Perfect: Tense-choice and Pragmatic Inferences. Amsterdam: North-Holland Publishing Company.
Moens, M. (1987) Tense, aspect and temporal reference. PhD thesis, University of Edinburgh.
Peres, J.A. (1993) Towards an Integrated View of the Expression of Time in Portuguese. Cadernos de Semântica, No. 14, Faculdade de Letras, Universidade de Lisboa.
Sandström, G. (1993) When-clauses and the temporal interpretation of narrative discourse. PhD dissertation, Department of General Linguistics, University of Umeå.
Santos, D.M. de S.M.P. dos (1996) Tense and aspect in English and Portuguese: a contrastive semantical study. PhD thesis, Instituto Superior Técnico, Lisbon.
Santos, D. (1997) The importance of vagueness in translation: Examples from English to Portuguese. Romansk Forum 5, 43-69.
Santos, D. (2000) The translation network: A model for the fine-grained description of translations. In J. Véronis (ed.) Parallel Text Processing (pp. 169-186). Dordrecht: Kluwer Academic Publishers.
Santos, D. (2004) Translation-based Corpus Studies: Contrasting English and Portuguese Tense and Aspect Systems. Amsterdam/New York, NY: Rodopi.
Santos, D. and Oksefjell, S. (1999) Using a parallel corpus to validate independent claims. Languages in Contrast 2, 117-132.
Vendler, Z. (1967) Linguistics in Philosophy. Ithaca: Cornell University Press.
Chapter 13
Corpora for Translators in Spain. The CDJ-GITRAD Corpus and the GENTT Project
ANABEL BORJA
Introduction

The use of electronic language resources (LRs), such as corpora, lexicons, grammar descriptions and ontologies, is currently the subject of numerous research projects in Spain, and the results obtained in the future could completely change the way translators work today. Research projects on the development of LRs conducted in recent years have shown the potential of these tools and the phenomenon of the translator’s turn in this field. Not only have developments in technology influenced translation research, but translation scholars and professionals have in turn helped to determine the direction in which these corpus technologies have developed. The influence of translation in the development of corpora is evident as corpora are compiled with the needs of their potential users in mind, and translators and the creators of tools for automating translation, such as translation memories (TM), Computer Assisted Translation (CAT) or Automatic Translation (AT) programs, form one of the most important groups of potential users of corpora.

This chapter aims to provide an overview of the corpora available in Spain that could prove useful for translators and translation researchers. The goal is to help translators working with Spanish (as a source or target language) to become familiar with readily available and efficiently exploitable computer tools of this kind. It also discusses the possibilities of exploiting these resources and the Internet in general as a dynamic, open, accessible and continually evolving corpus. With this in mind, a brief introduction is given describing the evolution of the corpus concept and its applications in translating. Then the key concepts related to corpora are briefly defined and the various types and designs of corpora are discussed in order to agree a common terminology that will allow us to refer consistently to the various types of corpora that are available. In the next section we present some of the corpora containing texts available in Spanish. This list includes all the corpus resources we have identified as useful tools for professional translators and translation researchers. The study ends with a description of the CDJ-GITRAD corpus, a Multilingual Corpus of Legal Documents, and the GENTT project, which is currently compiling a multilingual encyclopaedia of specialised texts for translators.1 Both projects have been developed by translation research teams that have adopted a corpus design organised around the concept of genre, and are specifically intended for medical, technical and legal translators working in Spanish.
Evolution of the Concept of Corpus

In recent times, the use traditionally made of corpora in the field of literature, linguistics and lexicography has been extended, paving the way for their application in the teaching of languages, technical writing and translation. From the first corpora, designed exclusively for the use of experts researching the syntactical or morphological aspects of language (for instance, see Biber et al. (1998), Kennedy (1998) and McEnery & Wilson (1996)), we have now reached the point where these tools are in general use for the day-to-day work of language teachers and students, journalists, writers of all kinds of texts and, of course, translators, who find in this resource an excellent tool for observing language in use.

The advent of the Internet is perhaps the most revolutionary aspect of the evolution of corpora. The Internet has made it possible to gain access to thousands of on-line documents, find highly specialised texts, consult texts on the same subject in several languages, locate translated texts, rapidly build corpora and perform linguistic and statistical analysis on corpora using web browsing interfaces. In fact, the Internet itself can be considered a corpus, the largest one in existence and the broadest in scope. However, the Internet has not only made it possible to access corpora and their tools, but has also proved an enormous spur to their development. In a work on the processing of bilingual corpora Abaitua (2002) explores how the Internet boom ‘has greatly increased the demand for technologies with a capacity for multilingual processing: smart browsers, indexing and cataloguing systems, extractors of information, knowledge managers, text generators, summary generators, etc’. All these applications use corpora as their raw material, and researchers in the field of artificial intelligence, computational linguistics and voice recognition who conduct their experiments using corpora can no longer conduct or conceive of research without this most valuable tool. The technology available to these researchers has enabled them to create an infinite variety of types and designs of corpora to test their computer tools. If they need a corpus that does not exist, they will compile it themselves, as they have very powerful scanners and OCR software as well as programs with highly sophisticated text recognition and analysis features.

To end this brief look at the evolution of corpora and their application to translation, I would say we are very close to the philosophy of the DIY corpus. Commercial software programs are available for translators that allow the user (for example an English-Spanish technical translator) to extract texts on a particular subject (for example, dermatology) from the web (medicine portals, on-line dermatology journals in Spanish and English), organise and store them on the basis of pre-established criteria and even automatically align translated texts and extract the specialised terminology. Who could ask for more? As a result of these developments, the concept of corpus has undergone a revolution and corpora are no longer the sacred (and expensive) tools they were some years ago. Today they are instantly available for translators and translation researchers to study and analyse: we can adapt them to our needs and make them our own. With this philosophy in mind, let us look again at the possibilities of designing corpora and what they have to offer translators of various kinds.
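To make the idea of a DIY corpus concrete, the following is a minimal sketch in Python of one of the simplest things a translator can do with a self-compiled collection of texts: a keyword-in-context (KWIC) search. It does not reproduce any of the commercial programs mentioned above; the function name `kwic`, the document names and the two sample sentences are all hypothetical, and real texts would be read from files saved from the web.

```python
import re


def kwic(texts, keyword, width=30):
    """Yield (source, left context, match, right context) for each hit.

    texts maps a (hypothetical) document name to its plain text.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for name, text in texts.items():
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            yield name, left, m.group(0), right


# Hypothetical mini-corpus standing in for texts collected by the translator.
corpus = {
    "derm_en_01": "Atopic dermatitis is a chronic skin condition. "
                  "Dermatitis may flare in winter.",
    "derm_es_01": "La dermatitis atópica es una enfermedad crónica de la piel.",
}

hits = list(kwic(corpus, "dermatitis"))
for name, left, match, right in hits:
    print(f"{name:<12} {left:>30} [{match}] {right}")
```

Each printed line shows the search term with the words on either side of it, which is the basic view from which collocations and terminology in the two languages can be compared.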
Towards an ideal Corpus Design for Translators

Although the concept of corpus is very simple (it is no more than an extensive collection of texts in electronic format), there are many variables relating to the design and applicability of corpora that require a more painstaking analysis. There is an infinite number of ways the texts in a corpus can be selected and compiled, and those chosen will depend on the type of corpus to be created and the resources available (in terms of money, time, knowledge, etc.). The needs of the professional translators or researchers for whom the corpus is intended play an important role in the choice of these variables. When planning the design of corpora for translators the profile and objectives of those who will use the corpus must be accurately defined. Various authors (Abaitua, 2002; Atkins et al., 1992; or McEnery & Wilson, 1996) emphasise the influence that these criteria will have on the resulting corpus.

A number of binary elements are involved in the design of a corpus (dynamic/static, monolingual/multilingual, oral/written, general texts/specialised texts, synchronic/diachronic, full texts/fragments of texts, tagged/untagged, ...). The combination of these elements with each other generates an almost infinite number of potential designs. Let us think of some examples: (a) synchronic monolingual corpus of 18th-century Spanish literary texts, full text, no tagging; (b) diachronic corpus (1900-2000) of legally drafted wills, not translations but comparable originals in each language, bilingual Spanish-English, full text, with morphological tagging; (c) multilingual English/Spanish/French/German corpus of medical texts on oncology from 1950 onwards that are not translations of each other, with annotation of identifying information (date, author, source...); (d) bilingual English-Spanish corpus of translated aligned texts on mixed mobile telephony, with no tagging. In order to create a corpus we will have to decide between the following dichotomies, amongst others:
printed/electronic
on-line/off-line
oral/written
full texts/fragments of text
static/dynamic
synchronic/diachronic
large/small
general/specialised
literary sources/non-literary sources
representative (of a specialised language, variety of language, genre, etc)/non-representative (sample corpus)
organised around a genre/not organised around a genre
monolingual/multilingual
parallel multilingual/comparable multilingual
aligned parallel multilingual/non-aligned parallel multilingual
plain/annotated
for professional use/for research purposes
for general research purposes/for specific research purposes
for professional use in general translation/for professional use in specialised translation
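One way to see how these choices combine is to treat a corpus design as a record with one field per decision. The following is a minimal sketch in Python under that assumption; the class `CorpusDesign`, its field names and its consistency checks are illustrative inventions, not part of any existing corpus tool, and only a handful of the dichotomies listed above are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusDesign:
    """One point in the design space; field names are illustrative only."""
    languages: tuple
    domain: str
    parallel: bool = False    # True: translated texts; False: comparable or monolingual
    aligned: bool = False
    diachronic: bool = False
    full_texts: bool = True
    annotation: str = "none"  # e.g. "none", "morphological", "metadata"

    def __post_init__(self):
        # Some combinations of choices are incoherent and are rejected here.
        if not self.languages:
            raise ValueError("a corpus needs at least one language")
        if self.parallel and len(self.languages) < 2:
            raise ValueError("a parallel corpus needs at least two languages")
        if self.aligned and not self.parallel:
            raise ValueError("only parallel corpora can be aligned")

    @property
    def kind(self):
        """Label the design using the terminology of the list above."""
        if len(self.languages) == 1:
            return "monolingual"
        label = "parallel" if self.parallel else "comparable"
        if self.aligned:
            label = "aligned " + label
        return label + " multilingual"


# Example (d): bilingual English-Spanish aligned translations, no tagging.
telephony = CorpusDesign(("en", "es"), "mobile telephony",
                         parallel=True, aligned=True)
# Example (a): monolingual 18th-century Spanish literary texts, no tagging.
literary = CorpusDesign(("es",), "18th-century literature")

print(telephony.kind)
print(literary.kind)
```

The checks in `__post_init__` reflect the fact that the dichotomies are not fully independent: alignment, for instance, only makes sense once the corpus consists of translated texts in at least two languages.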
Not too long ago, corpora were collections of printed texts, but now when we talk about corpora it is implicitly understood that we are referring to texts in electronic format, normally with annotation, or tagging, of some kind. While the first (electronic) corpora were very costly tools to which only specialists and researchers had access (many were subscription only), today the fundamental characteristic of corpora is their accessibility. Although some can only be obtained on CD, most can be consulted on-line, either after paying a subscription or free of charge. Anyone can access them, at any time and from anywhere via the Internet. The spread of on-line corpora has meant that the proportion of dynamic corpora has increased. These are corpora that are not considered complete at any given time, but are always open to the incorporation of new elements.

In general terms, we could say that translators are only interested in written corpora, but the needs of interpreting and the translation of audiovisual media should not be overlooked, as they may also benefit from the contribution of spoken language corpora. Opting for a synchronic or a diachronic design will depend on whether we want to observe a language at a particular point in time or in the process of evolving.

One aspect in which considerable development has been observed is in the size and degree of specialisation of the corpora being compiled. Today, technological advances mean that corpora that previously took years of work to complete can now be completed in a matter of weeks. Initially corpora consisted mainly of literary texts, but now we find comprehensive corpora on all imaginable subjects (sociology, cardiology, textiles, pottery, law, etc.), consisting of texts of all types and genres (newspaper articles, contracts, technical instructions, children’s writing, and so on). The degree of specialisation may vary from corpora that are entirely general, such as the CRAE (literature, press, science, law, etc.) to corpora devoted entirely to a single field of specialisation, such as the CDJ-GITRAD for legal texts or TURISCOR for tourism (these three corpora are described later in this chapter).

Another fundamental decision to be made when defining the design of a multilingual corpus is its size and the ratio between its size and its representativeness. How representative a corpus is will depend on its intended use, the degree of specialisation of the texts, the number of languages, etc. Obviously, a million-word corpus of Spanish weather reports will be infinitely more representative than a corpus of Spanish literary texts of the same size. As Abaitua (2002: 96) points out: ‘Another criterion for classification concerns the genre and the type of text. This criterion mainly affects the representative character of the corpus. The selection criteria will be very different depending on whether the aim is to design a specialised corpus, such as the Aarhus Corpus on European contract law or create a reference corpus that covers a wide range of styles and registers.’ Corpora organised around the concept of genre constitute a new category that has very useful applications for translation, which we will discuss in greater detail when we talk about two multilingual corpora we are currently developing: the CDJ-GITRAD corpus and the GENTT encyclopaedia of genres.

Within this process of change and improvement, the move from monolingual corpora to multilingual corpora should also be mentioned, and within multilingual corpora, the appearance of parallel aligned multilingual corpora. With rare exceptions most of the corpora that have been compiled in the world have, until recently, been monolingual. Multilingual corpora have appeared in a wide variety of forms, as the ‘language’ variable itself generates many possible combinations. In the case of multilingual corpora the only obligatory requirement is the presence of texts in various languages, but the type of link between them in the different languages can be manifold:
monolingual/bilingual/trilingual/multilingual
comparable corpora (untranslated texts)/parallel corpora (translated texts)
comparable corpora of texts with a particular feature in common/comparable corpora of texts with no features in common
parallel non-aligned corpora/parallel aligned corpora
spontaneous parallel corpora/artificially compiled parallel corpora
unidirectional parallel corpora/multidirectional parallel corpora
multilingual texts by native speakers/multilingual texts by non-native speakers...
Comparable corpora are made up of texts in different languages that may be related in various ways, but are not translations of each other. They may have nothing in common at all, or be on the same subject, of the same genre, or from the same chronological period, etc. Comparable corpora containing texts that are not on the same subject are used for studying general linguistic phenomena and are mainly used by linguists, although they may also help translators to define patterns of language and compare texts. Comparable corpora of texts on the same subject matter are one of the most valuable tools for professional translation. Some examples are the CDJ-GITRAD Corpus of Legal texts in Spanish, Catalan and English, the Aarhus Corpus of Contract Law texts in Danish, French and English and the bilingual English-Chinese corpus compiled by Fung in 1995, used to generate bilingual dictionaries. They are much easier to compile than parallel corpora, provide substantial information in terms of terminology, field and context, and are used mainly to observe the use of specialised language in very restricted fields such as hedge funds, statistics, musical composition, etc. Comparable corpora of texts written at different periods of time are used to observe linguistic changes over the course of time, and are of great interest to the translator, particularly the literary translator of historical texts. Parallel corpora can develop naturally or as the result of a specific translation commission or project. Multilingual international organisations are the most important source of parallel corpora that originate spontaneously. This would be the case of the European Union’s databases of documents, the Hansard base of legislation in French and English in Canada, or the Hong Kong legislation database made up of documents in English and Chinese. 
These naturally produced parallel corpora are the result of a set of particular social circumstances and constitute one of the most important resources for the specialised translator in certain areas such as law, the discipline in which the author of this chapter is working. Their use (by alignment or in some other way)
is one of the areas of research with the greatest potential. In fact, these organisations are currently trying to exploit their resources by putting together AT systems based on vast TM and grammar parsers (the EU Euramis Project is a good example). On the other hand, parallel corpora can also be compiled from existing translations. One of the main problems of parallel corpora compiled in this way might be the selection criteria for the translations to be included and the criteria for validating their quality. These considerations are relevant for corpora intended for professional translators who will use them as reference tools and need to be sure that they meet the necessary quality criteria. However, this should not be considered a problem for translation researchers interested in analysing existing translations as they stand. Parallel aligned corpora are the most useful tool for translators but pose important problems that have not as yet been addressed. As Lawler points out in his review of Botley et al. (2000), ‘Text alignment, it quickly becomes clear, is the outstanding problem in research on multilingual corpora, and thus to the extent that progress has been made in its solution its outstanding success story. The problems that arise in alignment research reprise practically every issue in Natural Language Processing (NLP) and AT (e.g., sentence division, anaphor tracking, ambiguity resolution), and the peculiar limitations of the alignment task make the application of alignment strategies to these broader problems surprisingly productive’. In unidirectional parallel corpora we always find the originals in the same language, whereas in multidirectional parallel corpora the originals may be in any language of the corpus. Unidirectional parallel corpora are typically used for studying the literary works of a writer and their translations into one or more languages. 
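The alignment problem mentioned above can be made concrete with a small sketch. The following Python fragment is a toy illustration only: it pairs sentences of a source text with sentences of its translation by comparing character lengths, allowing one source sentence to match one or two target sentences (and vice versa). The function name `align`, the penalty value and the sample sentences are all assumptions made for this example; it is not the method used by any of the projects or tools cited in this chapter.

```python
def align(src, tgt):
    """Align two lists of sentences using 1-1, 1-2 and 2-1 groupings.

    The cost of a grouping is the absolute difference in character length,
    plus a small penalty (an arbitrary value chosen for this sketch) when
    the grouping is not 1-1. Dynamic programming finds the cheapest path.
    """
    INF = float("inf")
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj, penalty in ((1, 1, 0), (1, 2, 5), (2, 1, 5)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(s_len - t_len) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    if cost[n][m] == INF:
        raise ValueError("no alignment possible with 1-1, 1-2 and 2-1 groupings")
    # Walk back from the end to recover the aligned pairs in order.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((" ".join(src[pi:i]), " ".join(tgt[pj:j])))
        i, j = pi, pj
    return pairs[::-1]


# Invented sample data: the second English sentence is rendered as two
# Spanish sentences, so a purely positional pairing would go wrong.
english = [
    "The contract is void.",
    "The seller shall deliver the goods and the buyer shall pay the price.",
]
spanish = [
    "El contrato es nulo.",
    "El vendedor entregará los bienes.",
    "El comprador pagará el precio.",
]

for source, target in align(english, spanish):
    print(source, "<=>", target)
```

Length alone is a crude signal; the issues listed in the quotation (sentence division, anaphora, ambiguity) are exactly what such a naive approach ignores, which is why real alignment tools combine it with lexical and structural evidence.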
Apart from the pure text, a corpus can also be provided with additional linguistic information, called ‘annotation’ or ‘tagging’. If we look at the type of annotation, within the category of multilingual corpora we find everything from highly structured corpora with very specific tagging aimed at a specific aspect of research (such as an English-Spanish corpus of research articles on sustainable tourism with tagging aimed at comparing the use of certain syntactic structures) to enormous collections of texts in several languages compiled with no particular selection criterion and no annotation. The annotation may be of a different nature, such as prosodic, semantic or historical. Grammatically tagged corpora constitute the most common form of annotated corpora. In a grammatically tagged corpus, the words have been assigned a word class label (part-of-speech tag). The potential uses of corpora depend on how well they have been designed to meet the needs of research or professional use. However, in
order to take full advantage of all the information a corpus contains, it is essential to have proper annotation tools that are suited to each particular corpus. Whether a corpus is annotated and the type of annotation used will depend on the purpose for which it will be used. In the past annotated corpora were used for specific aspects of linguistic research, particularly computational linguistics, but today practically all corpora have some kind of annotation thanks to the new automatic annotation tools. The type of annotation may vary to a great extent. In order to annotate a corpus it is no longer necessary to use expensive software programs that are difficult to use and designed to carry out very specific tasks. Today anyone can obtain very valuable textual information by applying very simple search systems. The ever-increasing diversity of annotation possibilities is also observed in modern corpora, depending on the purpose for which they are designed, and the annotation of corpora is becoming increasingly automatic and standardised (with standards such as the TEI Standard, the Text Encoding Initiative).
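As a minimal illustration of what a grammatically tagged corpus makes possible, the sketch below reads a line in which every word carries a part-of-speech label and then filters and counts by label. The `word_TAG` notation, the tag names (`DET`, `NOUN`, etc.) and the function name `parse_tagged` are assumptions chosen for this example; actual corpora use their own tagsets and encoding conventions, such as those recommended by the TEI.

```python
from collections import Counter


def parse_tagged(line, sep="_"):
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs.

    Tokens without a separator are kept and labelled 'UNK'.
    """
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        if not found:
            word, tag = token, "UNK"
        pairs.append((word, tag))
    return pairs


# Invented sample sentence with illustrative tags.
sample = "El_DET contrato_NOUN es_VERB nulo_ADJ ._PUNCT"

tagged = parse_tagged(sample)
nouns = [word for word, tag in tagged if tag == "NOUN"]
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts)
```

Once the labels are available in this form, searches such as "all adjectives following a given noun" reduce to simple filters over the pairs, which is the kind of query that unannotated text does not support directly.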
Applications of Corpora in Translation Studies and Professional Practice
This initial distinction between corpora for research and corpora for translating is viewed as relevant because it results in very different corpus designs.
Corpora for Translation Studies
The use of corpora for research into translation (also referred to as CTS, Corpus Translation Studies) is a relatively new activity and is still in its early stages. The first initiative was the project started by Mona Baker in Manchester in 1993 with the TEC corpus. Although at that time professional translators were already using these tools to carry out documentation tasks, they had not yet been used in Translation Studies. Today there are still few examples of studies of this kind, and they are mostly restricted to academic circles (PhD dissertations and research projects), but as Malmkjaer (2003: 119) points out: ‘The use in translation studies of methodologies inspired by corpus linguistics has proved to be one of the most important gate-openers to progress in the discipline since Toury’s re-thinking of the concept of equivalence’. Because of the nature of their work, researchers will in principle require extensive reference corpora from which they can extract conclusions applicable to general phenomena. However, corpus design for Translation Studies will once again depend on the particular aspect of the translation phenomenon to be analysed: translation strategies, translation norms, translation equivalence, features of translated texts (Baker 1993,
1999; Laviosa, 1998), shifts in translations of the same originals in the course of time or changes in other parameters (Munday, 1998). Corpora in translation studies are products of human minds, of actual human beings, and, thus, inevitably reflect the views, presuppositions and limitations of those human beings. Moreover, the scholars designing studies utilising corpora are people operating in a particular time and place, working within a specific ideological and intellectual context. Thus, as with any scientific or humanistic area of research, the questions asked in CTS will inevitably determine the results obtained and the structure of the databases will determine what conclusions can be drawn. Of the research studies carried out to date using corpora, three major categories may be mentioned:
Those that examine contrastive aspects, whether at a linguistic or cross-cultural level, using multilingual parallel corpora. Here we would find the studies that look at the various ways of solving a particular translation problem. Kenny (2001), for example, monitors the translation of creative source-text word forms and collocations uncovered in a specially constructed German-English parallel corpus of literary texts.
Those that examine the characteristics of translated texts without taking the originals into account, using monolingual corpora of translated texts in the target language. This is the case of the studies initiated by Mona Baker with her TEC corpora and continued by authors such as Sara Laviosa. These lines of research suggest that translated corpora-based research may fruitfully be used to discover the patterning specific to translational language and the extent of the influence of variables such as source language, text genre or translation mode on translational language. Laviosa’s 1996 study, The English Comparable Corpus (ECC): A Resource and a Methodology for the Empirical Study of Translation, for example, sets out to develop a viable descriptive and target-oriented corpus-based methodology for the systematic study of the nature of translated text.
Those that examine the characteristics of texts in the target language using monolingual corpora of texts originally written in the target language. Through an exhaustive analysis of vast quantities of computerised text, translators can obtain invaluable information on grammar, semantic relationships, the acceptability of certain usages, new or obsolete usage of words, neologisms and even pragmatic aspects of the target language (see Hanks, 1996;
Moon, 1998), which enable the translator to make decisions consistent with those uses.
Corpora for Professional Practice
The translation of specialised texts requires from the translator a range of knowledge that generally exceeds the scope of the translator’s academic and personal training. The notion of the translator as a scholar with an encyclopaedic knowledge of the world has been a constant throughout history. The need for a profound understanding that does not miss the nuances of the original is a requirement common to all specialist areas of translation. The increasing volume of translations and their degree of specialisation have seen professional translators’ need for electronic LRs grow exponentially. In just a few years, the Internet has become the main resource for terminological and conceptual documentation for professional translators, and today it can be said that 95% of the material they need can be found on the web. This has triggered an escalating demand for on-line LRs and tools to enable the information to be searched for and exploited on-line too. In Spain a large number of projects have been set up to create LRs for translators, and there are now many resources of this kind available to professional translators. Monolingual corpora in the target language have proved to be an outstanding terminological tool for specialised translation (Bowker, 1998) and for the training of professional translators (Borja, 2005; Borja & Monzó, 2001). From the data about professional behaviour obtained in Monzó (2002) and the interviews conducted with 200 legal translators during the last year, the GITRAD Research Team is currently preparing a study on professional habits. Although the results have not been published as yet, we can advance some data here. 
Today more than half the professional translators of legal texts in Spain use corpora and the Internet (as an online corpus) instead of traditional printed dictionaries, and when asked which they would give up if they had to choose between the two, many say that they would keep the Internet. In fact they point out that corpora and the Internet make it possible to carry out more intelligent context-dependent searches by locating collocations and to obtain results that a dictionary could never offer. At a textual level, monolingual corpora in the target language and bilingual parallel corpora are cited as a powerful textual resource for translators who can check the appropriateness of their syntax and cohesive devices in original texts on the same subject matter, period of time, level of formality, etc. From the point of view of conceptual documentation that any translator of specialised texts needs to carry out, multilingual comparable corpora are the resource of choice. In some very specialised areas where the translator needs to acquire a certain level of
understanding of the field before being able to translate a text relating to a particular field, the use of multilingual parallel aligned corpora enables the professional translator to look for translated words, expressions, sentences, formulae, culture-marked terms, geographical names, etc. In the field of legal translation, the availability of parallel corpora is greater than in other fields because of the existence of international organisations’ multilingual databases, as has been indicated before. Legal translators also mention the importance of self-made corpora of translated documents as one of the main sources of information for their work but regret the fact that they have not incorporated their translations into TM and find it too demanding and time consuming a task to do it now. In fact, many translators still work with original texts in paper print. Only young translators have started working with TM from the time they start and can apply efficient automatic retrieval systems and translation management methodologies.
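The context-dependent searches that translators describe, looking up a word in context and seeing which words habitually accompany it, can be sketched in a few lines. The Python fragment below is a minimal illustration under stated assumptions: the functions `kwic` and `collocates`, the window sizes and the sample legal text are all invented for this example and do not reproduce the search interface of any corpus mentioned in this chapter.

```python
import re
from collections import Counter


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


def collocates(text, keyword, span=2):
    """Count the words found within `span` tokens on either side of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


# Invented sample text standing in for a corpus of contracts.
contract_text = (
    "The lessee shall pay the rent. The lessor shall deliver the premises. "
    "The lessee shall not sublet."
)

for line in kwic(contract_text, "shall"):
    print(line)
print(collocates(contract_text, "shall", span=1).most_common(3))
```

A printed dictionary can list "shall" and its senses, but only a search of this kind shows which subjects and verbs actually surround it in texts of a given genre.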
Corpora Developers in Spain
This section covers the various resources in the form of electronic corpora that may be used by professional translators or translation researchers working with Spanish, including monolingual corpora, multilingual comparable corpora and multilingual parallel aligned corpora. They all have much to offer translators. Although the information is abundant, we are aware that many very interesting projects will not be mentioned and that much of the information given here will have to be updated within a short space of time because of the rapid progress being made in this field. However, many of the works cited are long-term projects that will remain relevant for a considerable time, as they are sponsored by government institutions or universities. Another aspect worth mentioning about this list of resources is the insight it may provide for translators creating their own resources or researching whether on-line resources exist for their particular field of interest. For the preparation of this section we have personally consulted leaders of groups currently researching computational linguistics and corpora: Joseba Abaitua (DELI Research Group), Alberto Alvarez Lugris (CLUVI Research Group), Gloria Corpas (TURICOR project), Isabel García Izquierdo (GENTT Research Group) and Esther Monzó (GITRAD Research Group), among others. The information obtained from the websites of Manuel Barberó (http://www.bmanuel.org/), Joaquim Llisterri (http://liceu.uab.es/~joaquim/) and David Lee (http://devoted.to/corpora) has also been very useful. Finally, it will be seen that the subject resources are biased in favour of legal and economic topics, which reflects the fact that I am a translator specialising in these
disciplines. I am sure that readers will be able to compensate for this imbalance in their own particular areas of interest. The resources identified belong to three categories: monolingual corpora, multilingual comparable corpora and multilingual parallel corpora. Together with the reference and a brief description, they are defined by using the binary terminology established in previous sections. Before proceeding to describe the corpora that currently exist in Spain, it is illuminating to explore who is currently developing them: this means identifying the most important sources of corpora so that the reader can periodically check the new advances in the field. As we shall see, these are not translators or academic institutions devoted to translation.
Corpora compiled by institutions devoted to the promotion and diffusion of Spanish
The most prestigious, extensive and representative monolingual corpora of Spanish texts are those of the Real Academia de la Lengua Española and the Instituto Cervantes, institutions devoted to the promotion and diffusion of Spanish (language, literature and culture) throughout the world. No translator should neglect to visit the Royal Academy’s website and browse the inspiring resources it has to offer. The monolingual Corpus Diacrónico del Español (CORDE), corpus.rae.es/cordenet.html, compiled by the Instituto de Lexicografía de la Real Academia de la Lengua Española, is made up of texts from three historical periods: Edad Media, Siglos de Oro and Época Contemporánea, and aims to be representative of the Spanish language chronologically. It contains 125 million words, both fiction (verse and prose) and non-fiction texts (scientific, social, press and advertising, religious, historical and legal documents). The monolingual Corpus de Referencia del Español Actual (CREA), http://corpus.rae.es/creanet.html, compiled also by the Instituto de Lexicografía de la Real Academia de la Lengua Española, covers 25 years, from 1975 to 1999. It is still being compiled and will contain literary, journalistic, scientific and technical texts as well as transcripts of oral language and media records. The Miguel de Cervantes Virtual Library, http://www.cervantesvirtual.com, is another treasure for the translator working with Spanish. It is a digital edition project of the bibliographic heritage of Spanish and Latin American culture promoted by the University of Alicante and the Banco Santander Central Hispano, with the collaboration of the Marcelino Botín Foundation. It is a vast bibliographic and documentary repository that can be freely accessed via the Internet. The Linguistic Tools section provides a collection of tools specifically designed for analysing and exploiting digital texts. 
It has an advanced text search engine that allows
words to be searched for within texts, and a concordance tool, which makes it possible to search for words in context. A recent and extremely interesting resource also offered by the Cervantes Institute is the Grammatical Archive of the Spanish Language (AGLE). It is not exactly a corpus of texts, but rather a corpus of grammatical records compiled by the Spanish grammarian Salvador Fernández Ramírez (1896-1983), which brings together an entire set of language phenomena that he intended to use as a corpus for producing the grammar he was planning. The result is a wide-ranging sample of the Spanish language from the Poema del Cid to selections from the contemporary press, from pieces of Latin American literature to a comment heard on the bus, selected by our greatest grammarian. It can be accessed by grammatical categories, author, work, word or expression. While complex to begin with, it is extremely interesting once its mechanism has been mastered (http://cvc.cervantes.es/obref/agle/).
Corpora compiled by publishing houses
Next in order of importance are the monolingual corpora compiled by major publishing houses in order to extract information for lexicons and dictionaries of all kinds. The objective of these projects is to use corpora for extracting terminology and creating ontologies. Outstanding in this section are the corpora of the publishing houses SGEL and VOX. The Corpus CUMBRE offers a set of representative linguistic data for the use of contemporary Spanish, compiled by the publisher SGEL, S.A. under the supervision of a research team of the University of Murcia; although its purposes are grammatical and lexicographic, it can be classified as a general purpose corpus, in view of the diversity of the materials it contains. A free sample of this corpus may be obtained with the purchase of the dictionary Gran diccionario de uso del español actual (Sánchez, 2001). The publishing house SGEL can be contacted at http://www.sgel.es/espanyol/ieindex.htm. 
The second publisher cited, VOX-Biblograf, has compiled its own corpus for the development of dictionaries. The Corpus Textual VOX-Biblograf includes 10,352,337 words: literary texts (9.5%), journalistic (32.5%), scientific (4.5%), technical (29.5%), academia (15.5%), advertising (0.5%), spoken language transcripts (3%) and media broadcasts (5%). It currently participates in the EuroWordNet Project. The EuroWordNet project aims to develop a multilingual database with basic semantic relationships between words for several European languages (Dutch, Italian and Spanish). More information may be obtained through: http://www.vox.es. We have tried to access their corpus but with no success. It seems it will be available through the Spanish EuroWordNet database interface but currently the links to the corpus have restricted access (http://www.illc.uva.nl/EuroWordNet).
The publishing house Diccionarios SM has also compiled a corpus of literary, journalistic, scientific and technical texts for the development of lexicographic work of 60,000 words. Although we have not been able to search this corpus, the company may be contacted at http://www.grupo-sm.com/. However, through its website this publisher offers an interesting resource for translators: the contents of the books that they publish for secondary education (http://www.librosvivos.net).
Corpora compiled by linguistic engineering and computational linguistics groups
Another important source of corpora are the departments of linguistic engineering and computational linguistics devoted to developing computer systems capable of recognising, understanding, interpreting and generating human language in all its forms. Research in this field includes the development of linguistic resources (morphological grammars, formal and computational grammars, electronic lexicons with information in conventional formats such as EAGLES), CAT and AT programs, the development of person-machine interfaces and tools for analysing and using corpora. The vast majority, if not all the systems of language processing, operate with monolingual or bilingual corpora of texts to which various linguistic processors are applied (at the phonological, phonetic, textual, morphological, lexical, syntactical, logical, semantic and pragmatic level). In these contexts, the object of creating very extensive corpora is to provide adequate bases for creating Example Based Machine Translation Systems (EBMT) or Machine Generated Text, both monolingual and multilingual (automatic production of multilingual technical documentation). Projects of this kind have generated numerous multilingual corpora and are expected to generate many more in the near future. Researchers believe that this could mean moving from CAT to AT in fields where significant bilingual corpora exist. 
In many cases these are multilingual projects that combine a specific economic activity (for example tourism) with the development of linguistic technologies (AT; text generation, identification of types of texts or information on the Internet). Many of them receive European funding through actions such as e-Content. The Council has recently approved the e-Contentplus programme. The 4-year programme (2005-08), proposed by the European Commission, will have a budget of €149 million to tackle the fragmentation of the European digital content market and improve the accessibility and usability of geographical information, cultural content and educational material. EU-funded projects such as ELSNET and EUROMAP are behind many regional subprojects. EUROMAP (‘Facilitating the path to market for language and speech technologies in Europe’) aims to provide awareness, bridge-building
and market-enabling services for accelerating the rate of technology transfer and market take-up of the results of European HLT RTD projects. The European Network of Excellence in Human Language Technologies (ELSNET) aims to bring together the key players in language and speech technology, both in industry and in academia, and to encourage interdisciplinary co-operation through a variety of events and services. Mention should also be made of Grup d’Investigació en Lingüística Computacional (GilCub) of the Universitat de Barcelona, which concentrates on researching systems for processing natural language, particularly areas related to AT. EUROTRA, TRADE and MULTEXT are important projects developed by this group. MULTEXT is of particular interest for translators because it has ‘developed a set of generally usable software tools to manipulate and analyse text corpora, together with lexicons and multilingual corpora in seven European languages. It has established conventions for the encoding of corpora and harmonised specifications for computational lexicons, building on and contributing to the preliminary recommendations of the relevant international and European standardisation initiatives. All project results are freely and publicly available.’ In their webpage they also state that they have developed ‘The first freely available large-scale multilingual text corpus for the seven languages’. At the time of writing, the corpus was not available through the Internet but the group can be contacted at http://www.ub.es/gilcub/ingles/projects/european/multext.html. Another group working in this field is Centre de Llenguatge i Computació (CLiC), from the same university, Universitat de Barcelona (http://clic.fil.ub.es/). CLiC has compiled a corpus of 6,000,000 words with morphological and syntactical annotation, the Léxico informatizado del español (Lexesp), containing journalistic texts, literary texts and semispecialised scientific articles. 
CLiC has also compiled a reference corpus of 1,000,000 words with morphological and syntactical annotation, manually validated, which is the result of the fusion of two important lexical resources: 500,000 words from the corpus Lexesp and 500,000 words from the electronic corpus of the newspaper La Vanguardia. Besides these corpora, CLiC has produced bilingual lexicons connected to the lexicosemantic web EuroWordNet: bilingual lexicon English-Spanish (more than 120,000 entries); bilingual lexicon Catalan-Spanish (more than 32,000 entries); bilingual lexicon Catalan-English (in progress). These resources can be searched at http://clic.fil.ub.es/ and are also available on CD-ROM (Sebastián et al., 2000). The DELI Research Group of the University of Deusto, http://www.deli.deusto.es/Resources/LEGE-Bi, was created in 1998, promoted by professors of the Faculty of Language and ESIDE who were interested in digital edition and linguistic engineering (http://www.deli.deusto.es/AboutUs). Some of the projects that are currently being developed by this
group and concerned with electronic corpora are: XTRA-Bi (http://www.deli.deusto.es/AboutUs/Projects/XTRA-Bi/): Extracción automática de unidades bitextuales para memorias de traducción (2000-2001), Main Researcher: Inés Jacob; LEGEBiDUNA (http://www.serv-inf.deusto.es/abaitua/konzeptu/lege2dun.htm): Textos paralelos bilingües en euskara y castellano de las administraciones vascas con etiquetado SGML/TEI-P3 (1995-2000), Universidad de Deusto, Main Researcher: Joseba Abaitua. They also form part of the CORDE project. The bilingual Spanish-Basque LEGEBiDUNA corpus, approximately 7 million words in each language, extracted from the bilingual official bulletins of the Basque Administration (BOA 1990-92, BOB 1989-95 and BOPV 1995), is a very useful resource for translators of legal and institutional terminology. The research group is developing tools for the automatic drafting and translation of administrative documents based on this bilingual corpus. The Computational Linguistics Group of the University of Vigo (SLI, http://webs.uvigo.es/sli/) has developed the corpus CLUVI (Linguistic Corpus of the University of Vigo). According to the information provided on their webpage (http://sli.uvigo.es/CLUVI/info_en.html), ‘The CLUVI is an open textual corpus of specialised registers of contemporary oral and written Galician language developed by the University of Vigo. Since September 2003, the SLI offers the possibility of searching and browsing the main sections of the CLUVI Parallel Corpora (8 million words), that is, the TECTRA Corpus of English-Galician literary texts, the FEGA Corpus of French-Galician literary texts, the LEGA Corpus of Galician-Spanish legal texts, and the UNESCO Corpus of English-Galician-French-Spanish scientific-technical divulgation texts. The public searching and browsing tool designed by the SLI is available at http://sli.uvigo.es/CLUVI’. 
The number of aligned works and language pairs available at this website increases regularly, and with great vitality, as the CLUVI is an academic research project in progress. At the moment, the CLUVI Parallel Corpus webpage (http://sli.uvigo.es/CLUVI/index_en.html) permits the search of four major corpora, TECTRA, FEGA, LEGA and The Unesco Courier (of more than one million words each), as well as other minor parallel corpora now in progress. It should be pointed out that the CLUVI interface also permits browsing of the Legebiduna Corpus of Basque-Spanish administrative texts developed by the DELI (http://www.deli.deusto.es/News) group at the University of Deusto. The co-operative effort between Lancaster University and the Universidad Autónoma de Madrid has made possible the development of the CRATER Corpus. The CRATER project, Corpus Resources and Terminology Extraction, involves three languages: English, French and Spanish. The corpus developed consists entirely of technical texts from the International Telecommunications Union (ITU). The CRATER corpus
has now been completed and consists of 5.5 million words. The texts are tagged with part-of-speech and morphological annotation. This multilingual annotated corpus can be searched at http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html.

The Institut Universitari de Lingüística Aplicada (IULA), http://www.iula.upf.es/corpus/corpus.htm, is the centre for research and postgraduate studies of the Universitat Pompeu Fabra. They are currently compiling a multilingual corpus (Catalan, Castilian, English, French and German) of texts belonging to the areas of economics, law, environment, medicine and information technology. The selection of texts is made by experts in each area; the texts are then classified according to their subject field and annotated according to the SGML standard and the ‘Corpus Encoding Standard’ (CES) developed by EAGLES. The corpus can be searched at http://brangaene.upf.es/cgi-bin/bwananet/seldocsCTplus.pl.

VISL, which stands for ‘Visual Interactive Syntax Learning’, is a research and development project at the Institute of Language and Communication (ISK), University of Southern Denmark (SDU, http://www.sdu.dk/), Odense Campus. Since September 1996, staff and students at ISK have been designing and implementing Internet-based grammar tools for education and research using corpora. In Spanish, the ECI-ES2 and the Europarl-es annotated corpora can be searched at http://corp.hum.sdu.dk/cqp.es.html. The Europarl-es corpus contains 29 million words of parliamentary debates and can be searched without a password. The ECI-ES2 corpus contains 14 million words from the newspaper El Diario and requires a password.

Corpora compiled by university translation or linguistics research teams

As data-intensive methods became more affordable in the 1990s, thanks to advances in computing as well as data collection efforts, empiricist methods have become the approach of choice for many translation researchers at Spanish universities.
The growing interest in developing multilingual and bilingual corpora at the Faculties of Translation and Linguistics is reflected in the shift in orientation of research projects and doctoral theses. Many Corpus-based Translation Studies (CTS) projects are under way, many doctoral theses have incorporated corpora to perform various kinds of experiments and many teachers are co-ordinating the wealth of texts of all types that they can gather with the help of their students. However, owing to space limitations, in this chapter we will only briefly cite the main projects receiving institutional funding and will describe in detail two Genre Comparable Multilingual Corpora, the CDJ-GITRAD Corpus and the GENTT Project. The main advantage of the CDJ-GITRAD and the
GENTT corpora is that they will permit documents to be downloaded in full text, preserving and making evident genre conventions.

A very useful resource is the 100-million-word monolingual corpus of Spanish texts Corpus del español (http://www.corpusdelespanol.org), compiled by Professor Mark Davies of the Department of Linguistics and English Language at Brigham Young University in Provo, Utah, whose areas of activity and research include corpus and computational linguistics and the design and optimisation of linguistic databases. The corpus contains 20 million words from the 1200s–1400s, 40 million from the 1500s–1700s and 40 million from the 1800s–1900s. The 20,000,000 words from the 1900s are divided equally among literary texts, spoken texts and newspapers/encyclopedias. As Davies points out on his website, in addition to being very fast, the search engine allows a wider range of searches than almost any other large corpus in existence. The database can be queried very quickly, usually in just a few seconds even for the most complicated queries. It permits simple exact word searches as well as the much more interesting searches that are the real strength of this corpus: word patterns, words in context, collocations, synonyms, customised lists, etc. We recommend that readers explore this powerful resource.

The interdisciplinary Research Group Oncoterm (terminologists, translators and physicians from the University of Granada) has developed an information system for the medical subdomain of oncology. This information system is intended for health practitioners, relatives and friends of patients with cancer, translators and journalists. A vast corpus of documents on cancer in English and Spanish has been compiled together with a terminological database. The corpus cannot be searched on-line for copyright reasons but queries can be sent to http://www.ugr.es/ oncoterm/intro.html.
However, the terminological database is available at http://www.ugr.es/ oncoterm/alpha-index.html.

A multilingual corpus of tourism contracts (German, Spanish, English, Italian) for automatic text generation and legal translation is the objective of a research group based at the University of Malaga, whose main researcher is Gloria Corpas. According to the information given on their website, a multilingual corpus (both parallel and comparable) will be compiled from tourism and law websites on the Internet. A protocol will be laid out for searching the WWW, and for retrieving, encoding and storing (hyper)texts (http://www.turicor.org).
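To make the idea of a ‘words in context’ search concrete, the following is a minimal sketch in Python of a keyword-in-context (KWIC) concordance over a raw text. It does not reproduce the search engine of any corpus described in this chapter; the function name kwic, the width parameter and the two sample sentences are assumptions introduced purely for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`.

    `width` is the number of characters of context kept on each side.
    (Hypothetical helper, not part of any corpus tool cited in the chapter.)
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text standing in for a corpus document.
sample = (
    "El contrato de viaje combinado obliga al organizador. "
    "El contrato se regirá por la ley española."
)

for line in kwic(sample, "contrato", width=20):
    print(line)
```

Each output line shows one occurrence of the search word with its immediate left and right context, which is the basic view from which collocations and phraseology are usually inspected.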
The CDJ-GITRAD Corpus and the GENTT Project

In 1999, the GITRAD research group of the Universitat Jaume I (Castellón) began to compile an English–Spanish corpus of legal documents starting from a modest collection of Spanish–English legal
texts compiled by the author for her PhD dissertation (1998), then extended in a subsequent research project in which Esther Monzó, Steve Jennings and the author took part. The idea of compiling this corpus arose from the observation that specialised translators constantly compare legal texts in the source and target languages, and that legal translators need to be conversant with the textual typology used in their field of specialisation to ensure that they are observing the necessary textual, social, and in our case legal, conventions (Borja, 2000). Moreover, the information the corpus offers for each legal genre helps the translator respect the conventions of genre that are so important in this field of knowledge. For example, when translating a will, the translator should be familiar with the formal aspects of this genre in the target language. In fact, the conservatism of law is reflected in the repetitive and fossilised character of its textual structures, its phraseology and its specialised lexicon, in such a way that the legal text constitutes a paradigm of stereotyped text susceptible to generic description.

It would therefore seem advisable to have systems of classifying documents for each area of specialisation and, particularly in the case of legal translation, it would be useful to have a taxonomy of texts in the source and the target language that would enable translators to compare terminology, usage and the practical application of the law. The translator should always try to fit the text he or she is going to translate into a conventional textual category that speakers of a particular language will recognise. Legal texts constitute instruments that have a certain form and function in each culture and sometimes there is no equivalence between languages due to the lack of uniformity between different legal systems. Is it a will or a fragment of a textbook on law? Is it a section of an Act, a writ of summons or a judgement?
It is obvious that the translator’s solutions will not be the same in all cases. In order to facilitate research into legal texts and the characterisation of their discursive features, the structural element proposed for the corpus is the cultural and anthropological concept of genre as proposed by authors such as Monzó and Borja (2000). The advantages of a corpus of this kind for translation have already been demonstrated in previous studies which have explored its research and training applications as well as its descriptive and ontological potential (Borja, 2000, 2003, 2005; Borja & Monzó, 2001; Monzó, 2001, 2003).

One of the most direct applications of the corpus developed is its potential for teaching purposes. In fact, from the early days of the project, it has been used in legal translation classes at Universitat Jaume I, where the researchers of the group are engaged as teachers. The teaching
experience has shown us that our description of the legal genres is a didactic tool of undoubted value. The corpus is complemented by a system developed by Steve Jennings (2003) for managing relational databases, which enables multiple searches to be made in an extremely functional and efficient way. The user interface is fast and easy to use, and allows the documents to be recovered in text format. The CDJ-GITRAD corpus of legal documents (Spanish-Catalan-English) can be searched on-line: http://www.cdj.uji.es. The documents can be downloaded as full text for research purposes after applying for a password through the same web address.

In 2000, the CDJ-GITRAD corpus merged with the GENTT project to create an encyclopaedia of original texts, comparable texts (and at a later stage, also translated aligned texts) in the legal, scientific and technical field, classified by genre. The intercultural approach to translation adopted by the GENTT research group assumes that the specialised translator needs information of three kinds: thematic, textual and linguistic (García & Monzó, 2004). When in possession of this information, translators can improve their knowledge, both linguistic and extralinguistic, through self-directed learning. The original organisation of this corpus for the legal division (Legal system > Branch of Law > Subbranch > Genre > Subgenre) generates a classification that is extremely useful for translators, who can easily place the text on which they are working in the taxonomy and compare it with the equivalent genre in the legal system of the target language. This methodology is applied to the other two divisions of the corpus: the medical and the technical divisions. Moreover, this classification is complemented by a system of crossed searches that combines, among other data, the original language, the status of the text (original or translation), the date of creation and the source. More information can be found at http://www.gentt.uji.es.
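The five-level classification and the crossed searches described above can be pictured with a small data-structure sketch in Python. This is an illustration under stated assumptions, not the relational system developed by Jennings (2003): the class CorpusDocument, the function search, the category labels and the three sample records are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CorpusDocument:
    """One corpus record (hypothetical structure, for illustration only)."""
    title: str
    language: str          # original language of the text
    status: str            # "original" or "translation"
    created: date          # date of creation
    source: str
    # Path through the taxonomy:
    # legal system > branch of law > subbranch > genre > subgenre
    classification: tuple


def search(documents, prefix=(), **criteria):
    """Crossed search: keep documents whose classification starts with
    `prefix` and whose fields equal every keyword criterion."""
    prefix = tuple(prefix)
    results = []
    for doc in documents:
        if doc.classification[:len(prefix)] != prefix:
            continue
        if all(getattr(doc, name) == value for name, value in criteria.items()):
            results.append(doc)
    return results


# Invented sample records; the labels are not taken from the GENTT corpus.
docs = [
    CorpusDocument("Last will and testament", "en", "original",
                   date(1998, 5, 4), "hypothetical law firm",
                   ("Common law", "Private law", "Succession",
                    "Will", "Simple will")),
    CorpusDocument("Testamento abierto", "es", "original",
                   date(1999, 3, 12), "hypothetical notary",
                   ("Civil law", "Derecho civil", "Sucesiones",
                    "Testamento", "Testamento abierto")),
    CorpusDocument("Writ of summons", "en", "original",
                   date(2001, 9, 20), "hypothetical court",
                   ("Common law", "Procedural law", "Civil procedure",
                    "Writ", "Writ of summons")),
]

# All English-language originals filed under the common-law system.
for doc in search(docs, prefix=("Common law",), language="en", status="original"):
    print(" > ".join(doc.classification), "|", doc.title)
```

Storing the classification as an ordered path means a translator's query can stop at any level (a whole legal system, a branch, or a single genre) and still be combined with language, status, date or source.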
Conclusions

The empiricist trend is rapidly gaining ground in Translation Studies as new methods and research resources become available. Corpora, which until now have mainly been used for analysing aspects of cross-cultural linguistics, are unquestionably useful for Translation Studies. Translation researchers can obtain very useful information from raw (untagged) monolingual corpora, but multilingual corpora containing text files in several languages (segmented, aligned, parsed and classified), which allow aligned multilingual texts to be stored and retrieved under various search conditions, have proved to be an invaluable tool. On the other hand, the process of professionalisation and specialisation that translation has undergone since the mid-20th century has resulted in
something akin to an identity crisis in the translator that has become more acute in recent decades with the rapid development of information technologies. The new technologies have facilitated the appearance of expert systems for organising knowledge that render obsolete the more traditional modus operandi of the scientific and professional community. The high degree of specialisation required by certain types of translation today (such as legal or medical translation) makes it necessary to find new systems for obtaining and recovering knowledge and data which even the best and most encyclopaedic human mind cannot match. Today a mixed system of knowledge management is needed that enables translators to integrate their skills with electronic information management and recovery systems. Corpus resources are the basis of any such system and will very soon replace the traditional dictionaries and encyclopaedias.

There is a vast variety of corpus designs and, as a result, when discussing the design of corpora for translators we need to define the profile and objectives of the corpus users with great precision. Since there is no single translator profile and no single approach to Translation Studies, we cannot speak of one ideal corpus design for translators, only of specific designs for specific translation or research purposes. Translators can exploit and apply corpora in many ways (performing terminological or statistical searches, finding collocations, looking for functional equivalences, observing the texture of genres, investigating phraseology, etc.) and it is possible for all these tasks to be performed on the Internet with no need to download programs or databases. The interests are very varied, showing that a number of different working possibilities are opening up for translation researchers and practitioners.
The corpus resources available in Spanish identified in this contribution are only a small sample of what we can expect to find in the near future. Unfortunately for translators, most of these resources are only available through Internet browsing tools, which permit terminological and collocation searches but do not, generally, allow the reading of full texts due to copyright restrictions. Corpora based on genre provide the end user with full texts showing genre conventions and structure. A good example of what the future might bring in this field is WebCorp, a suite of tools available at http://www.webcorp.org.uk, which allows free access to the World Wide Web as a corpus and is one of the most exciting tools that I have found.

Note

1. This research has been carried out within GENTT (Gèneres textuals per a la traducció/Textual Genres for Translation), a project being developed at the Universitat Jaume I (Castelló, Spain), funded by the Ministerio de Ciencia y Tecnología (BFF 2002-01932) and the Ministerio de Educación y Ciencia (HUM 2006-05581).
References

Abaitua, J. (2002) Tratamiento de corpora bilingües. In M.A. Martí and J. Llisterri (eds) Tratamiento del lenguaje natural (pp. 61–90). Barcelona, Spain: Edicions Universitat de Barcelona. On WWW at http://paginaspersonales.deusto.es/ abaitua/konzeptu/ta/soria00.htm. Accessed 19.1.07.
Atkins, S., Clear, J. and Ostler, N. (1992) Corpus design criteria. Literary and Linguistic Computing 7 (1), 1–16.
Baker, M. (1993) Corpus linguistics and translation studies: Implications and applications. In M. Baker, G. Francis and E. Tognini-Bonelli (eds) Text and Technology: In Honour of John Sinclair (pp. 233–250). Amsterdam and Philadelphia: John Benjamins.
Baker, M. (1999) The role of corpora in investigating the linguistic behaviour of professional translators. International Journal of Corpus Linguistics 4 (2), 1–18.
Biber, D., Conrad, S. and Reppen, R. (1998) Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Borja Albi, A. (2000) El texto jurídico inglés y su traducción al español. Barcelona: Ariel.
Borja Albi, A. (2003) La investigación en traducción jurídica. In García Peinado and Ortega Arjonilla (dirs) Panorama actual de la investigación en traducción e interpretación. Granada: Atrio.
Borja Albi, A. (2005) Organización del conocimiento para la traducción jurídica a través de sistemas expertos basados en el concepto de género textual. In I. García Izquierdo (ed.) El género textual y la traducción. Reflexiones teóricas y aplicaciones pedagógicas. Bern: Peter Lang.
Borja Albi, A. and Monzó Nebot, E. (2001) Aplicación de los métodos de aprendizaje cooperativo a la enseñanza de la traducción jurídica: cuaderno de bitácora. In F. Martínez Sánchez (ed.) EDUTEC’01. Congreso Internacional de Tecnología, Educación y Desarrollo Sostenible. Murcia: Edutec, CAM.
Botley, S., McEnery, A. and Wilson, A. (eds) (2000) Multilingual Corpora in Teaching and Research. Amsterdam: Rodopi.
Bowker, L. (1998) Using specialized monolingual native-language corpora as a translation resource: A pilot study. Meta XLIII (4).
García, I. and Monzó, E. (2004) Traducir con corpus de géneros. Revista de la Facultad de Lenguas Modernas (pp. 45–59). Lima, Peru: Universidad Ricardo Palma.
Hanks, P. (1996) Contextual dependency and lexical sets. International Journal of Corpus Linguistics 1 (1), 75–98.
Jennings, S. (2003) The Development of Software Tools for Corpus-Based Genre Research in Translation Studies: A Practical Application in the Context of the Gentt Project [treball d’investigació]. Castelló de la Plana, Departament de Traducció i Comunicació, Universitat Jaume I.
Kennedy, G. (1998) An Introduction to Corpus Linguistics. Amsterdam: Rodopi.
Kenny, D. (2001) Lexis and Creativity in Translation. Manchester: St. Jerome.
Laviosa-Braithwaite, S. (1996) The English Comparable Corpus (ECC): A resource and methodology for the empirical study of translation. PhD Thesis, UMIST, Manchester, UK.
Laviosa-Braithwaite, S. (1998) The English Comparable Corpus: A resource and a methodology. In L. Bowker, M. Cronin, D. Kenny and J. Pearson (comp.) Unity
in Diversity? Current Trends in Translation Studies. Manchester: St. Jerome Publishing.
Malmkjaer, K. (2003) On a pseudo-subversive use of corpora in translation training. In F. Zanettin, S. Bernardini and D. Stewart (eds) Corpora in Translator Education (pp. 119–134). Manchester: St. Jerome.
McEnery, T. and Wilson, A. (1996) Corpus Linguistics. Edinburgh: Edinburgh University Press.
Monzó Nebot, E. (2001) El género textual: un concepto clave en la enculturación del traductor. In A. Barr, M.R. Martín Ruano and J. Torres del Rey (eds) Últimas corrientes teóricas en los estudios de traducción y sus aplicaciones. Salamanca: Ediciones Universidad de Salamanca.
Monzó Nebot, E. (2003) Corpus-based teaching: The use of original and translated texts in the training of legal translators. Translation Journal 7 (4). On WWW at http://accurapid.com/journal/26edu.htm. Accessed 19.1.07.
Monzó Nebot, E. and Borja Albi, A. (2000) Organització de corpus. L’estructura d’una base de dades documental aplicada a la traducció jurídica. Revista de Llengua i Dret 34, 9–21.
Moon, R. (1998) Fixed expressions and idioms in English: A corpus-based approach. In Oxford Studies in Lexicography and Lexicology. Oxford: Oxford University Press.
Munday, J. (1998) A computer-assisted approach to the analysis of translation shifts. Meta XLIII (4), 142–156.
Olohan, M. (2004) Introducing Corpora in Translation Studies. London: Routledge.
Sánchez, A., Cantos, P. and Simón, J. (2001) Gran Diccionario de Uso del Español Actual. Corpus CUMBRE del español contemporáneo de España e Hispanoamérica. Extracto de dos millones de palabras. Editorial SGEL.
Sebastián, N., Cuetos, F., Martí, M.A. and Carreiras, M.F. (2000) LEXESP: Léxico informatizado del español. Edición en CD-ROM. Barcelona: Edicions de la Universitat de Barcelona (Col.leccions Vàries, 14).