Perspectives on Arabic Linguistics XIX
AMSTERDAM STUDIES IN THE THEORY AND HISTORY OF LINGUISTIC SCIENCE General Editor E.F.K. KOERNER (Zentrum für Allgemeine Sprachwissenschaft, Typologie und Universalienforschung, Berlin) Series IV – CURRENT ISSUES IN LINGUISTIC THEORY Advisory Editorial Board Lyle Campbell (Salt Lake City); Sheila Embleton (Toronto) Brian D. Joseph (Columbus, Ohio); John E. Joseph (Edinburgh) Manfred Krifka (Berlin); E. Wyn Roberts (Vancouver, B.C.) Joseph C. Salmons (Madison, Wis.); Hans-Jürgen Sasse (Köln)
Volume 289
Elabbas Benmamoun (ed.) Perspectives on Arabic Linguistics XIX Papers from the nineteenth annual symposium on Arabic Linguistics, Urbana, Illinois, April 2005
Perspectives on Arabic Linguistics XIX Papers from the nineteenth annual symposium on Arabic Linguistics, Urbana, Illinois, April 2005
Edited by
Elabbas Benmamoun
University of Illinois, Urbana-Champaign
JOHN BENJAMINS PUBLISHING COMPANY AMSTERDAM/PHILADELPHIA
The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences — Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Perspectives on Arabic linguistics XIX : Papers from the nineteenth annual symposium on Arabic Linguistics, Urbana, Illinois, April 2005 / Edited by Elabbas Benmamoun. (Amsterdam studies in the theory and history of linguistic science. Series IV, Current issues in linguistic theory, ISSN 0304-0763 ; v. 289) Includes bibliographical references and index. ISBN 978 90 272 4804 6 (Hb; alk. paper) © 2007 – John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher. John Benjamins Publishing Co. • P.O.Box 36224 • 1020 ME Amsterdam • The Netherlands John Benjamins North America • P.O.Box 27519 • Philadelphia PA 19118-0519 • USA
CONTENTS

Acknowledgments
Foreword
    Elabbas Benmamoun

Section I: Computational and Corpus Linguistics
Systematicity in the Arabic Mental Lexicon
    Ilana Bromberg
Arabic PAPPI: A Principles and Parameters Parser
    Sandiway Fong
Corpus-based Linguistic Analyses: Testing Intuitions about Arabic Structure and Use
    Salem Ghazali
Learning Arabic Morphology Using Statistical Constraint-Satisfaction Models
    Paul Rodrigues and Damir Ćavar
Learning to Use the Prague Arabic Dependency Treebank
    Otakar Smrž, Petr Pajas, Zdeněk Žabokrtský, Jan Hajič, Jiří Mírovský, and Petr Němec

Section II: Phonology, Morphology, and Syntax
Intonational and Rhythmic Patterns across the Dialect Continuum
    Salem Ghazali, Rym Hamdi and Khouloud Knis
Roots and Patterns in Arabic Lexical Processing
    Abdessatar Mahfoudhi
Affrication in North Arabic Revisited
    Eiman Mustafawi
The Syntax of Complex Tense in Moroccan Arabic
    Hamid Ouali and Catherine Fortin
On Agree and Postcyclic Merge in Syntactic Derivations: First Conjunct Agreement in Standard Arabic
    Usama Soltan

Section III: Sociolinguistics and Second Language Acquisition
Null Subjects Use by English and Spanish Learners of Arabic as an L2
    Mohammad T. Alhawary
Linguistic Diversity: The Qaaf across Arabic Dialects
    Maher Bahloul
Arabic Sociolinguistics and Cultural Diversity in Morocco
    Moha Ennaji
The Gendered Use of Arabic and Other Languages in Morocco
    Fatima Sadiqi

Index of Subjects
ACKNOWLEDGMENTS

The Nineteenth Annual Symposium on Arabic Linguistics was held at the University of Illinois at Urbana-Champaign in March 2005. The Symposium was sponsored by the Arabic Linguistic Society and the Department of Linguistics at the University of Illinois at Urbana-Champaign. Additional support was provided by a number of departments and centers at the University of Illinois including the Center for African Studies, the Center for Global Studies, the Program in South Asian and Middle Eastern Studies, the Center for Advanced Studies, and the Beckman Institute. I am indebted to all the reviewers for their help with the selection and editing of the papers that are included in this volume. I would also like to thank Hala Jawlakh and Bezza Ayalew for their assistance.
FOREWORD
The fourteen papers in this volume engage various issues in Arabic linguistics. The majority of the papers rely on quantitative methods to analyze data from corpora or data elicited from speakers using experimentally grounded methods. While most of the papers focus on Standard Arabic, some deal with spoken colloquial dialects from the Maghreb and the Gulf region.

Section I includes five papers that deal with computational and corpus-based studies of Arabic. The topic of the paper by Bromberg is the relation between form and meaning. More precisely, the paper studies the correlation between the phonetic form of a word and its meaning in Arabic. Bromberg bases her study on the analysis of 1,000 words selected for their frequency from the Linguistic Data Consortium’s Agence France Presse corpus. The author’s aim is to see whether there is predictable similarity along the semantic and phonetic dimensions. She claims that to a certain extent such correlation exists. Then she explores the psycholinguistic implications of the study, particularly whether the observed systematic relation between form and meaning can facilitate acquisition.

The paper by Fong describes the properties of PAPPI, a multilingual parser, as it is implemented to handle Arabic clause structure analyzed in the Principles and Parameters (P&P) framework. The author relies on assumptions and principles posited in the P&P framework to develop a parser that captures patterns that relate to clause structure, word order, agreement, placement of verbs, etc. There are not many parsers, in the public domain at least, that have been developed for Arabic using the P&P framework.

Ghazali provides a large corpus study of the distribution of a number of Arabic words and grammatical particles. By looking at the collocation and colligation patterns, the author is able to demonstrate that some nearly synonymous words have different distributions depending on the words and expressions they co-occur with.
This finding, based on extensive corpus investigations, shows the limits of current studies of the Arabic lexicon based on limited dictionary definitions. The results will be beneficial to researchers and teachers of Arabic alike.
The paper by Rodrigues & Ćavar discusses a machine learning model of Arabic morphology. Assuming a root-based system for Arabic morphology, they developed an unsupervised constraint-based, statistical learning model that does not rely on the use of a dictionary, as previous models have done. The success rate of the model in learning the root system and deciding whether a consonant is part of the root system is 75%. The authors deal mainly with roots that contain three radicals.

Smrž, Pajas, Žabokrtský, Hajič, Mírovský, & Němec outline the data structures of the Prague Arabic Dependency Treebank (PADT), available from the Linguistic Data Consortium (LDC). The corpus contains annotated newswire Modern Standard Arabic texts based on the Arabic Gigaword (LDC). They illustrate the working of the corpus and its annotated data by focusing on searching for instances of Arabic improper annexation (part of the Construct State pattern), which involves a complex semantic relation between its members.

Section II includes papers on Arabic phonology, morphology, and syntax. The paper by Ghazali, Hamdi & Knis takes up the issues of phonological and prosodic differences between various dialects from the western and eastern regions of the Arab world. For example, experimental studies of intonation patterns in Egyptian Arabic, Syrian Arabic, Iraqi Arabic, Moroccan Arabic, and Tunisian Arabic reveal differences between the dialects. Thus, while the eastern dialects exhibit the declination phenomenon, Moroccan Arabic does not seem to display the same pattern. On the other hand, Iraqi Arabic displays a falling pattern (HL) and seems to be almost unique. These differences are, of course, on top of other differences such as vowel duration.

Focusing on one Eastern dialect, Qatari Arabic, Mustafawi discusses affrication of the voiced velar stop. Departing from previous studies, she argues that affrication is confined to contexts where the voiced velar stop is adjacent to high front vowels.
One critical factor in the analysis concerns the fact that the affrication process is limited to the stem. Another interesting fact is that the process applies only within restricted paradigms. For example, it does not apply to broken plurals, verbs, and participles. The analysis is framed within the Optimality Theoretic framework and the author proposes a number of rankable constraints to derive the observed patterns in Qatari phonology.
The role of the root in Arabic morphology is explored by Mahfoudhi from a psycholinguistic perspective. He revisits the longstanding debate regarding the nature of lexical relations in Arabic, whether they are root-based or stem-based. He also explores the role of the patterns in anchoring those relations. Based on experiments he comes to the conclusion that the root has a priming effect. This effect is not dependent on any shared orthography or meaning. The same claim cannot be made about the patterns (both sound and weak), which did not display the same priming effects. The results of the study are in line with recent studies that have argued for the psychological reality of the root with regard to the organization of the Arabic lexicon. The question that remains to be addressed is how these results can be reconciled with other studies that have shown that stems seem to play a role in establishing lexical relations.

Ouali & Fortin add to the debate about clause structure in Arabic. The authors focus specifically on complex tenses in Moroccan Arabic and provide an analysis that is in the spirit of the P&P framework and its minimalist incarnation. They discuss the dependency relations that exist between tense and aspect in Moroccan Arabic, which they derive through a selectional relation between the two heads (and their projections). They also put forward an analysis whereby particles that elsewhere are analyzed as aspectual markers (such as the ka/ta morpheme) are claimed to realize the present tense. One innovative idea they advance is that complex tenses are biclausal, contrary to what has been argued before.

Also working within the minimalist version of the P&P framework, Soltan deals with the ongoing debate about subject agreement and coordination in Arabic. The paper focuses mainly on Standard Arabic and tries to account for why agreement is with the first conjunct when the latter is in the postverbal position, i.e., in the VS order.
This has been a matter of concern for a number of students of Arabic. The question is why agreement in the SV pattern is with the whole coordinate structure. To derive the phenomenon of first conjunct agreement, the author argues that a simple analysis deploying the operation Agree, as understood in recent minimalist studies, can explain the facts if one assumes that in coordination the second conjunct is in fact an adjunct that is essentially not present at the point where the agreement relation is computed. This ensures first conjunct agreement. With regard to the
SV order, full agreement follows from the assumption that the real subject is a pronominal that is related to the preverbal conjunct, an assumption that, while not uncontroversial, has been advanced and adopted by a number of students of Arabic.

Section III includes papers in Arabic sociolinguistics and acquisition of Arabic as a second language. Bahloul’s paper provides an analysis of the distribution of the phoneme /q/ and its many variants in eighteen Arabic dialects from the Maghreb to the Gulf. The data was collected based on questionnaires given to native speakers of the relevant dialects. In addition to the well-known variants of /q/, namely the voiced velar /g/ and the glottal stop, some eastern dialects of Arabic (but also some pockets of North Africa) display other variants such as /k/. Bahloul shows that within the same country and/or region two variants may have a distribution according to the urban/non-urban split. For example, in Syria the glottal stop variant is found in the major cities, such as Damascus, Hama, and Homs, while /g/ is found in the rural areas. In fact, this split seems to hold across the eastern Mediterranean region, including lower Egypt. The author argues that the distribution of /q/ and its many variants leads to a linguistic map that divides the whole region into five main areas. Of course it remains to be seen how the distribution of /q/ and its variants lines up with other linguistic features that distinguish Arabic dialects from each other.

The issue of multilingualism in contexts where Arabic is spoken as a native language is the topic of Ennaji’s paper. The paper discusses the complexity of the linguistic situation in Morocco, where five languages compete for space: Moroccan Arabic, Classical Arabic, Standard Arabic, Berber, and French (and one can add English as well).
Ennaji uses the term quadriglossia to refer to the situation with Arabic, echoing Ferguson who uses the term diglossia (for colloquial and standard Arabic) and Youssi who uses the term triglossia (which adds educated/middle Arabic to the mix). The four varieties according to the author include Classical Arabic, Standard Arabic, Educated Spoken Arabic and Moroccan Arabic. Given this complex linguistic space it is inevitable that competition would exist between the different media. Ennaji traces the history of the linguistic situation in Morocco and the cultural, social, and political dimensions and the factors that have played a role in marginalizing or strengthening a particular language or variety at the expense of the others. For example, Berber, the original
language of the country and the region, has not been given the space that is commensurate with its history and demographic weight. However, Berber has recently been introduced in schools in Morocco and there seems to be a strong push to give it a more prominent role. The author provides a succinct but informative overview of the different facets of the debate and its history.

Staying with the same linguistic context of Morocco, Sadiqi turns her attention to language and gender in the country. She focuses on the interplay between Arabic, French and Berber. Given the colonial history of the country, its high illiteracy rate, particularly among women, and its ethnic make-up, language naturally reflects how these issues relate to gender and how language use evolves with the changing position of women in Moroccan society. Thus, according to the author, Standard Arabic used to be predominantly a male language, partly due to the fact that it is closely associated with the Islamic faith, whose leadership and public figures used to be exclusively male. However, this situation has started to change with the increasing prominence of women who write in Standard Arabic and use it to engage in religious debates. Such use of Standard Arabic is seen as a form of empowerment for women who have reclaimed Standard Arabic as a vehicle for their own discourse. With respect to Berber, the author argues that women played an important role in maintaining the language and, with it, Berber identity. The author goes as far as to put forward the thesis that the fate of Berber parallels the fate of women in Morocco. More rights for women in Morocco have also been accompanied by more rights for Berber. On the other hand, Moroccan Arabic is less associated with a specific gender, but French is used by women to reflect their social prestige while males use it to assert their economic and political positions.
Though the issues are complex given the many factors at play in a society where gender issues play a major role in the social system, it is to be expected that language would reflect those dynamics. Acquisition of Arabic as a second language is the main topic of Alhawary’s paper. The paper focuses particularly on the status of the null subject (sentences without overt subjects) in the language of Arabic learners whose first language is Spanish or English. Spanish is a null subject language while English is not. Moreover, like Arabic, Spanish drops the subject because it can be retrieved from agreement
on the verb. Thus, there is a tight connection between the presence of null subjects and agreement. The paper seeks to investigate the distribution of the null subject in the production of Arabic data by native Spanish and American English speakers, the issue of transfer, i.e., whether native language use of the subjects carries over to Arabic, and the connection between the development of the null subject and agreement inflection in their Arabic data. The subjects of the study are all college students with no prior exposure to Arabic in high school or at home. One interesting finding of the experimental study is the early presence of null subjects in the Arabic data of native English speakers relative to native Spanish speakers, which is surprising since the latter have null subjects in their own native language. Moreover, English speakers also acquired the use of agreement inflection early, which shows the link between null subjects and agreement. The Spanish speakers did “improve” eventually but the differences in the production data for beginners are quite striking.

Elabbas Benmamoun
University of Illinois, Urbana
September 2007
Section I
Computational and Corpus Linguistics
SYSTEMATICITY IN THE ARABIC MENTAL LEXICON
Ilana Bromberg The Ohio State University
1. Introduction

1.1 L’Arbitraire du signe

The relationship between the form of a word and its meaning is arbitrary. This concept, first espoused by Ferdinand de Saussure (1916), states that there is nothing inherent about a worldly object or event that links it to the name it is given in any particular language. Nor is there any inherent property of the sounds of a word that causes it to be linked to a specific meaning. It is this property of languages that allows humans to create new names for new ideas and objects, and it is the reason that a single object has different names in different languages or many names in a single language.

1.2 Deviations from Saussure

However, there are phenomena in natural language that cause one to question the complete arbitrariness of this relationship. For instance, most languages have some words that can be classified as onomatopoetic, in that they are meant to imitate the sound that they represent. In English, some onomatopoetic words are “drip”, “splash”, “bang”, and “beep”. Similarly, the words we attribute to animal noises are, in a way, onomatopoetic, and many of these words sound similar across languages. In these cases, there is an iconic relationship between form and meaning. A non-iconic but systematic relationship between form and meaning exists among sounds identified as phonaesthemes. Studied at length by Margaret Magnus (2000), phonaesthemes are sounds that
recur in words that fall within the same semantic field. For instance, the words “glimmer”, “glow”, and “gleam” all begin with the [gl] phonetic cluster, and semantically, they all appeal to the notion of light. Statistical analysis has shown that this cluster, like other such clusters, appears significantly more often in words having to do with light or vision than in words that do not. Furthermore, a study by Benjamin Bergen (2004) has shown that these groupings are a part of the linguistic repertoire that can be accessed productively by a native speaker. A speaker will be likely to use such a salient cluster in defining a meaning for a new word or in creating a new word for a given meaning.

Given these deviations from Saussure’s hypothesis, what can we say about the arbitrariness of the relationship between form and meaning over the whole lexicon? Do the effects of onomatopoeia and phonaesthemes create a significant, perceptible correlation if we look at all words in the lexicon, or are these simply isolated examples of lexically limited phenomena? This study shows that there is indeed a systematic relationship between form and meaning over a subset of the Arabic lexicon, specifically a set of one thousand highly frequent, non-morphologically related words. Pairs of words are compared for both semantic and phonetic similarity, in a manner similar to that pioneered in Shillcock et al. (forthcoming). The process for determining semantic similarities is described in section 2, and for phonetic similarities in section 3. Section 4 shows the results of the comparison and includes a discussion of the possible psycholinguistic correlates to this research.

2. Determining Semantic Similarity

2.1 Meaning from context

“You shall know a word by the company it keeps” (Firth 1957). In the spirit of this well-known claim, the semantic similarity of two words shall be, in this research, defined by the number of words they have in common among their various contexts.
For example, if one were to search for the words “rain” and “storm” in a corpus of written English text, one would be likely to find the following words nearby each of them: “water”, “flood”, “weather”, “temperature”, “snow”, and “wind”. Because “rain” and “storm” occur in similar contexts, they are considered semantically similar. This definition of semantic similarity
is often used in the natural language processing literature (Landauer 1997, Rohde et al. forthcoming). In order to measure the similarity between two words, one measures the degree of similarity between the cumulative contexts of each of the target words. The “context” can be defined as, for example, several words surrounding the target word, the paragraph in which it is found, or the entire document in which it is found. In this way, words can be divided into groups comprising different semantic fields (Cutting et al. 1992).

This definition of semantic similarity has some limitations. In general, syntax and morphology do not enter into the description. One could initially tag words for part of speech or some other feature, but this will not naturally fall out of the semantic groupings. Along the same lines, the rating of similarity between two words will say nothing about the actual relationship between the words, i.e., whether they are synonyms or antonyms, have a part-whole relationship, etc. Furthermore, the semantic fields of the words will not be determined automatically. However, if the particular algorithm in use has worked correctly, a human should be able to categorize any of the automatically grouped words into a semantic field fairly easily. These limitations do not affect the outcome of this research, as I am currently only interested in the semantic similarities between pairs of words, not their relationship or semantic field.

2.2 Corpus analysis

The first step in determining semantic similarity is to obtain a corpus. I have used in this study the Linguistic Data Consortium’s distribution of the Arabic Gigaword Corpus, a resource comprising four Arabic newspapers. I make use of only one of these newspapers, specifically Agence France Presse, which includes over 100 million words.
The newspaper is written in Modern Standard Arabic, and the documentation states that there may be some regional dialect disparity due to the international scope of the services. The next step is to process the corpus to extract only the information that is needed, namely, each word and the position in which it occurs. To do this, I use the Buckwalter Morphological Analyzer, henceforth BMA (Buckwalter 2002). The job of the BMA is to produce a complete morphological analysis for each non-vocalized Arabic word shown to it. It does this by first splitting the orthographical form into
every numerically possible combination of prefix, stem, and suffix, with different vocalizations. By numerical, I mean that a prefix may have from 0 to 4 letters, a stem from 1 to infinite letters, etc. Each combination of morphemes that results in an actual Arabic word, as determined by combinations dictionaries included in the BMA, is output as a possible solution to the given input word. A solution includes vocalization, morphological decomposition, part of speech, gloss, and lemma. The BMA may produce many solutions for a given input. No formal part of speech analysis is done, i.e., there is no analysis of the syntax of the sentence; therefore, there is no ranking of the given solutions. Rather than introduce the additional step of part of speech tagging, I assume a priori that each of the BMA solutions has equal probability of being the correct solution. Thus, using BMA, I collect all possible solutions for each position in the corpus, recording both the numerical corpus position and the probability of each solution being the correct one.

At this point, the target and context lists are produced. The target list includes those words that will be analyzed for phonetic and semantic similarity, and the context list contains the words that will define the semantic relatedness of the target words. Only nouns, verbs, adjectives, and adverbs are included in each list. Furthermore, all words in the lists are derived from a different root, as defined by the BMA, which includes root information in the stem dictionary. In Arabic, if two words are derived from the same root, then they are morphologically related and thus automatically have similarities in both form and meaning. The goal of the study, then, must be to see if the form-meaning relationship exists between words derived from different roots.[1] The target list comprises the one thousand most frequent words in the corpus, taking into account the probabilities discussed in the previous paragraph.
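The probability-weighted frequency count behind the target list can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the function names, the toy input, and the representation of each corpus position as a list of candidate lemmas are all hypothetical, and the part-of-speech and distinct-root filters described above are left out.

```python
from collections import defaultdict


def weighted_lemma_counts(analysed_corpus):
    """Count lemmas when every analysis of a token is equally likely.

    analysed_corpus: iterable of corpus positions, each a list of the
    candidate lemmas proposed for that position (hypothetical format).
    A position with k candidates contributes 1/k to each of them.
    """
    counts = defaultdict(float)
    for solutions in analysed_corpus:
        if not solutions:
            continue  # no analysis found for this token
        weight = 1.0 / len(solutions)
        for lemma in solutions:
            counts[lemma] += weight
    return dict(counts)


def top_k(counts, k):
    """Return the k most frequent lemmas; ties are broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [lemma for lemma, _ in ranked[:k]]


# Toy input: four corpus positions with one or two candidate lemmas each.
toy_corpus = [["kitAb", "kut~Ab"], ["kitAb"], ["qalam"], ["qalam", "qal~am"]]
toy_counts = weighted_lemma_counts(toy_corpus)
toy_targets = top_k(toy_counts, 2)
```

With a real analyzed corpus, the same ranking would be cut at one thousand entries for the target list.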
[1] Note that this study does not directly assess the relationship of form and meaning between roots per se, as the distribution of one word derived from a specific root will not reflect the total distribution of all words derived from that root.

The context list includes the next six hundred most frequent words in the corpus. A series of empty semantic vectors are created, such that each vector represents a target word defined by the same six hundred context words. This may be visualized as a matrix in which there are one thousand rows, one for each target word, and six hundred columns,
each representing a context word. If context word i is found within the vicinity of target word j, then the cell in row j, column i is incremented by one.

To fill in these semantic vectors, I step through the corpus in its morphologically analyzed form to look for target words and corresponding context words. Each time a target word is found (in the exact form in which it is found in the target list), I look at a window of ten words before and ten words after that target word. Within this window, I look for any instance of a context word and increment the target word’s semantic vector accordingly. A window of ten words on each side of the target word is used for two reasons: 1) this method is simpler, practically and intuitively, than using a window consisting of the entire document, and 2) this method is the one used in the research that this particular project is aiming to reproduce, and thus it is desirable that the results be as comparable as possible. As the process progresses, the one thousand semantic vectors are simultaneously filled in with values reflecting occurrences of the same six hundred context words.

When this process has been completed, two calculations take place: the first is a smoothing calculation, the second a spatial comparison of the vectors. Logarithmic smoothing of the context word counts is performed in order to emphasize the first few appearances of a context word over the repeated appearance of the same word. In other words, the first appearance of a context word in the vicinity of a target word is more important than any subsequent appearance. A smoothing calculation of 1+log(count) accomplishes this.

2.3 Vector analysis

Vector comparison is done through the calculation of cosine distance, as is standard in the computational literature for comparing the similarity of numeric vectors. The formula for the cosine of two vectors is:

\[
\cos(\vec{x}, \vec{y}) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
\]
(Manning & Schütze 1999)

where x and y are vectors, n is the number of elements in the vector, and x_i and y_i are elements in x and y, respectively. In order to interpret
the resulting figure as a distance measure, I subtract the result from 1. Thus, two vectors with a distance of 1 are considered maximally far apart (semantically dissimilar), whereas two vectors with a distance close to 0 are considered close together (semantically similar). Each of the one thousand semantic vectors is compared to every other vector, resulting in one million distance measures. Since distance(A,B) = distance(B,A), half of the measures are discarded. Also, we know that distance(A,A) = 0, and since including one thousand measures of 0 in the form-meaning comparison would skew the results, these comparisons are also left out. I am thus left with 499,500 semantic distance measures between the one thousand target words, ranging from just above 0 for pairs of very similar words to about 0.6 for pairs of semantically dissimilar words.

3. Determining Phonetic Similarity

3.1 Feature analysis

The phonetic comparison of the target words is undertaken through a phoneme-by-phoneme comparison, which, in Arabic, basically amounts to a letter-by-letter comparison. The phonetic distances between isolated Arabic sounds are calculated, and then the words are compared as strings of phonemes using Levenshtein distance.

The letters of the Arabic script are essentially phonemic; that is, each letter stands for exactly one sound. The one major exception to this statement is the process of pharyngealization, through which pharyngeal, or emphatic, consonants tend to cause certain other consonants or vowels within the same word to also be pronounced as pharyngeal. I have not as yet taken this process into account; however, planned future work on this research most certainly will.

Each phoneme is characterized by a 28-feature phonetic vector. The features in each vector are typical descriptive features such as consonantal, voiced, strident, lateral, etc.
The initial values are taken from Bruce Hayes’ FeaturePad, which includes values for each feature for every English phoneme. Extra vectors were added to cover the Arabic phoneme set, as sounds such as the emphatic consonants were not included in the initial set. The values for these sounds were collected by gathering pronunciation information from a study by El-Imam (2001), descriptive information from the IPA chart, and articulatory information from an elementary Arabic textbook (Brustad 2001), such that I
was able to encode the features of all of the Arabic sounds in a manner analogous to the sounds already encoded. As such, each phonetic vector has 28 features, each with a value of -1, 0 (feature does not apply), or 1, and no two vectors are alike. With the information encoded in this manner, the phonetic distance between the vectors can be calculated using Manhattan distance, as suggested by Nerbonne & Heeringa (1997). Two sounds with very different feature values, such as /a/ and /z/, have a large distance between them, whereas similar phonemes such as /m/ and /n/ have short distances separating them. These distances come into play in the word-to-word distances, the calculation of which is described next.

A second reference for describing Arabic phonemes in terms of features originates from the work of Sami Boudelaa and his colleagues. In this case, each phoneme is described by only 15 features, each feature having a value of either 1 or 0. The features used are similar to those in the set described above; however, the smaller number of features leads to less redundancy in the data, which is desirable. Manhattan distances for the phonemes described in this manner were also calculated, and I will describe the results based on each of these phoneme description sets separately in section 4.1.

3.2 Word comparison

The phonetic makeup of a word is defined by the individual phonemes that comprise that word. Thus, we can refer to a word as a phoneme string. Phoneme strings can be compared for similarities by simply comparing how many phonemes they have in common. A more sophisticated comparison takes into account the positions in which the phonemes are found, as well as differing degrees of similarity between the phonemes themselves. Compare, for instance, the following pairs of words:

a. plus - plush
b. corn - born
c. plug - gulp
The words in pair (a) are very similar; they differ by only one sound, /s/ vs. /ʃ/, and in fact, these two sounds are themselves very similar, so very little phonetic difference exists between the two words. Pair (b) might be
considered slightly less similar than the previous pair. While they differ by only one phoneme, the difference between /k/ and /b/ is greater than the difference between /s/ and /ʃ/, at least in terms of the features discussed above, and perhaps intuitively as well. The third pair of words shares all of the same phonemes; however, since they are arranged differently, the phonetic similarity between the two phoneme strings is reduced.

An algorithm known as Levenshtein distance is used to calculate the distances between phoneme strings by taking into account the phonemes present in each string, the order in which those phonemes appear, and the relative phonetic distances between the phonemes themselves, as calculated above (Sankoff 1999). The algorithm works by transforming a “source” word into an “object” word through a series of insertions, deletions, and substitutions. Each of these operations changes exactly one phoneme of the source word to make it more like the object word. For instance, we can transform the source word “pant” into the object word “arts” by performing one of each operation:

p a n t
p a n t s      Insertion of s
  a n t s      Deletion of p
  a r t s      Substitution of n by r
Table 1.
Furthermore, a cost is assigned to each of these operations, so that the overall transformation can be compared to other transformations. Most importantly, the cost of a substitution is equal to the distance determined by the feature comparison previously described. Thus, changing an /a/ to a /z/ will cost more than changing an /m/ to an /n/. Insertions and deletions all have the same cost, and the only requirement on that cost is that a deletion plus an insertion must always cost more than a substitution, so that a substitution will be preferred by the algorithm. In terms of phoneme order, the Levenshtein distance algorithm is exhaustive: it calculates the cost of transformation for every alignment of the two words (without rearranging phonemes within the word). As such, the cost of each of the alignments in Table 2 is calculated:
Table 2. Alignments of “pant” and “arts” for which a transformation cost is calculated.
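The cost scheme just described can be sketched in Python. This is a minimal illustration and not the implementation used in the study: it uses the standard dynamic-programming formulation of Levenshtein distance, which finds the cheapest alignment without listing every alignment explicitly, and the substitution costs, the fixed insertion/deletion cost of 2.5 and all names (TOY_DISTANCE, toy_sub_cost, weighted_levenshtein) are assumptions chosen only so that an insertion plus a deletion costs more than any substitution.

```python
from typing import Callable, List, Sequence

# Hypothetical phoneme-to-phoneme distances standing in for the
# feature-based Manhattan distances; any other distinct pair costs 4.0.
TOY_DISTANCE = {
    frozenset(("s", "ʃ")): 1.0,
    frozenset(("k", "b")): 3.0,
}


def toy_sub_cost(p: str, q: str) -> float:
    """Substitution cost: 0 for identical phonemes, else a toy distance."""
    if p == q:
        return 0.0
    return TOY_DISTANCE.get(frozenset((p, q)), 4.0)


def weighted_levenshtein(
    source: Sequence[str],
    target: Sequence[str],
    sub_cost: Callable[[str, str], float] = toy_sub_cost,
    indel_cost: float = 2.5,  # assumed: 2 * 2.5 exceeds every substitution cost
) -> float:
    """Cost of the cheapest series of insertions, deletions and substitutions."""
    rows, cols = len(source) + 1, len(target) + 1
    dist: List[List[float]] = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost          # delete every source phoneme
    for j in range(1, cols):
        dist[0][j] = j * indel_cost          # insert every target phoneme
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,                              # deletion
                dist[i][j - 1] + indel_cost,                              # insertion
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),  # substitution
            )
    return dist[-1][-1]


if __name__ == "__main__":
    plus, plush = ["p", "l", "ʌ", "s"], ["p", "l", "ʌ", "ʃ"]
    corn, born = ["k", "ɔ", "r", "n"], ["b", "ɔ", "r", "n"]
    plug, gulp = ["p", "l", "ʌ", "g"], ["g", "ʌ", "l", "p"]
    for src, tgt in ((plus, plush), (corn, born), (plug, gulp)):
        print("".join(src), "".join(tgt), weighted_levenshtein(src, tgt))
```

Under these toy costs the three example pairs are ranked in the order the text describes: plus/plush is cheapest, corn/born costs more, and plug/gulp costs most because no reordering of phonemes is available.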
The least-cost alignment is always chosen as representative of the phonetic distance between the pair. Thus, the phonetic distance between the words in pair (a) above will be equal to the cost of substituting /s/ with /ʃ/, as this is the only operation necessary to transform “plus” into “plush”. Similarly, only one substitution is necessary to change “corn” into “born”, but since the distance between /k/ and /b/ is greater than the distance between /s/ and /ʃ/, the cost of the transformation of pair (b) will be greater than that of pair (a). As for pair (c), the following operations will take place to complete the transformation:

p l u g
  l u g        Deletion of p
  g u g        Substitution of l by g
  g u l        Substitution of g by l
  g u l p      Insertion of p
Table 3.
Even though the two words contain the same phonemes, the words will be considered phonetically more distant than the previous two
pairs. Levenshtein distance does not allow the phonemes within the word to be transposed, so each of these steps is necessary to transform “plug” into “gulp”. The cost of transformation of pair (c) will be greater than that of pair (a) or (b) due to the additional operations.

In this way, every pair of words in the target list is compared phonetically. For the same 499,000 pairs compared semantically in section 2, a phonetic distance score between 0 and 124.5 is calculated. Again, a word is not compared to itself, so the lowest score is just above 0 for two words that are maximally similar and about 124.5 for two words that are phonetically maximally distant.2

4. Results and Discussion

4.1 Arabic is systematic

Having calculated the semantic and phonetic similarities between the same group of high frequency Arabic words, the final step is to calculate how these two measures compare. As can be seen in Figure 1, there is a positive correlation between the two measures. That is, as the semantic similarities between words increase, so does the phonetic similarity. Alternatively, as phonetic similarities increase, so do semantic similarities. Note that the correlation between the two measures is small; this is expected. A large correlation might lead us to believe that there existed something very close to a sound-meaning correspondence as described in the introduction. As it stands, we find only a small correlation, but a significant one.

For a data set this large, the test for statistical significance falls outside the efficacy of the χ-squared or other typical significance tests. Instead, I use a method developed by Shillcock et al. (forthcoming) and described in Cohen (1995) as appropriate for this type of data. I begin with my original matrix of semantic distance scores, one score for each pair of words. I then take my phonetic distance matrix and randomize the scores representing the phonetic distance between pairs.
Next, I calculate the correlation between the original semantic matrix and the randomized phonetic matrix. After completing this calculation a number of times with differently randomized phonetic matrices, I create a curve representing the likely outcomes of these randomized experiments. Placing the actual experimental results on this curve shows that the reported correlation of the actual data is statistically significant, in that it is a very distant outlier on the curve of randomized results. The correlation coefficient of 0.1418 is more than 100 standard deviations away from the average randomized score.

2 This figure is not a predetermined maximum; it is simply the largest distance between two words in the current data set.
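The randomization procedure described above can be sketched as follows. This is an illustrative Python sketch under several assumptions, not the code used for the reported results: the two distance matrices are flattened into parallel lists of pair scores, the number of shuffles (left unspecified in the text) is a free parameter, and the function names pearson and permutation_z_score are hypothetical.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_z_score(
    semantic: Sequence[float],
    phonetic: Sequence[float],
    n_shuffles: int = 1000,   # assumed value; the text says only "a number of times"
    seed: int = 0,
) -> Tuple[float, float]:
    """Return the observed correlation and its distance, in standard
    deviations, from the mean correlation obtained with shuffled scores."""
    rng = random.Random(seed)
    observed = pearson(semantic, phonetic)
    shuffled: List[float] = list(phonetic)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)                      # break the pairing
        null_scores.append(pearson(semantic, shuffled))
    mean = statistics.fmean(null_scores)
    spread = statistics.stdev(null_scores)
    return observed, (observed - mean) / spread


if __name__ == "__main__":
    # Synthetic scores, for illustration only.
    sem = [float(i) for i in range(50)]
    pho = [2.0 * s + 1.0 for s in sem]
    print(permutation_z_score(sem, pho))
```

The second value returned plays the role of the "standard deviations away from the average randomized score" figure quoted in the text; the sketch does not guard against constant score lists, for which the correlation is undefined.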
Figure 1. Semantic Distances related to Phonetic Distances based on a 28-phonetic feature paradigm. 1000 random data points. Slope of correlation = 18.85, coefficient of correlation = 0.1418
A second test shows that the relationship between form and meaning remains when the phonetic similarities are judged using less redundant feature specification data. The results in Figure 2 were obtained by testing the same semantic similarity data against the phonetic similarity data derived from the information in Boudelaa’s phonetic feature matrix. Again, the correlation coefficient is small but statistically significant.
4.2 Is systematicity specific to Arabic?

Thus, the hypothesis that a form-meaning correlation exists has been proven correct. Is this phenomenon specific to Arabic? In fact, similar results were found by researchers conducting a similar study in English. However, some may say that Arabic is predisposed to such systematicity on a morphological level. This study ruled out the most well-known kind of morphological similarities in Arabic, those being the similarities that arise in words derived from the same root. However, a hypothesis espoused by Georges Bohas (1997) states that every Arabic root is itself derived from a more abstract form, what he calls an étymon. An étymon is an unordered biconsonantal pair that comprises some semantic field, such that any three- or four-letter root derived from this étymon has meanings associated with that semantic field. If this is true, and Bohas makes a very convincing case, then the form-meaning correlation in Arabic is indeed built into its morphology.
Figure 2. Semantic Distances related to Phonetic Distances based on a 15-phonetic feature paradigm. 1000 random data points. Slope of correlation = 6.1768, coefficient of correlation = 0.1419
4.3 Psycholinguistic correlates

Aside from morphological interest, what does this correlation, or systematicity, mean? One main point of interest is whether this systematicity is actually encoded in the brain. If we could test for this systematicity in the mental lexicon, would we find that it is a productive aspect of natural language? If so, what else might we find?

Sound-meaning systematicity, if it exists in the mental lexicon, may aid in retrieval of two kinds. First, upon hearing a new or rarely used word, a listener may be able to utilize the general form of the word to relate it to other, more familiar, words that fall into the same form-meaning grouping, if one exists. This might let the hearer interpret the unknown word’s general connotation, if not its denotation. I hypothesize that this would be a fallback retrieval method for words which the hearer cannot analyze successfully using morphology, etymology, or context. Or, perhaps systematicity works along with, rather than subsequent to, these analyses. Systematicity may also be an aid in the opposite type of retrieval; in the “tip-of-the-tongue” phenomenon, a speaker has a particular meaning in mind but cannot come up with the appropriate word right away, even if it is a word that the speaker knows and has previously used. In this case, the speaker might appeal to systematicity to retrieve words that are akin to the sought-after word in meaning and also in form, thereby leading to a simultaneous phonological and semantic priming (e.g., Boudelaa 2004).

Along these lines, one might think about systematicity as an aid to first language learning. A child may appeal to form-meaning correspondences when learning names for new objects in her world or new concepts. If these hypotheses are correct, then systematicity must be a built-in characteristic of language, one that exists to aid the learner, the hearer, and the speaker.
We would then expect to find a very similar form-meaning correlation existing cross-linguistically. Or, perhaps the amount of systematicity within a given language will be in direct correlation with learners’ perceived degree of ease of learning it as a second language; that is, languages with a high degree of form-meaning correlation should be easier to learn than those without this correlation. Also, if systematicity is an aid for retrieving the form or meaning of rarely used words, we might expect to find a higher degree of correlation among these words. Indeed, this was the effect found in the similar
systematicity study conducted using an English corpus and phonological information.
5. Further Work

Before testing this theory on other languages, I plan to pursue this line of research more closely in Arabic. My most immediate step will be to run this same study, searching for the least common words in the corpus. The lack of overlap in context words among this new set of target words is an empirical problem to be overcome; hence it has not been included in this initial study. I am also interested in seeing if a significant difference in systematicity exists between derived and non-derived nouns in Arabic, as they have a different distribution in the lexicon. I predict that non-derived nouns, which hypothetically occur less frequently overall (due to the fact that derived nouns also occur in verb form), will show more systematicity.

Some planned changes to the research itself include using a different morphological analysis in the semantic comparisons, such that there is less uncertainty in predicting what is the correct full form and analysis of each word presented in the newspaper text. This may be done through part of speech tagging to rank each of the choices delivered by the BMA, or by using a different tool entirely, such as Mona Diab’s Support Vector Machine analyzer (2004). Furthermore, I intend to experiment with other methods of phonetic analysis, including acoustic analyses of recorded data. Hopefully, these changes will refine the results so that they will be more reliable when comparing to data from other languages.
REFERENCES

Arabic Gigaword. Linguistic Data Consortium, University of Pennsylvania, 2003. LDC Catalog No.: LDC2003T12.
Bergen, Benjamin K. 2004. “The Psychological Reality of Phonaesthemes”. Language 80:2.290-311.
Bohas, Georges. 1997. Matrices, étymons, racines, éléments d’une théorie lexicologique du vocabulaire arabe. Louvain: Paris.
Boudelaa, Sami & William D. Marslen-Wilson. 2004. “Abstract Morphemes and Lexical Representation: The CV-skeleton in Arabic”. Cognition 92.271-303.
Brustad, Kristen, Mahmoud Al-Batal & Abbas Al-Tonsi. 1995. Alif-Baa: Introduction to Arabic letters and sounds. Washington, D.C.: Georgetown University Press.
Buckwalter, Tim. 2002. Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium, University of Pennsylvania, 2002. LDC Catalog No.: LDC2002L49.
Cohen, Paul R. 1995. “Empirical Methods for Artificial Intelligence”. Cambridge, Mass.: MIT Press.
Cutting, Douglass, David Karger, Jan Pederson, & John W. Tukey. 1992. “Scatter/Gather: A cluster-based approach to browsing large document collections”. Proceedings of the 15th Annual ACM/SIGIR Conference: Copenhagen.
Diab, Mona. 2004. “Relieving the Data Acquisition Bottleneck for Word Sense Disambiguation”. Proceedings of ACL 2004.
El-Imam, Yousif A. 2001. “Synthesis of Arabic from Short Sound Clusters”. Computer Speech and Language 15.355-380.
Firth, John Rupert. 1957. “Modes of Meaning”. Papers in Linguistics 1934-1951. Oxford: Oxford University Press. 190-215.
Hayes, Bruce. 2004. FeaturePad. Downloaded from http://www.linguistics.ucla.edu/people/hayes/120a/FeaturePad.htm, Feb 3 2005.
Landauer, Thomas K. & Susan T. Dumais. 1997. “A Solution to Plato’s Problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge”. Psychological Review 104:2.211-240.
Magnus, Margaret. 2000. What’s in a Word? Evidence for phonosemantics. Ph.D. dissertation, University of Trondheim.
Manning, Christopher D. & Hinrich Schütze. 1999. “Foundations of Statistical Natural Language Processing”. Cambridge: Massachusetts Institute of Technology.
Nerbonne, John & Wilbert Heeringa. 1997. “Measuring Dialect Distance Phonetically”. Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON-97).
Rohde, Douglas L. T., Laura M. Gonnerman, & David C. Plaut. Forthcoming. “An Improved Method for Deriving Word Meaning from Lexical Co-Occurrence”.
Sankoff, David & Joseph Kruskal. 1999. Time Warps, String Edits, and Macromolecules: The theory and practice of sequence comparison. CSLI Publications.
Saussure, Ferdinand de. 1916. Course in General Linguistics. Charles Bally & Albert Sechehaye, eds. Wade Baskin trans. New York: Philosophical Library.
Shillcock, Richard, Simon Kirby, Scott MacDonald & Chris Brew. Forthcoming. “Systematicity in the Mental Lexicon”.
ARABIC PAPPI A PRINCIPLES AND PARAMETERS PARSER
Sandiway Fong1 University of Arizona
1. Introduction

PAPPI is a freely-available and extensible multilingual parsing engine in the Principles-and-Parameters (P&P) framework (Chomsky 1981).2 At the heart of PAPPI is a core engine written in Prolog, a Horn clause logic-based programming language originally designed for natural language processing (Colmerauer et al. 1973), containing both a module for the recovery of phrase structure and a set of linguistically-motivated primitives for the expression of structural constraints imposed by linguistic theory. The basic system described in Fong (1991) implements classic Government and Binding (GB) theory for English (Lasnik & Uriagereka 1988). Following the central working hypothesis in the P&P framework, i.e. that there is a core theory or set of basic principles broadly applicable across languages, the system is designed to operate with a single parsing engine implementing GB theory, while at the same time accommodating implementations of different languages under systematic parameterization. More concretely, in addition to English, PAPPI implementations exist for SVO languages like Chinese (Lin 1997) and French, V2 (verb-second) languages like Dutch, and SOV (head-final) languages such as Turkish (Birtürk 1998) and Japanese (Fong 2001). Although each language extends PAPPI in a different direction, all of the implementations share a common core theory. In software engineering terms, this advantage of the P&P approach results in a reduction in the cost and time required for initial parser construction (provided, of course, that the theory being implemented for a particular language is broadly compatible with extant implementations). The reuse of core linguistic principles across languages is not only an efficient use of software resources, but also serves to reinforce or confirm the theoretical status of the principles as being broadly applicable across languages.

The goal of this paper is to continue this theme of re-use, and show how PAPPI can be adapted to handle aspects of Arabic syntax unexplored in any other PAPPI language implementation to date. In particular, this paper highlights the implementation of VSO/SVO word order and concomitant verbal agreement phenomena in Arabic.

1 The author would like to thank Abdelkader Fassi Fehri of IERA (Mohammed V University, Rabat, Morocco) for supplying the original impetus for the Arabic-specific implementation described here.
2 PAPPI is available for the MacOS X and Linux software platforms at http://dingo.sbs.arizona.edu/~sandiway/pappi/.

2. PAPPI Architecture

The PAPPI parser is organized around a series of four software layers shown in Figure 1 below.

[Figure 1 labels: Lexicon, Parameters, Periphery; PS Rules, Principles; Programming Language; Compilation Stage; LR(1) Type Chain Tree Inf.]
Figure 1. PAPPI Architecture
2.1 The language-particular level

At the topmost level, a lexicon, parameter settings and a periphery file must be defined for each language. In this section, we will review
the lexicon and parameter settings used for Arabic. The periphery file holds language-particular rules not directly derivable from core principles. The AGR criterion, to be discussed later in this paper, is an example of a rule or principle specific to the Arabic implementation and not already present in the core system.

2.1.1 The lexicon

The lexicon is a list of the ‘words’ recognized by the parser together with category and syntactic features that together drive phrase structure recovery and principle applicability. Depending on the size and complexity of the particular implementation, words may be morphologically decomposed via simple concatenative rules also stated in the lexicon, e.g. see the implementation of Turkish morphology in Birtürk (1998).3 For example, (1), adapted from Fassi Fehri (1993:2.5), has the PAPPI lexical entries given in (2).

(1) kataba r-rajul-u r-risaalat-a
    wrote the-man-NOM the-letter-ACC
    “the man wrote the letter”

(2) a. lex(kataba, v, [morph(write,past(+)),grid([agent],[theme])]).
    b. lex(r, mrkr, [right(n,[],[prefix(r),def(+)])]).
    c. lex(rajul, n, [a(-),p(-),agr([3,sg,m]),case(_),theta(_)]).
    d. lex(u, mrkr, [left(n,[],[suffix(nom),morphC(nom)])]).
    e. lex(risaalat, n, [a(-),p(-),agr([3,sg,n]),case(_),theta(_)]).
    f. lex(a, mrkr, [left(n,[],[suffix(acc),morphC(acc)])]).
Each entry representing a word form or morpheme is defined using the Prolog predicate lex of arity 3, i.e. lex takes 3 arguments (written as lex/3), with template given in (3).

(3) lex(W,C,L).
    W = word or morpheme
    C = category label
    L = list of features (comma-separated list delimited by square brackets)

3 Currently, there is no template-style morphology implementation, a feature that would be useful in describing Semitic languages in PAPPI. An external morphological analyzer can also be interfaced to PAPPI.
The verb form kataba in (2a) has the feature morph(write,past(+)), indicating that it is the past tense form of write in Arabic, and also the feature grid([agent],[theme]), indicating to the parser that it has a thematic grid with external role agent and an internal role theme that must be instantiated in syntactic structure. The thematic grid will also drive theta role assignment in the parser.

Nouns rajul and risaalat, in (2c) and (2e), respectively, each have Binding theory features a(-) and p(-) indicating that they are non-anaphoric and non-pronominal. The values of these two features inform the parser that (common) nouns like rajul and risaalat are not subject to Binding conditions A and B. (However, [a(-),p(-)] marks them as referential expressions, and hence they are subject to Binding condition C.) The feature agr([P,N,G]) encodes person (P), number (N) and gender (G) agreement features for the two nouns. Finally, the case(_) and theta(_) features are initially unvalued, as indicated by the Prolog anonymous variable underscore “_”. In the course of parsing, these slots will be instantiated by parser operations implementing Case and Theta principles.

The lexical entries for the morphemes r-, -u and -a are given in (2b), (2d) and (2f), respectively. These morphemes are encoded as “markers”, indicated by the distinguished category label mrkr. PAPPI markers are lexical entries that do not project syntactic structure; instead, they are resolved as syntactic features attached to adjacent lexical items. For example, the definite determiner prefix r- has feature right(n,[],[prefix(r),def(+)]). The feature right/3 has template (4).

(4) right(C,L1,L2)
    C = category to be matched to the right
    L1 = list of match features (Note: [] represents the empty list, i.e. nothing to match)
    L2 = list of add features
This means that the morpheme r- does not project in this implementation; it simply finds the noun to its right, adds features prefix(r) and def(+), and disappears. In the case of (1), rajul will be the recipient of these two features. The Case markers -u and -a, given in (2d) and (2f), operate in similar fashion to r-, i.e. they do not project; instead they mark the
noun to their left with syntactic features. In particular, the noun to the left will receive a morphC (morphological Case) feature instantiated to values nom (nominative) and acc (accusative), respectively. Case markers are already present in the PAPPI system for of-insertion in English, and the Japanese case particle system. For Arabic, we simply make use of the pre-existing code. Case theory will check the value of morphC against the value assigned to the abstract case(_) feature during parsing.

2.1.2 The parameters

The core theory includes modules dealing with phrase structure, head and phrasal movement, Case, Theta and Binding theory. With respect to word order, Arabic is implemented as a SVO language; in X-bar theoretic terms, phrases are specified as uniformly head-initial and specifier-initial, as with English. (We will describe how VSO word order is implemented in this framework in section 2.2.) For head movement in the extended verbal complex, Arabic is specified as having ‘strong’ agreement (unlike English) in the sense of Pollock (1989). Other parameters in the system govern the introduction and licensing of empty categories, e.g. whether empty expletives are allowed and the binary-valued pro-drop parameter; also, by means of a wh-in-syntax parameter, whether wh-movement is overt or covert.

2.2 The core theory level

As Figure 1 indicates, below the language-particular layer, i.e. the lexicon, parameter settings and periphery file, lies the core set of principles common to all implementations. This core level is logically divided into a phrase structure component, based on X-bar syntax and the effects of (overt) movement, and a set of principles or constraints implementing various modules of GB theory. In this section, we briefly outline these two components and their parameterization with respect to the Arabic implementation.
2.2.1 X-bar syntax

The parser recovers X-bar phrase structure, factoring in the effects of overt phrasal and head movement plus word order. The X-bar rule template system shown in (5–6) is part of the core theory common to all implementations.
(5) a. rule XP -> [XB|spec(XB)] ordered specFinal st max(XP), proj(XB,XP).
    b. rule XB -> [X|compl(X)] ordered headInitial(X) st bar(XB), proj(X,XB), head(X).
(6) a. head(n). head(v). head(a). head(p). head(i). head(c). head(neg).
    b. bar(n1). bar(v1). bar(a1). bar(p1). bar(i1). bar(c1). bar(neg1). bar(v2).
    c. max(np). max(vp). max(ap). max(pp). max(ip). max(cp). max(negp).
    d. proj(n,n1). proj(v,v1). proj(a,a1). etc.
    e. proj(n1,np). proj(v1,vp). proj(a1,ap). etc.
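The effect of the two ordering conditions in (5), specFinal and headInitial, can be illustrated outside Prolog with a small Python sketch. It is not PAPPI code: the function names expand_xp and expand_xb are hypothetical, only a few of the proj/2 facts in (6d–e) are reproduced, and the specifier and complement are passed in directly rather than looked up through spec/1 and compl/1.

```python
from typing import List, Optional

# A subset of the proj/2 facts in (6d-e), written as Python dictionaries.
HEAD_TO_BAR = {"n": "n1", "v": "v1", "i": "i1", "c": "c1"}
BAR_TO_MAX = {"n1": "np", "v1": "vp", "i1": "ip", "c1": "cp"}
MAX_TO_BAR = {xp: xb for xb, xp in BAR_TO_MAX.items()}
BAR_TO_HEAD = {xb: x for x, xb in HEAD_TO_BAR.items()}


def expand_xp(xp: str, spec: Optional[str], spec_final: bool) -> List[str]:
    """Daughters of XP as in (5a): XB then the specifier if spec_final
    holds, otherwise the flipped order (specifier then XB)."""
    xb = MAX_TO_BAR[xp]
    if spec is None:              # no specifier for this category
        return [xb]
    return [xb, spec] if spec_final else [spec, xb]


def expand_xb(xb: str, compl: Optional[str], head_initial: bool) -> List[str]:
    """Daughters of XB as in (5b): head then complement if head_initial
    holds, otherwise complement then head."""
    x = BAR_TO_HEAD[xb]
    if compl is None:             # head takes no complement
        return [x]
    return [x, compl] if head_initial else [compl, x]


if __name__ == "__main__":
    # Specifier-initial, head-initial settings (as for Arabic and English).
    print(expand_xp("ip", "np", spec_final=False))
    print(expand_xb("i1", "vp", head_initial=True))
    # A head-final setting, for contrast.
    print(expand_xb("v1", "np", head_initial=False))
```

With the specifier-initial, head-initial settings the specifier precedes the single-bar projection and the head precedes its complement; switching either flag reverses the corresponding pair of daughters.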
(5a) specifies that XP, as instantiated through the category labels defined by max/1 in (6c), expands to the constituent ordering XB followed by spec(XB) provided specFinal is true, i.e. when the language is specifier-final. In the case of Arabic and English parameterization, specFinal is set so that it does not hold, and consequently, XP expands to the reverse or ‘flipped’ order, namely spec(XB) followed by XB. The single-bar category label XB is defined via proj(XB,XP) as instantiated in (6e). spec(XP) is parameterized according to category. For example, spec(i1) can be defined to be np (noun phrase), indicating that inflection has a specifier, i.e. the surface subject position. Similarly, spec(v1) can be specified as null (subjects are first merged at specifier-IP) or np, i.e. first merged at specifier-VP assuming the VP-internal subject hypothesis.

Similar considerations apply to the expansion of single-bar level XB to a head X followed by compl(X) in (5b). The complement relation compl(X), as given in (7), encodes categorical selection in standard PAPPI for the extended verbal projection VP–(NegP)–IP–CP. (This ordering will be revised in section 3.)

(7) compl(i,vp). compl(i,negp). compl(neg,vp). compl(c,ip).

Given the phrase structure definitions in (5)–(7) instantiated for Arabic, and factoring in head movement in the verbal complex (to be described below), we can recover the corresponding Arabic phrase structure for example (1) as shown in Figure 2 below.

2.2.2 Verbal head movement

Note that in Figure 2, the surface word order is VSO. This is achieved from an underlying SVO word order by a combination of the nominative Case-marked subject r-rajul-u being merged in specifier-VP, i.e. we assume the VP-internal subject hypothesis, and the inflected verb kataba raising to head-adjoin to inflection from its underlying position (marked by the trace Vt left by head movement).
Figure 2. Phrase structure for example (1)
PAPPI implements the core Pollock-style verbal head movement system as defined by the rules in (8).4 In the case of Arabic, we can take advantage of the parameter setting agr(strong) already implemented for Romance verbs. Hence, verbs will obligatorily raise to inflection, thereby completing the VSO word order.

(8) a. rule v(V) moves_to i provided agr(strong), finite(V).
    b. rule v(V) moves_to i provided agr(weak), V has_feature aux.
    c. rule i(I) moves_to v(V) provided agr(weak), \+ V has_feature aux.
    d. rule i(I) moves_to v(V) provided agr(strong), \+ finite(V).
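Read together, the four rules in (8) select a direction of head movement from the agreement parameter and two properties of the verb. The following Python sketch restates that decision; it is an illustration only, with a hypothetical function name and return labels, and it is not how PAPPI compiles or applies these rules.

```python
def head_movement(agr: str, finite: bool, aux: bool) -> str:
    """Direction of verbal head movement according to rules (8a-d).

    agr    -- value of the agreement parameter: "strong" or "weak"
    finite -- whether the verb is finite
    aux    -- whether the verb carries the feature aux
    """
    if agr == "strong":
        # (8a): finite verbs raise to I; (8d): otherwise I lowers to V.
        return "V-to-I" if finite else "I-to-V"
    if agr == "weak":
        # (8b): auxiliaries raise to I; (8c): otherwise I lowers to V.
        return "V-to-I" if aux else "I-to-V"
    raise ValueError("agr must be 'strong' or 'weak'")


if __name__ == "__main__":
    # agr(strong) with a finite verb, as for kataba in example (1).
    print(head_movement("strong", finite=True, aux=False))
    # agr(weak) with a non-auxiliary finite verb.
    print(head_movement("weak", finite=True, aux=False))
```

Under agr(strong), the setting adopted for Arabic, every finite verb raises, which is what places the verb ahead of the VP-internal subject.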
The verbal complex may also contain elements such as pronominal clitics. For example, in (9) (= Fassi Fehri 1993:2.6) the 1st person pronominal clitic –tu is attached to the matrix verb 'arad (wanted).

(9) 'arad-tu 'an y-uqaabil-a r-rajul-u l-mudiir-a
    wanted-I that 3-meet-subj the-man-nom the-director-acc
For Arabic, we co-opt the mechanism already defined for the Romance pronominal clitic system: in (10) –tu is specified as a PF clitic marker possessing pronominal Binding features [a(-),p(+)], and agreement features 1st person singular (masculine/feminine).

4 Note: in (8c–d), \+ is the Prolog negation operator, e.g. \+ finite(V) means non-finite V.
(10) a. lex(tu,pf(cl),[a(-),p(+),agr([1,sg,[m,f]])|Fs]) :- subjClFeatures(Fs).
     b. subjClFeatures([adjoin(v,right),morphC(nom)]).
The lexical entry for –tu also includes the general subject clitic properties given in (10b), i.e. it head-adjoins to the right of the verb, adjoin(v,right), and it is marked with the nominative morphological Case feature morphC(nom). The resulting parse for (9) including the verbal complex 'arad-tu is shown in Figure 3 below.5
Figure 3. Phrase Structure for Example (9)
2.2.3 Empty expletive EXPL

In both Figures 2 and 3, subjects remain in their first merge position, i.e. the specifier-VP position, and do not raise to the surface subject position, i.e. specifier-IP. The Extended Projection Principle (EPP) is responsible for licensing the specifier-IP position. Assuming this position is always present in syntactic structure, i.e. PAPPI will always generate an (initially underspecified) empty NP to fill it, core principles must interact in the course of derivation to license and determine its status as one of several possible empty NPs, e.g. PRO, pro or variable. In the Arabic PAPPI implementation, licensing of empty subject NPs is extended to an empty expletive EXPL for VSO word order only via the AGR criterion.

5 Orthographical representation is limited to the ASCII character set in some of the trees shown. In Figure 3, capitalized Q is used to stand in for '.

2.2.4 The principles

The basic PAPPI implementation includes a core set of GB principles broadly applicable across languages including Case and Theta theory, Binding conditions A, B and C, the ECP, LF-movement and QR, plus components of Full Interpretation, grouped into filters (operations that may rule out structures) and generators (operations that may generate additional structural possibilities and lead to “fan out”) as shown in Figure 4.
Figure 4. Filters and Generators
All phrase structure recovered by the generator Parse S-structure, the parser operation that implements X-bar syntax plus the effects of overt movement, must be validated across the other filters and generators listed.
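The division of labour between filters and generators can be pictured with a small Python sketch. It is a schematic illustration under stated assumptions, not PAPPI's architecture: structures are plain dictionaries, each operation maps one structure to zero or more structures, and the two toy operations and all function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Structure = Dict[str, object]
Stage = Callable[[Structure], Iterable[Structure]]


def filter_stage(keep: Callable[[Structure], bool]) -> Stage:
    """Wrap a yes/no test as a stage that passes a structure on or drops it."""
    def stage(s: Structure) -> Iterator[Structure]:
        if keep(s):
            yield s
    return stage


def run_stages(s: Structure, stages: List[Stage]) -> Iterator[Structure]:
    """Push one candidate structure through the remaining stages, depth-first."""
    if not stages:
        yield s
        return
    for out in stages[0](s):
        yield from run_stages(out, stages[1:])


# Hypothetical toy operations, not PAPPI's actual principles.
def toy_case_filter(s: Structure) -> bool:
    """Filter: every listed NP must have a Case value."""
    return all(case is not None for case in s["np_cases"])


def toy_resolve_empty_subject(s: Structure) -> Iterator[Structure]:
    """Generator: one underspecified empty NP fans out into two analyses."""
    for kind in ("pro", "PRO"):
        yield {**s, "empty_subject": kind}


if __name__ == "__main__":
    candidate = {"np_cases": ["nom", "acc"], "empty_subject": None}
    stages = [filter_stage(toy_case_filter), toy_resolve_empty_subject]
    for result in run_stages(candidate, stages):
        print(result)
```

A filter can only reduce the number of surviving structures, whereas a generator may increase it; chaining them in this way means a structure rejected early never reaches the later operations.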
2.3 The programming language level

The PAPPI system is designed with the goal in mind that the grammar and language implementor will create definitions for the language-particular and core theory software layers only. In particular, it is envisaged that principles will be written using a fixed set of linguistically-motivated primitive operations provided by the lower-level programming language layer shown in Figure 1. To illustrate how principles may be implemented using these primitives, consider the code fragments given in (11) and (12) for Case theory.

(11) a. Structural Case is assigned under government (plus Case adjacency)
     b. sCaseAssign IN_ALL_CONFIGURATIONS X WHERE sCaseConfig(X,Case,NP) THEN NP HAS_FEATURE case(Case).
     c. sCaseConfig(CF,Case,XP) :-
            governs(Head,XP,CF),
            caseAssigner(Head,Case),
            ADJACENT(Head,XP,CF) IF caseAdjacency.
     d. caseAssigner(INFL,nom) :-        % I(AGR) assigns nominative Case
            CAT(INFL,i),
            INFL HAS_FEATURE agr(_).
     e. caseAssigner(Verb,acc) :-        % Verbs assign (acc) Case to
            CAT(Verb,v).                 % a direct object
     f. caseAssigner(ECM,obq) :-         % ECM complementizer assigns
            CAT(ECM,c),                  % oblique Case
            ECM HAS_FEATURE ecm.
(12) a. Case Filter: All lexical NPs must receive Case.
     b. caseFilter IN_ALL_CONFIGURATIONS X WHERE lexicalNP(X) THEN assignedCase(X).
     c. assignedCase(X) :- X HAS_FEATURE case(Case), ASSIGNED(Case).
     d. lexicalNP(X) :- CAT(X,np), \+ EC(X).
(11) defines the structural Case assignment operation sCaseAssign.6 The IN_ALL_CONFIGURATIONS primitive in (11b) specifies that every configuration X in a phrase structure tree that happens to satisfy sCaseConfig/3, as defined in (11c), will have an NP whose feature slot case(_) will be instantiated by the value of Case determined by the choice of case assigner as defined in (11d–f).7 For example, Figure 5 below provides a more detailed look at the parse of example (1) (shown previously in Figure 2) illustrating the effect of the structural Case operation. As indicated by the valued feature structures case(nom) and case(acc), r-rajul-u and r-risaalat-a have been assigned abstract nominative and accusative Cases by virtue of their syntactic position, and both NPs will also satisfy the requirements of the Case Filter.8

6 Note that sCaseAssign is parameterized through sCaseConfig/3 by the predicate caseAdjacency, which is set for languages like English.
Figure 5. Constituent Features for (1)
Similarly, (12) defines the operation caseFilter as a universally quantified condition over tree structures that checks to see whether all lexical NPs, as defined by (12d), have had their Case slots filled or not.9

7 The infix predicate HAS_FEATURE/2, and the relations ADJACENT/2 and CAT/2 are all linguistic primitives provided by the programming language layer. HAS_FEATURE and CAT/2 facilitate access to constituent features and category labels, respectively. ADJACENT/2 computes whether two constituents are adjacent in syntactic structure.
8 The parse in Figure 5 also shows the effects of Theta theory. The theta slots for both arguments have also been valued.
9 The primitives ASSIGNED/1 and EC/1 check to see whether a supplied argument is unvalued or contains an empty category, respectively.
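The interplay of (11) and (12) can be approximated outside PAPPI with a minimal Python sketch. It is not PAPPI code: the `Node` class, the flat list of (head, NP, adjacent) triples standing in for the government relation, and the agreement value `"3sg.m"` are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy constituent: a category label plus a feature dictionary."""
    cat: str
    features: dict = field(default_factory=dict)
    empty: bool = False  # True for empty categories (cf. EC/1)


def case_assigner(head):
    """Return the Case a head assigns, or None (cf. (11d-f))."""
    if head.cat == "i" and "agr" in head.features:
        return "nom"
    if head.cat == "v":
        return "acc"
    if head.cat == "c" and head.features.get("ecm"):
        return "obq"
    return None


def s_case_assign(configs, case_adjacency=False):
    """Assign structural Case in every (head, np, adjacent) configuration.

    Each triple is a hypothetical stand-in for governs/3: `head` governs
    `np`, and `adjacent` records whether the two are adjacent.
    """
    for head, np, adjacent in configs:
        case = case_assigner(head)
        if case is None:
            continue
        if case_adjacency and not adjacent:
            continue  # the adjacency requirement applies only if the parameter is set
        np.features["case"] = case


def case_filter(nodes):
    """True if every lexical (non-empty) NP carries a Case value (cf. (12))."""
    return all("case" in n.features
               for n in nodes if n.cat == "np" and not n.empty)


# Toy version of example (1): INFL governs the subject, the verb the object.
infl = Node("i", {"agr": "3sg.m"})  # assumed agreement value
verb = Node("v")
subject = Node("np", {"form": "r-rajul-u"})
obj = Node("np", {"form": "r-risaalat-a"})

s_case_assign([(infl, subject, True), (verb, obj, True)])
print(subject.features["case"], obj.features["case"])
print(case_filter([subject, obj]))
```

An NP that never enters a Case configuration fails `case_filter`, while an empty category is exempt, mirroring the `\+ EC(X)` condition in (12d).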
2.4 The computational mechanism level
The final layer in Figure 1 is the computational mechanism layer. Rules and principles written at the programming language level are automatically compiled down to and mapped onto a variety of different computational mechanisms. For example, the phrase structure component is mapped to a backtracking LR(1)-based parser in PAPPI. Universally and existentially quantified conditions on phrase structure are mapped onto a series of tree-scanning operations. In standard PAPPI, candidate structures are generated one at a time, and principles are evaluated serially, i.e. in generate-and-test fashion. Alternatives to this architectural model are also possible. For example, Fong (1999) describes an implementation of the lowest level for parallel execution across multiple machines. The advantage of software abstraction surfaces as transparency in the sense that none of the principles had to be rewritten or otherwise redefined when retargeted for parallel execution.

3. NegP and VSO
In the (standard) extended verbal projection model implemented in PAPPI, i.e. VP-(NegP)-IP-CP, a verb may raise to inflection via negation and further onto the complementizer (C) position. For example, English allows both (13a) and (13b).

(13) a. Didn’t John leave?
     b. Did John not leave?
This model has to be revised for Arabic clause structure if we are to model the account in Fassi Fehri (1993:26), in which there is no verb movement to C even in the case of yes-no questions, and negation must appear between IP and CP.

(14) 'a-laa y-a'tii zayd-un
     Q not 3-comes Zayd-nom
     “Isn’t Zayd coming?”
This can be accomplished by re-defining (or effectively parameterizing) categorial selection for Arabic X-bar structures only as shown in (15), cf. (7). The extended verbal projection is now morphed into the sequence VP-IP-(NegP)-CP.
(15) compl(i,vp).
     compl(neg,i2).
     compl(c,i2).
     compl(c,negp).
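To see why the facts in (15) yield the sequence VP-IP-(NegP)-CP, the selection relation can be walked bottom-up. The following Python sketch is only an illustration: the `PROJECTS` table mapping each head to the phrase it projects (including the label `c2` for CP) is an assumption, not part of the PAPPI definitions.

```python
# Complement-selection facts from (15): (head, complement category).
COMPL = [("i", "vp"), ("neg", "i2"), ("c", "i2"), ("c", "negp")]

# Assumed mapping from a head to the phrase it projects (hypothetical labels).
PROJECTS = {"i": "i2", "neg": "negp", "c": "c2"}


def projections(bottom="vp", top="c2"):
    """Enumerate every chain of phrases licensed by COMPL from bottom to top."""
    chains = []

    def extend(chain):
        current = chain[-1]
        if current == top:
            chains.append(chain)
            return
        for head, comp in COMPL:
            if comp == current:
                # `head` selects the current phrase, so its projection comes next.
                extend(chain + [PROJECTS[head]])

    extend([bottom])
    return chains


for chain in projections():
    print("-".join(chain))
```

Under these assumptions exactly two chains are licensed, one with NegP between IP and CP and one without it, which is what the optional (NegP) in the text expresses.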
As a result, (14) can now be parsed as shown in Figure 6.
Figure 6. Parse Structure for (14)
4. VSO/SVO Word Order and the AGR Criterion
Fassi Fehri (1993:32) motivates the AGR Criterion through contrasts like those in (16) and (17).

(16) a. daxal-at n-nisaa'-u makaatib-a(-hunna)
        entered-F the-women-NOM office.PL-ACC(-their.F)
        “The women have entered (their) offices”
     b. *daxal-na n-nisaa'-u makaatib-a(-hunna)
        entered-F.PL the-women-NOM office.PL-ACC(-their.F)

(17) a. n-nisaa'-u daxal-na makaatib-a(-hunna)
        the-women-NOM entered-F.PL office.PL-ACC(-their.F)
        “The women have entered (their) offices”
     b. *n-nisaa'-u daxal-at makaatib-a(-hunna)
        the-women-NOM entered-F office.PL-ACC(-their.F)
In (16a) and (17b), the verb exhibits gender agreement only with the subject n-nisaa'-u “the women”, whereas in (16b) and (17a), number agreement is also present. One possible generalization is that rich AGR (gender + number agreement) can license a referential NP in specifier-IP position, whereas poor AGR (gender agreement only) licenses only the empty expletive EXPL (introduced in section 2.2).
The PAPPI specification of the AGR Criterion is given in (18).

(18) a. agrCriterion IN_ALL_CONFIGURATIONS X
          WHERE specIP(X,Spec)
          THEN poorAGRiffEXPL(Spec,X).
     b. specIP(IP,Subject) :- CAT(IP,i2), Subject SPECIFIER_OF IP.
     c. poorAGRiffEXPL(Spec,IP) :-
          IP HAS_FEATURE agr(AGR),
          poorAGR(AGR) IFF expletive(Spec).
     d. expletive(X) :- X HAS_FEATURE nonarg(+).
     e. poorAGR(X) :- unspecifiedForNumber(X).
Along similar lines to the operations defined in section 2.3, agrCriterion is a universally quantified tree predicate that checks all specifier-IP configurations for compliance with the AGR Criterion. In the case where poor AGR holds, only an expletive, defined as possessing the feature nonarg(+), may occupy the surface subject position. Should poorAGR/1 fail to hold, i.e. AGR is specified for number, the IFF operator in (18c) ensures that expletive/1 does not hold for the element occupying the subject position. The AGR Criterion permits examples (16a) and (17a), as shown in Figures 7 and 8, respectively, to pass.
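The biconditional at the heart of (18c) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: agreement and specifier features are modelled as plain dictionaries, and the sample feature values for (16a), (17a) and (17b) are hand-coded rather than produced by a parser.

```python
def poor_agr(agr):
    """Poor AGR: the agreement bundle is unspecified for number (cf. (18e))."""
    return agr.get("number") is None


def is_expletive(spec):
    """An expletive carries the feature nonarg(+) (cf. (18d))."""
    return spec.get("nonarg") is True


def agr_criterion(ip_configs):
    """Check every (agr, spec) pair: poor AGR if and only if the spec is expletive."""
    return all(poor_agr(agr) == is_expletive(spec) for agr, spec in ip_configs)


# Hand-coded feature bundles (assumptions for illustration).
expl = {"nonarg": True}
women = {"nonarg": False, "form": "n-nisaa'-u"}

ok_16a = agr_criterion([({"gender": "f"}, expl)])                    # poor AGR, EXPL
ok_17a = agr_criterion([({"gender": "f", "number": "pl"}, women)])   # rich AGR, NP
bad_17b = agr_criterion([({"gender": "f"}, women)])                  # poor AGR, NP
print(ok_16a, ok_17a, bad_17b)
```

Comparing the two Boolean values with `==` is what makes the check a true IFF: both "poor AGR without an expletive" and "rich AGR with an expletive" are rejected.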
Figure 7. Parse Structure for (16a)
Figure 8. Parse Structure for (17a)
It also correctly blocks (17b), i.e. the case of poor AGR with a nonexpletive in specifier-IP position, as shown in Figure 9.
Figure 9. Parse Structure for (17b)
However, it fails to immediately block (16b), i.e. the case of rich AGR without subject raising to specifier-IP, as shown in Figure 10.
Figure 10. Parse Structure for (16b)
The reason is that PAPPI inserts an (underspecified) null NP at the time of phrase structure recovery for (16b). In particular, the null NP is initially unspecified with respect to expletiveness or referential argumenthood. The AGR Criterion operation checks only for nonexpletiveness, so the null NP is passed onwards to other principles as a possible empty argument. However, the Theta Criterion, which forces available theta roles and arguments into a 1-to-1 pairing, ultimately blocks the empty argument version of (16b) since the empty NP occupies specifier-IP, which is a non-theta position.

5. Conclusions
We have described a computer implementation of Arabic clausal structure compatible with the P&P framework. By re-using grammar components already developed for core grammar, or by building on or re-configuring mechanisms developed specifically for existing languages in the PAPPI system, we have demonstrated how a sample Arabic parser can be quickly produced. We have also shown how language-particular constraints such as the AGR Criterion can be defined in a multilingual parsing system.
REFERENCES
Birtürk, Ayşenur. 1998. A Computational Analysis of Turkish using the Government-Binding Approach. Ph.D. dissertation, Middle East Technical University (METR), Ankara, Turkey.
Chomsky, Noam. 1981. Lectures on Government and Binding. Dordrecht: Foris.
Colmerauer, Alain, H. Kanoui, R. Pasero & P. Roussel. 1973. “Un Système de Communication Homme-Machine en Français”. Report, Artificial Intelligence Group, Université d’Aix-Marseille II.
Fassi Fehri, Abdelkader. 1993. Issues in the Structure of Arabic Clauses and Words. Amsterdam: Kluwer.
Fong, Sandiway. 2001. “Japanese PAPPI”. Researching and Verifying an Advanced Theory of Human Language, Report (5) ed. by Kazuko Inoue, 445–464. Chiba, Japan: Kanda University of International Studies (KUIS).
_____. 1999. “Parallel Principle-Based Parsing”. In Proceedings of the Sixth International Workshop on Natural Language Understanding and Logic Programming (NLULP ’99), 45–57, International Conference on Logic Programming (ICLP), Las Cruces, New Mexico.
_____. 1991. Computational Properties of Principle-Based Grammatical Theories. Ph.D. dissertation, MIT Artificial Intelligence Laboratory, Cambridge.
Lasnik, Howard & Juan Uriagereka. 1988. A Course in GB Syntax: Lectures on binding and empty categories. Cambridge, MA: MIT Press.
Lin, Koong. 1997. Pappi-C: A Chinese Principles-and-Parameters. Ph.D. dissertation, Tsinghua University, Taiwan.
Pollock, Jean-Yves. 1989. “Verb Movement, Universal Grammar, and the Structure of IP”. Linguistic Inquiry 20.365–424.
CORPUS-BASED LINGUISTIC ANALYSES TESTING INTUITIONS ABOUT ARABIC STRUCTURE AND USE
1
Salem Ghazali Institut Supérieur des langues de Tunis
1. Introduction
Evidence from language corpora shows that much of the information provided by dictionaries in general, and Arabic dictionaries in particular, on the meaning and use of words is scanty, sometimes obsolete, and hardly meets the needs of the learner. Following Sinclair (1996, 1998, 1999) and others for English, I will argue on the basis of evidence from Arabic corpora that the meaning or function of a lexical item is largely determined by other words with which it tends to co-occur to a varying degree. Dictionaries do not say much about typical meanings and uses of words as part of a pattern, whether in terms of word forms (collocations) or grammatical classes (colligations), nor do they indicate the semantic preference a word may have or the communicative intent the use of a specific word may imply. As an illustration of the valuable contribution of corpus linguistics to refining and correcting our understanding of lexis and grammar, this paper investigates the use of two Arabic words, one very common, the verb particle qad, and one much less frequent, the verb waʃuka/Ɂawʃaka.

1 I am very grateful to Hafedh Hlila and Ferid Chekili for reading this paper and for the pertinent comments and suggestions they made. Given their interest in syntactic theory, they both proposed that I include syntactic arguments and justifications for a number of points in the paper. I have not, however, gone that far since this paper is intended to be a description of how words and structures are actually used and not an account of theoretical issues underlying that usage.

1.1 The corpus
The corpus from which these words were extracted comprises approximately five million words representing twentieth-century texts such as the complete novel Ɂal-Ɂayyaam by Taha Ḥussein, passages from several other modern novels, essays on philosophy and literature, translations of foreign literary and philosophical texts, doctoral theses in philosophy, linguistics, and phonetics, secondary school books, the Bible, and newspaper articles, as well as texts from the Middle Ages such as kitaabu Ɂal-buxalaaɁ by Ɂal-jaaḥiẓ and other texts from Ɂat-tawḥiidi’s night chats. The available corpus includes around 18 million words, but more than half of them come from newspapers. It was then decided to work with a balanced selection and not include all the journalistic texts in order to reduce bias by having more or less equal input from different types of texts.

2. The Verb Ɂawʃaka
The dictionary entry for the root “w-ʃ-k” includes, among other forms, two main verb forms, waʃuka and Ɂawʃaka, the second being the “muqaaraba” verb form. The Arabic dictionaries lisaan Ɂal-ʕarab Ɂal-muḥiiṭ and Ɂal-muʕjim Ɂal-wasiiṭ give both forms the same meanings: (a) “hurry, speed up”, and (b) “forthcoming, about to happen”, but only occurrences of the second meaning are found in the present corpus. Ɂal-muʕjim Ɂal-wasiiṭ explicitly notes that the muqaaraba verb Ɂawʃaka may also be used in addition to waʃuka, implying, I presume, that the typical form is waʃuka. However, all of the occurrences of the verb in the corpus are word-forms of Ɂawʃaka and not waʃuka. The dictionary also states that the verb is used more frequently in the imperfective, an observation confirmed in this corpus, as the verb occurred 32 times in the imperfective and only 14 times in the perfective. There were also four occurrences of the lexeme as a noun waʃk.
2.1 Collocation and colligation patterns
The two major collocates of the verb Ɂawʃaka are the complementizer Ɂan (32 occurrences) and the preposition ʕala (17 occurrences). The most common pattern (21 occurrences) is one where the verb Ɂawʃaka is followed by a definite lexical NP then by the complementizer Ɂan followed by a verb in the imperfective as in (1v), (2i), (2ii), (2v) and (3iii) below. The NP may of course include a modifier or may be clausal, but typically Ɂan immediately follows the NP. The structural subject NP may also precede the verb Ɂawʃaka (13 occurrences), in which case Ɂan immediately follows the verb as in (1iii), (1vii), and (1x) below. The second most frequent pattern is when Ɂawʃaka is followed by the preposition ʕala. In this pattern (nine occurrences), the verb is typically followed by the preposition ʕala then an NP (1ii), (1viii) (see also 2ii, 3iv). A lexical subject NP may also follow the verb and precede the preposition ʕala (2vi).
(1)
i. δakara raɁiisu-l-ḥukuumati Ɂanna Ɂal-ʒazaaɁira Ɂawʃakat fi sanat 1994 ʕala Ɂal-Ɂiflaas
   “The head of the government mentioned that in 1994 Algeria was on the brink of bankruptcy.”
ii. wa ḥiinan kuntu Ɂuuʃiku ʕala Ɂal-halaak
   “...and sometimes I was about to die”
iii. wa kaana…huʒuumun ʕala Ɂal-muɁallifiina yuuʃiku Ɂan yakuuna ʕudwaanan.
   “The attack on (criticism of) authors was almost an aggression”
iv. maʕa Ɂintiẓaarin liwaʃaki qiyaami Ɂal-saaʕati
   “…while waiting for doomsday”
v. wa kaan qabla kulli ʃayɁin muqtaṣidan yuuʃiku Ɂiqtiṣaadu-hu Ɂan yabluxa Ɂal-buxla
   “He was, first of all, thrifty, almost greedy.”
vi. wa Ɂawʃakat Ɂal-fawḑa Ɂan tasuuda
   “The situation was about to become chaotic.”
vii. ḥatta Ɂawʃakat Ɂan tatruka fi ḥayaati Ɂal-fata ɁaƟaaran munkirat
   “…it was about to have adverse effects on the life of the boy.”
viii. wa Ɂana raʒulun qad Ɂawʃaktu ʕala Ɂal-kibari
   “I am a man about to reach old age”
ix. wa Ɂawʃakat Ɂiʒaazati ʕala Ɂal-intihaaɁi
   “...and my vacation was about to end.”
x. ʕinda ḥaaffati Ɂal-sariiri, tadnu min δura Ɂal-ʃaʒani, tuuʃiku Ɂan tadmaʕa.
   “On the bedside, becoming extremely sad, about to burst into tears.”
These patterns thus show that the verb has a narrow range both in terms of collocation and colligation; only a limited set of words are co-selected and these words occur in a specific structure. With regard to grammatical distribution (colligation), the verb is always followed, within a four-word span, by either a complementizer followed by a verb in the imperfective or a preposition followed by an NP. In terms of lexical co-selection (collocation), the complementizer is always Ɂan and the preposition is always ʕala. Having outlined these regularities in terms of both the syntactic patterning and the lexical co-selection of the verb Ɂawʃaka, let us turn to some pragmatic aspects of the verb; that is, what is expressed or implied by the verb when it is associated with other words.

2.2 Semantic preference and prosody
A closer look at the collocates of Ɂawʃaka reveals that 30 out of the 50 occurrences of the verb found in the corpus refer to a situation where something unpleasant or undesirable is about to happen or came close to taking place. In fact, the semantic preference of this verb, i.e., the set of terms it co-selects, is usually for words of an unpleasant nature such as bankruptcy (two occurrences), death (two occurrences), burning, aggression, doomsday, disorder, chaos, weeping, confusion, shut horizons, greed, infatuation, adverse consequences, pitfalls, abyss, aging, etc. The examples in (1) above, where the underlined words in italic are co-selected by the verb Ɂawʃaka, illustrate some of the semantic preferences mentioned above: a country on the verge of bankruptcy, a person about to die, waiting for doomsday, a situation about to become chaotic, a person about to become old, etc. The co-selected nouns of unpleasant nature may directly follow the verb Ɂawʃaka or may be separated from it by two, three, or four words (usually the subject NP that may be followed by a PP and possibly a complement clause containing the collocate).
The result is thus an expressive connotation of an undesirable state of affairs that is (likely) about to take place. The illocutionary intent of the writer/speaker can only be extracted over a wide span of words, i.e., through what Sinclair (1991) has termed semantic prosody. Lexico-semantic information of this nature is very useful for the learner of Arabic, but dictionaries of Arabic do not provide it. Both of the widely-used Arabic dictionaries mentioned above are silent about such semantic prosody.
They only give two definitions for the verb Ɂawʃaka as stated earlier. The fact that specific connotations are associated with the verb Ɂawʃaka in 60% of the cases encountered in the corpus data shows that this usage is not due to chance, but that this verb is typically used with that particular coloring in Arabic. Going through other collocates of the verb Ɂawʃaka, one will also note that there is another semantic prosody implying that some process or state of affairs is about to come to an end and, in most cases, this ending is disastrous, undesirable, unwelcome or regrettable. In these cases, the verb Ɂawʃaka has semantic preferences for words such as “end, completion, finish, exhaustion, termination, separation”, etc. In addition to what is implied in some of the examples in (1) above, where “life and the world are coming to an end” in (ii) and (iv) respectively, “the end of youth” in (viii) and “the end of vacation” in (ix), there are at least 12 other occurrences of Ɂawʃaka with similar connotations as illustrated in the following examples:
(2)
i. wa kaana rubbama taʕarraḍa li-baʕḍi Ɂal-hammi ḥiina yuuʃiku Ɂal-ʃahru Ɂan yanqaḍi
   “He might be met with some grief when the month was about to end”
ii. wa yuuʃiku ma bayna Ɂaydii-himaa min Ɂal-maali Ɂan yanfaδa
   “and will be about to have spent all the money they have in their hands”
iii. fal-sanatu tuuʃiku ʕala nihaayati-ha wa la budda lii Ɂan Ɂastaʕidda lil Ɂimtiḥaani
   “The (academic) year is about to end and I must get ready for the exam.”
iv. wa lamma qaṣura Ɂalʕumru wa Ɂawʃaka ʕala Ɂan-nihaayati qaal lahaa…
   “When he started aging and his life was about to end, he said to her…”
v. wa qad Ɂawʃaka al-ṣayfu Ɂan yaḍalla-na wa sa-naftariqu
   “The summer is approaching and we will be separated.”
vi. wa Ɂiδa ḥadaƟa Ɂan Ɂawʃaka bankun ʕala Ɂal-Ɂiflaasi
   “If it happens that a bank is on the verge of bankruptcy”
In (2i), we have a situation where a person becomes anxious when the end of the month approaches, namely because he is going to be short of money as shown in example (2ii), which I assume follows sentence (i) in the same text. The verb Ɂawʃaka in (2i) has two key collocates, hamm “grief” and yanqaḍi “ends”. In (2iii), a student is worried about the academic year coming to an end because he has to get ready for his exams. In (2iv) a person is aging and reaching the end of his life. In (2v) the summer is approaching, and some state of affairs, perhaps the school year, is coming to an end, which will lead to separation from friends or loved ones. There is thus an underlying regularity in the communicative use of this verb that is not inherent in the meaning provided in dictionaries but consistently emerges from the co-text. The semantic prosodies discussed above are present in the majority of the occurrences of the verb Ɂawʃaka. There are, however, a minority of cases where the verb is used in a neutral way as shown in (3) below. No expressive connotation of the type found in the examples discussed above can be inferred from the words surrounding the verb Ɂawʃaka; only one of the inherent meanings defined in the dictionary seems to be conveyed. There are a few additional occurrences of the verb where the context provided by the concordance lines was not sufficient to allow for an adequate assessment of semantic prosody.
(3)
i. wa kunta taraa-ha ḥiina yartafiʕu Ɂal-ḍuḥaa wa yuuʃiku Ɂan-nahaaru Ɂan yantaṣifa…
   “You would see her during forenoon and as midday approaches”
ii. Ɂila Ɂal-ḥaddi Ɂallaδi tuuʃiku maʕa-hu Ɂan takuuna Ɂal-qaasima Ɂal-muʃtaraka Ɂal-Ɂaʕẓam
   “...to the point where it would almost be the largest common denominator”
iii. wa haakaδa yantahi ʕindamaa Ɂawʃaka ʕala Ɂal-xatmi Ɂila Ɂal-qawli …
   “...and so he ends by saying when he is about to conclude….”
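The span-based reading of collocates used throughout section 2 can be sketched in Python. This is only a rough illustration, not the procedure used in the study: the concordance lines are given in a simplified ASCII transliteration, the list of word forms of the verb and the hand-made set of "marked" (unpleasant) collocates are assumptions, and only the window to the right of the verb is inspected.

```python
def collocates(tokens, node_forms, span=4):
    """Return, for each node word, the tokens within `span` positions to its right."""
    found = []
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            found.append(tokens[i + 1:i + 1 + span])
    return found


def prosody_share(lines, node_forms, marked, span=4):
    """Fraction of node occurrences with a marked collocate inside the span."""
    windows = [w for line in lines for w in collocates(line, node_forms, span)]
    if not windows:
        return 0.0
    hits = sum(1 for w in windows if any(tok in marked for tok in w))
    return hits / len(windows)


# ASCII-simplified versions of (1vi), (1ii) and (3i); the simplification is an assumption.
LINES = [
    "wa awshakat al-fawda an tasuuda".split(),
    "wa hiinan kuntu uushiku ala al-halaak".split(),
    "wa yuushiku an-nahaaru an yantasifa".split(),
]
NODE_FORMS = {"awshakat", "uushiku", "yuushiku"}   # assumed word forms of the verb
MARKED = {"al-fawda", "al-halaak"}                 # hypothetical hand-made list

share = prosody_share(LINES, NODE_FORMS, MARKED)
print(round(share, 2))
```

On a real corpus the marked set would have to be established by inspecting the concordance lines, as is done in the text, rather than fixed in advance.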
Except for some words such as technical terms, it is a well-known fact by now that a word interacts with other words it co-selects to create meaning and in some cases to allow for the communicative intent to emerge. When a statistical tendency for a given expressive connotation emerges from corpus data as in the case of the verb Ɂawʃaka, dictionary definitions should include that connotation. It is very useful for the learner/user of the language to come to grips with both the literal and the implied meaning of a lexical item, especially when one is dealing with a language that is not spoken as a mother tongue such as Modern Standard Arabic. I assume that if some proficient users of Arabic were asked to provide the meaning of the verb Ɂawʃaka, it would be unlikely that their definitions would allude to the semantic prosodies discussed above; one would not expect them to have more insights than lexicographers. However, there is overwhelming evidence from corpus analysis that the implied meaning of the verb is part of their linguistic competence or intuition, a compelling reason to produce corpus-based dictionaries of Arabic.

3. The Particle qad
3.1 Preliminary remarks
I will now turn to qad, a word of a different nature. Unlike the verb Ɂawʃaka, qad is a verbal particle that is very frequent, as it accounts for 0.3 to 1% of all the words in the corpus depending on the type of text. Initially, the corpus contained 6,314 occurrences of qad including cases when it is preceded by a coordinating conjunction (wa-qad and fa-qad) and the assertive particle (la-qad). Only occurrences of bare qad and wa-qad (4,532 occurrences) were considered for this paper, which finally amounted to about 4,000 occurrences after leaving out duplicates, quotations, insufficient context, and errors. During the course of data analysis, additional texts in electronic form were made available. These include the classical texts from Ɂal-buxalaaɁ that provided an additional 226 occurrences of bare qad, and large corpora from current newspapers. After going over a couple of million words from the newly-acquired corpus from newspapers, there did not seem to be any new patterns emerging; so I stopped searching for other occurrences in newspaper texts. I finally selected the two forms qad and wa-qad that account for 74% of the occurrences of all the forms of qad. Within the (wa)-qad group, bare qad represents 54% of the occurrences. To my knowledge, scholars have addressed mainly the function of the verbal particle qad with respect to tense, aspect, or modality.
For a discussion of these uses of qad, the reader may refer to Bahloul (1994). No serious claims will be made in this paper as to the temporal, aspectual, or pragmatic functions of qad. The main focus will be on the collocation/colligation pattern of this verbal particle and its role in clausal structure. Thus, relatively frequent occurrences of (wa)-qad introducing canonical sentences will not be of much interest here.
Before examining the patterns of usage of (wa)-qad as they emerge from the corpus data, let us first see what the dictionaries have to say. Note that qad is part of the entry for the root qadd or qadad which includes other lexemes that are synchronically unrelated to the verb particle qad, such as qadd “stature”, qadiid “jerked meat”, etc. For Ɂal-muʕjim Ɂal-wasiiṭ, qad: (a) is a particle which, when preceding a verb in the perfective, conveys emphasis, and when preceding a verb in the imperfective, the verb complex will convey possibility or doubt and in other cases diminutive or augmentative meaning; (b) may also be in the form of a verbal noun such as in qadni dirham, meaning “one Dirham is enough for me”. (a) and (b) are obviously separate lexemes. lisaan Ɂal-ʕarab Ɂal-muḥiiṭ lists the above forms and usages with more details and examples from classical poetry and adds others, namely, (c) qadi used in elliptical constructions where the following verb is omitted. This usage is reminiscent of the elliptical use in English of the auxiliary “do” as in “we didn’t go but it’s as if we did”. Qadi is used here like “did” in English where the VP is deleted. This dictionary also mentions that a verb in the perfective can be an adjunct (ḥaal) only if it is preceded by qad explicitly or implicitly, a grammatical structure that will be dealt with in some detail below. It is also worth noting at this stage that neither of the forms qadni and qadi in (b) and (c) above appeared in the corpus data, whether in classical or modern texts.

3.2 Structural patterns of (wa)-qad
I will first make some general observations on the major co-texts where (wa)-qad appeared in the corpus, focusing mainly on the relatively frequent patterns. The first general observation is that 75% of the occurrences of (wa)-qad are found before the perfective. The high frequency of the occurrence of (wa)-qad with the perfective is not stable throughout the different corpus samples.
In books such as Ɂal-Ɂayyaam, (wa)-qad+perfective accounts for most of the occurrences (98%). In doctoral theses, the compound (wa)-qad+imperfective is more common than (wa)-qad+perfective (54% and 48%, respectively). This may be due to the fact that reporting on the results of scholarly work generally calls for circumspection, thus the use of (wa)-qad with the imperfective to express probability or possibility when explaining some observed phenomena. Similarly, in newspapers qad+imperfective is used in editorials 50% of the time but only 8% of the time in reports. Having noted this, I will not have anything else to say about the use of (wa)-qad with the imperfective and will concentrate on its use with the perfective in the rest of the paper.

3.2.1 Bare qad
Going over all the concordance lines, one notes that the major collocate of qad is the complementizer Ɂanna. The two other complementizers Ɂinna and, to a much lesser extent, Ɂan are also possible in related constructions as will be explained below. In fact, 39% (about 680 lines) of the occurrences of qad appeared in one of the following typical structures:

(a) (main clause)+Ɂanna+NP+qad+verb in the perfective, that is, a complement clause with an SVO word order introduced by the complementizer Ɂanna. The NP subject of the complement clause may be just a noun as in (4ii, iv), a noun with modifiers as in (4i), or a clitic subject pronoun attached to Ɂanna (4iii). The NP subject is followed by qad, then the verb in the perfective. These constructions can be matched with English “that clauses” where the complementizer “that” corresponds to Ɂanna in the Arabic constructions and qad would occur in the bracketed empty slots in the English translations below. Note also that the complementizer Ɂanna may be preceded by a preposition that can be cliticized such as bi- in bi-Ɂanna (see 4iv). These prepositions, when they occur, are required by the type of verb in the main clause and are not pertinent in the pattern as co-selections of qad.

(4)
i. …Ɂaslafna Ɂanna [mawḍuuʕa Ɂal-zamaani fi kutubi Ɂal-qudaaama]NP qad ɁuƟira ʕaraḍan
   “…we have previously stated that the question of time in classical books [….] was raised accidentally.”
ii. laakin yabdu Ɂanna [Ɂal- Ɂafʕaa]NP qad ḥassat bi-Ɂal-xaṭari Ɂal-waʃiiki….
   “But it seems that the viper [….] had sensed imminent danger.”
iii. … wa raɁaa Ɂanna -[hu]NP qad Ɂaḍaaʕa maa yakfii min waqtin wa ʒuhdin
   “…and he realized that he [….] had wasted enough time and effort.”
iv. wa Ɂaḥassa Ɂal-ʕamm maḥfuuẓ bi-Ɂanna [Ɂal-laḥẓata]NP qad Ɂiqtarabat…
   “…and uncle Mahfuuz felt that the moment [….] is approaching.”
As illustrated in (5) below, Ɂanna may also be preceded by li- (li-Ɂanna) “because”, ka- (ka-Ɂanna) “as if”, raʁma (raʁma Ɂanna) “despite”, ʁayra (ʁayra Ɂanna), Ɂilla (Ɂilla Ɂanna) “but, however, although”, etc. Ɂanna and whatever precedes it in these constructions also function as a complementizer introducing adverbial clauses that are fully formed sentences having the same pattern with respect to qad as the complement sentences above, that is, … Ɂanna+NP+qad+verb in the perfective.
(5)
iii. sa-taraa-ha ka-Ɂal-Ɂaalati waqafat bal sadaɁat li-Ɂanna [muḥarrikaha]NP qad Ɂuntuziʕa min-ha.
   “You will see her like a machine that stopped working and rusted because its engine […] has been removed.”
iv. … Ɂilla Ɂanna [Ɂan-numuwwa Ɂad-diimuʁraafii]NP qad tamayyaza huwwa Ɂal-Ɂaaxar bi-quwwati-hi Ɂal-haaɁila
   “However, population growth [….] was also extremely important.”
v. raʁma Ɂanna[-humaa , Ɂal-ɁiƟnatayni,]DP qad subiqataa bi-ḥarfi Ɂal-ẓaaɁ Ɂal-mufaxxim
   “Although they [….] were both preceded by the emphatic consonant ẓaaɁ”
vi. wa tuṭʕimu Ɂal-ṣaadira wal-waarida ka-Ɂanna [Ɂallaaha]NP qad Ɂistaxlafa-ka ʕalaa rizqi-himaa
   “You are feeding everyone as if God [….] entrusted you with their livelihood.”
(b) Ɂinna+NP+(adverb, PP)+qad+verb in the perfective2 where Ɂinna introduces a root sentence and can be preceded by the preposition fa (fa-Ɂinna). There are also several occurrences of the same pattern with the conjunction lakinna “but” in the initial position of a root sentence. Note also that, like Ɂanna, Ɂinna introduces embedded sentences that are complements of the verb qaala (a verb of saying). The colligation pattern of qad in these sentences remains the same. The concordance lines in (6) are illustrations of these co-selections of qad. The bracketed empty slots in the English translations show the positions corresponding to Ɂinna and qad.
2 The verb following qad may also be in the imperfective, but as stated earlier, qad+imperfective expressions will not be dealt with in this paper.
(6)
i. Ɂinna [haaδayni Ɂal-qaanuunayni]NP qad Ɂabrazaa la-naa Ɂahammiyyata Ɂal-ʕaamili Ɂaz-zamani
   “[....] these two laws [….] have highlighted for us the importance of the time factor.”
ii. Ɂinna [Ɂal-fursa]NP qad Ɂaʒalluu muluuka-hum Ɂiʒlaala-hum li-Ɂal-Ɂaalihati
   “[....] the Persians [….] have venerated their kings the same way they venerated Gods”
iii. wa bi-Ɂal-fiʕli fa-Ɂinna [haaɁulaaɁi]NP qad Ɂistaṭaaʕuu Ɂan yataɁaamaruu ḍidda-hum
   “Indeed [….] these [….] have been able to conspire against them.”
iv. wa lakinna -[hu]NP maʕa δaalika qad ḍaraba li-Ɂal-fataa mawʕidan
   “but, anyhow, he [….] gave an appointment to the boy.”
v. wa yuqaalu Ɂinna -[hu]NP qad ʕammara Ɂarbaʕa miɁati ʕaamin
   “It is said that he [….] had lived four hundred years.”
Co-occurrences of qad with Ɂan are rare in this corpus and totally absent from newspapers, theses, and other modern essays and books. There are about half a dozen concordance lines, most of which come from novels. The typical structure is a verbal complement sentence (VSO order) introduced by Ɂan directly followed by qad then the verb in the perfective: main clause+Ɂan+qad+verb in the perfective, as illustrated by the examples in (7) below.

(7)
i. wa Ɂiδa huwa yunbiɁu-humaa bi-Ɂan qad Ɂaana la-humaa Ɂan yusaafiraa
   “and there he was informing them that [….] it was time for them to travel.”
ii. wa min-al-muḥqqaqi Ɂayḍan Ɂan qad kaana la-hum fi-Ɂal-rubʕi zumalaaɁun
   “It is also certain that [….] they have classmates in the area.”
iii. wa Ɂaʃʕara-hu bi-Ɂan qad Ɂutiiḥa la-hu Ɂan yaʒlisa
   “He let him know that [….] he was permitted to sit down.”
iv. wa xuyyila Ɂilaa haδihi Ɂal-Ɂumm Ɂ-taʕiisati Ɂan qad samiʕa Ɂallaahu la-haa wa li-zawʒi-ha
   “this miserable mother had the impression that [….] God has heard her prayers and those of her husband.”
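The three complementizer patterns described above (Ɂanna, Ɂinna and Ɂan before qad) lend themselves to a simple tally over concordance lines. The Python sketch below is a hedged illustration only: it works on a simplified ASCII transliteration, the function names and the four-token window are assumptions, and the clitic stripping is deliberately naive (a token such as an-nahaaru, with an assimilated article, would be misread as the complementizer an).

```python
from collections import Counter

COMPLEMENTIZERS = {"anna", "inna", "an"}


def strip_clitics(token):
    """Return the complementizer hidden in a hyphenated token, else the token itself.

    For example 'bi-anna' -> 'anna' and 'inna-hu' -> 'inna'.
    """
    for part in token.split("-"):
        if part in COMPLEMENTIZERS:
            return part
    return token


def qad_trigger(tokens, window=4):
    """Return the nearest complementizer to the left of 'qad' within `window` tokens."""
    if "qad" not in tokens:
        return None
    i = tokens.index("qad")
    for tok in reversed(tokens[max(0, i - window):i]):
        core = strip_clitics(tok)
        if core in COMPLEMENTIZERS:
            return core
    return None


# ASCII-simplified fragments of (4ii), (6v), (7iii) and (2v); an assumption for illustration.
SAMPLE = [
    "laakin yabdu anna al-afaa qad hassat".split(),
    "wa yuqaalu inna-hu qad ammara".split(),
    "wa ashara-hu bi-an qad utiiha".split(),
    "wa qad awshaka al-sayfu".split(),
]

tally = Counter(qad_trigger(line) for line in SAMPLE)
print(tally)
```

Lines with no complementizer in the window are counted under `None`, which keeps the proportions comparable to figures such as the 39% reported for Ɂanna/Ɂinna.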
The interesting observation to be made here is that while qad co-selects Ɂanna/Ɂinna 39% of the time in the corpus as a whole, this pattern accounts for 50% of the occurrences of qad+perfective in journalistic texts. In fact, as will be shown below, the remaining 50% of the occurrences of qad in journalistic style are in the verbal complex kaana…qad, which practically leaves no room for any other possibility. One may wonder whether qad in these constructions is intended for tense, aspect or modality as has been claimed in the literature, or is rather part of a pattern where its use is almost automatically triggered by other words. Given the high frequency of these patterns, especially in Modern Standard Arabic as used in periodicals, I believe that collocations (and colligations) are being formed with qad with limited choice for the writer as to the phrasal context in which it should occur. Being repeatedly used in similar patterns, qad may be losing its role as an aspectual or assertive particle to become a regularly co-selected item in a phrasal expression.

The second major collocate of qad is kaana in the verbal complex kaan(…)qad+perfective, which accounts for 32% of all the occurrences of qad (580 concordance lines) of the general corpus, but the rates of occurrence are not uniform across all types of texts. These constructions are relatively rare in classical texts such as those of Ɂal-jaaḥiẓ and Ɂat-tawḥiidi (13% and 8%, respectively) and very frequent in modern writing such as in recent doctoral theses and newspapers (64% and 50%, respectively). This trend may very well indicate that kaan(…)qad+perfective is gaining the status of a collocation in Modern Standard Arabic irrespective of whether qad has retained its aspectual or modal functions. In these collocations, qad may directly follow kaana, or the two words may be separated by the NP subject: kaana+NP+qad as in (8i), (8ii), and (8iii) or NP+kaana+qad as in (8iv) and (8v). The latter may be nested in a complement clause introduced by Ɂanna/Ɂinna. Note that the lexical subject NP may follow the verbal complex and other material (8vi) or be absent from the immediate co-text (8vii).
The examples below illustrate these structures. (8)
i. wa kaana [Ɂal-kongres Ɂal-Ɂamriiki]NP qad haddada bi-waqfi Ɂal-musaaʕadaati Ɂallati tamnaḥu-ha Ɂal-wilaayaatu… “The American Congress had threatened to stop the aid provided by the United States.”
ii. wa kaana [Ɂal-druuz]NP qad Ɵaaruu ʕala firansa sanata 1925 “The Druuz revolted against France in 1925.”
iii. wa kaana [lubnaan]NP qad ṭaalaba maʒlisa Ɂal-Ɂamn Ɂal-dawli Ɂal-yawm bi-Ɂtixaaδi ɁiʒraaɁaatin “Lebanon asked the (UN) Security Council today to take measures”
iv. Ɂiδ Ɂanna [Ɂal-mufaawaḍaat]NP kaanat qad badaɁat Ɂawwalan maʕa raʒuli Ɂal-Ɂaʕmaali Ɂal-Ɂustraali “Given that negotiations had initially started with the Australian businessman.”
v. fa-Ɂinna[hu]NP yakuunu qad ḥadaƟa fiʕlan fi ɁaƟnaaɁi δaalika Ɂal-waqti “It must have then really happened during that time.”
vi. Ɂiδa kaana qad sabaqa-hu Ɂilaa Ɂal-kalaami fi-l-mawḍuuʕi [xuṭabaaɁun]NP. “...if (other) speakers had preceded him in talking about the subject”
vii. wa kaana qad ʕaqada Ɂawwala Ɂiʒtimaaʕin la-hu fi Ɂayyaar (maayu) Ɂal-maaḍi “It held its first meeting last May.”
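The proportions reported for the kaana…qad patterns come from sorting concordance lines by what surrounds qad. A minimal sketch of such a tally is given below, assuming whitespace-tokenized transliterated lines; the list of kaana forms, the four-token window, and the function names classify and pattern_shares are illustrative assumptions, not the procedure actually used for this study.

```python
import re
from collections import Counter

# Hypothetical, incomplete list of transliterated forms of kaana.
KAANA_FORMS = {"kaana", "kaanat", "kaanuu", "yakuunu", "yakuuna", "kuntu"}
# Ɂanna / Ɂinna, optionally with a proclitic and an attached pronoun.
ANNA = re.compile(r"^(?:fa-|wa-)?Ɂ[ai]nna")


def classify(line, window=4):
    """Label one concordance line by the context of its first bare qad."""
    tokens = line.split()
    if "qad" not in tokens:
        return "no_qad"
    i = tokens.index("qad")
    # qad immediately followed by a form of kaana
    if i + 1 < len(tokens) and tokens[i + 1] in KAANA_FORMS:
        return "qad+kaana"
    left = tokens[max(0, i - window):i]
    # a form of kaana within the window to the left of qad
    if any(t in KAANA_FORMS for t in left):
        return "kaana(...)qad"
    # a complementizer within the window to the left of qad
    if any(ANNA.match(t) for t in left):
        return "Ɂanna/Ɂinna(...)qad"
    return "other"


def pattern_shares(lines):
    """Percentage of qad lines falling into each pattern."""
    counts = Counter(classify(l) for l in lines if "qad" in l.split())
    total = sum(counts.values())
    return {p: round(100 * n / total, 1) for p, n in counts.items()}


lines = [
    "wa kaana [Ɂal-druuz]NP qad Ɵaaruu ʕala firansa sanata 1925",
    "Ɂiδ Ɂanna [Ɂal-mufaawaḍaat]NP kaanat qad badaɁat Ɂawwalan",
    "wa yabduu Ɂanna Ɂabaa tammaam qad kaana waaʕiyan bi-δaalika",
]
print(pattern_shares(lines))
```

A fixed window is a crude stand-in for reading each concordance line; lines labeled "other" would still need manual inspection.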
A much less frequent collocation, in which qad co-selects kaana, is one where kaana immediately follows qad. The typical patterns are either qad+kaana+verb (in the imperfective) such as: ḥatta Ɂintahaa Ɂilaa ʁadiiri maaɁin kaƟiiri Ɂal-ḍafaadiʕi qad kaana yaɁtii-hi … “until he reached a pond full of frogs where he used to go…”
or qad+kaana+adjective (a predicate complement in general) such as in: wa yabduu Ɂanna Ɂabaa tammaam qad kaana waaʕiyan bi-δaalika “It seems that Abaa Tammaam was aware of that.”
These constructions account for only 3.5% of the occurrences of qad, but nonetheless constitute clear cases of collocations. Note that these patterns would be relatively more frequent if one considered colligations; that is, not the exact lexical items themselves but their word class. In these patterns, qad is often followed by one of the “sisters” of kaana, such as Ɂaṣbaḥa, ṣaara, baata, etc., which are inchoative predicates meaning roughly “become”, as well as by other verbs. It is also interesting to note that qad kaana constructions are strikingly absent from recent newspaper texts and doctoral theses in
this corpus. These types of texts, however, exhibit very high frequencies of occurrences of collocations where qad co-selects kaana in the imperfective to indicate probability, but as stated above these patterns are not dealt with in this paper. The corpus also comprises a range of collocations involving qad, most of which come from literary sources, and some are rather obsolete constructions. The most frequent of these collocations (4% of the total occurrences of qad) is when qad starts (or is part of) a quotation following the verb qaal “say” as in: wa qaala: “qad qabiltu δaalika Ɂayyuha Ɂal-Ɂamiir.” “He said: ‘I accept, your Royal Highness’.”
The other collocations are much less frequent, but each one of them appeared at least ten times in the corpus, with (f) and (h) below found only in classical literature.
a. (fa)(wa)+Ɂiδa (bi)+NP+qad+verb “(and then) and all of a sudden NP”
b. wa law +qad+ verb “if (only) NP”
c. kam min+ NP+ qad+ verb “how many NP”
d. Ɂawa laysa+ qad+ verb “didn’t (had not, was it not the case that) NP”
e. rubba +NP+ qad+ verb “many a NP”
f. falaa yazaalu +qad+ verb “NP still (yet, didn’t stop)”
g. ma daama(NP)+ qad+ verb “since (NP)”
h. qad+ wallaahi+ verb “qad +by God verb” (this is the only construction where qad may be separated from the following verb)
Before turning to some syntactic aspects pertaining to the distribution of qad, I would like to make one further comment on what seems to me to be a change in the use of this particle in modern writing. It is argued that one of the major functions of qad is to denote a completed action. The Hans Wehr Arabic-English dictionary gives the word “already” as a possible translation for “qad”. There is also the word baʕdu which means “already, yet” in Arabic. For “purists”, using qad and baʕdu together would be a pleonasm since one of them is redundant as it adds no information not already provided by the other. Surprisingly, the corpus contains 21 instances of the occurrences of these two words together in the same sentence. These patterns come
from modern texts, including scholarly essays. The examples in (9) below illustrate these would-be “pleonasms”. (9)
i. laʕalla Ɂal-qaariɁa qad fahima baʕdu Ɂannana lam nudrik… “The reader may have already understood that we have not realized”
ii. ḥatta yakuuna Ɂal-lisaanu qad tahayyaɁa baʕdu li-l-ḥarakati Ɂal-muwaaliyati. “...until the tongue is already in a position for the following vowel.”
iii. wa lam yakun baʕdu qad balaʁa sinna Ɂar-ruʒuulati “and he had not yet reached adulthood.”
iv. kaana taariixu Ɂal-miitaafiizika qad ḥarrara baʕdu Ɂat-tafkiira Ɂal-falsafii “The history of metaphysics had already liberated philosophical thinking.”
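A count such as the 21 sentences in which qad and baʕdu occur together can be obtained by a simple same-sentence co-occurrence check. The sketch below assumes one transliterated sentence per string and whitespace tokenization; the function name cooccurrences is hypothetical.

```python
def cooccurrences(sentences, first="qad", second="baʕdu"):
    """Return the sentences that contain both tokens, in any order."""
    hits = []
    for sentence in sentences:
        tokens = sentence.split()
        if first in tokens and second in tokens:
            hits.append(sentence)
    return hits


sample = [
    "laʕalla Ɂal-qaariɁa qad fahima baʕdu Ɂannana lam nudrik",
    "wa lam yakun baʕdu qad balaʁa sinna Ɂar-ruʒuulati",
    "wa kaana qad ʕaqada Ɂawwala Ɂiʒtimaaʕin la-hu fi Ɂayyaar",
]
print(len(cooccurrences(sample)))
```

The check ignores word order on purpose, since baʕdu follows qad in (9i) but precedes it in (9iii).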
When confronted with these examples, a specialist of Arabic retorted “people don’t know Arabic anymore”. That may not be the only reason, however. Factors affecting language change, such as transfer from English and French, are causing the emergence of a new usage where qad may be in the process of forming a collocation with baʕdu to achieve the function it used to achieve on its own. The remainder of the description of bare qad will be devoted to occurrences where I believe its presence is motivated by syntactic structure regardless of any additional underlying assertive, temporal or aspectual interpretations. Consider the examples in (10) below where the bracketed strings in English are the translations of what immediately follows qad in the original Arabic sentences. (10)
i. falaa yazaalu qad ʁaṣṣa….. “He kept on [choking…]”
ii. haaδa Ɂal-ʃaaʕiru Ɂallaδi badaa fi naẓari Ɂal-Ɂaxbaariyyiina qad ẓafara bi-ma ɁaxṭaɁa-hu ʁayru-hu. “This poet, who according to historians, seemed [to have succeeded where others have failed].”
iii. fa-lamma naẓara Ɂilaa Ɂar-raʒuli qad inƟanaa raaʒiʕan… “When he saw the man [go back]. (when he noted [that the man was going back]).”
iv. fa-lamma raɁaa faraḥa-hu qad Ɂaḍʕafa qaal Ɂinna faraḥa-ka… “When he saw (felt) that his joy had increased, (when he saw his joy [increase]) he said: ‘your joy …”
v. fa-waʒadtu-humaa qad faxuraa ʕalayya bi-maa ḥabaahumaa bi-hi … “I found them [boasting about the favor they obtained]… or I found that they were boasting…”
vi. yabdu lii Ɂal-baabu qad qudda min xaʃabin qadiimin “The door seems to me [to have been carved out of old wood], or (It seems to me that the door….)”
vii. Ɂamma Ɂal-marɁatu fa-hiya ʒaalisatun ʕala-r-rimaali qad rafaʕat Ɂiḥda rukbatay-ha wa … “As to the woman, she was sitting on the sand [bending up one of her knees and…]”
viii. kuntu kaƟiiran maa Ɂaḥussu bi-Ɂaxii Ɂal-Ɂaʁbar qad Ɂaxraʒa raɁsa-hu min-al-faʒwati “I often sensed my dusty (or proper N) brother [stick his head out of the opening].”
ix. Ɂila daaɁirati-l- Ɂistiʃraafi Ɂal-firansii Ɂallati tabduu qad faqadat maa kaana la-haa… “...to the French supervising body which seems [to have lost what it had]…”
x. fa-ḥasiba nafsa-hu qad Ɂaḍḥaa fi Ɂaḥqari qaryatin min quraa.. “He believed himself [to be in one of the most despicable villages]…”
xi. wa-qtaraba kandiid min-humaa faraɁaa δaalika Ɂal-muḥsina Ɂilay-hi qad ṭafaa li-burhatin Ɵumma ʁamara-hu Ɂal-yammu… “Candide came closer to them and saw the one who was good to him [float for a short period then sink in the sea].”
xii. fa-taʒidu baʕḍa tilka Ɂal-masaaɁila Ɂal-dustuuriyya qad ʕaalaʒuu-ha min baabin yataʕallaqu bi-Ɂal-qaḍaaɁ “You find [that they have treated some of those constitutional issues from a legal perspective].”
xiii. yuɁƟiruuna Ɂal-ʒuluusa ʕalaa haaδihi Ɂal-ḥuṣuri wa-l-Ɂabsiṭati qad Ɂulqiyat ʕala-l-Ɂarḍi… “They like to sit on these mats and carpets [(which) were laid on the floor…]”
Sentence (10i), from classical literature, represents a pattern that is absent from modern writing in this corpus. The collocation laa yazaalu qad is present in texts from the same period (the Middle Ages). This sentence would be ill-formed if qad were removed. Sentence (10ii) is also ungrammatical, in my opinion, if qad is deleted. Without qad, sentence (10iii) is ambiguous: a) either the subject of the verb ɁinƟanaa “go back” is the NP Ɂar-raʒul “the man”, in which case ɁinƟanaa raaʒiʕan is a circumstantial clause (an adjunct), or b) the subject of the verb ɁinƟanaa is the same as that of the verb naẓara “saw”. The sentence can be disambiguated by changing the inflection on the first verb naẓara, for example, to naẓarna “we saw”; nonetheless, the sentence is considered to “read much better” with qad introducing the circumstantial clause.3 Syntactically then, qad seems to be similar to the complementizer “that” in English, and sentence (10iii) can equally be translated as “when he noted that the man was going back…” The same observations can be made with respect to the role of qad in many of the sentences in (10). In fact, in most of these cases qad is used to introduce a complement clause and is found in constructions where in English, for example, one finds:
a) “That” before a finite complement sentence,
b) A non-finite ordinary clause with a PRO subject, and
c) A non-finite exceptional clause with a cognitive verb. (Note the presence of verbs such as waʒada “find”, ḥasiba “consider”, raɁaa “see, feel”, etc.).
There are also frequent occurrences of qad in clauses following the verb badaa “seem”. Arguing for the assertive properties of qad, Bahloul (1994) suggested that it is less likely to appear in constructions following yabdu “it seems” because this verb is far from being assertive (p. 122). Data from this corpus, however, show that badaa/yabduu...qad constructions are very common, as illustrated in (10ii), (10vi) and (10ix). This constitutes further evidence, in my view, that qad is often needed for clausal syntax regardless of pragmatic considerations. When asked about the role of qad in these constructions, some teachers of Arabic say that it is there for linking.
One argument for my intuition that the main role of qad in these patterns is to introduce clauses is the fact that it may often be easily replaced with the complementizer Ɂanna. In some of these constructions, Ɂanna may occupy the exact position of qad; in others a change is needed at the level of clausal structure. Examples (10ii), (10vi), (10v), (10ix) and (10xii) are repeated below as (11i), (11ii), (11iii), (11iv), and (11v), with Ɂanna instead of qad. Note that in (11i), (11iii), and (11iv) Ɂanna occupies the same position as qad in the sentence and that there is no need to change the translation.4
3 According to the judgments of some colleagues who teach Arabic language and literature.
(11)
i. haaδa Ɂal-ʃaaʕiru Ɂallaδi badaa fi naẓari Ɂal-Ɂaxbaariyyiina Ɂanna-hu ẓafara bi-ma ɁaxṭaɁa-hu ʁayru-hu. “this poet, who according to historians, seemed to have succeeded where others have failed.”
ii. yabdu lii Ɂanna Ɂal-baaba qudda min xaʃabin qadiimin “It seems to me that the door has been carved out of old wood.”
iii. fa-waʒadtu Ɂanna-humaa faxuraa ʕalayya bi-maa ḥabaa-humaa bi-hi … “I found them boasting about the favor they obtained… or I found that they were boasting….”
iv. Ɂila daaɁirati Ɂal- Ɂistiʃraafi Ɂal-firansii Ɂallati tabduu Ɂanna-ha faqadat maa kaana la-haa… “...to the French supervising body which seems to have lost what it had….”
v. fa-taʒidu Ɂanna-hum ʕaalaʒuu baʕḍa tilka Ɂal-masaaɁili Ɂal-dustuuriyya min baabin yataʕallaqu bi-l-qaḍaaɁ “You find that they have treated some of those constitutional issues from a legal perspective”
The “logical connector” function of qad is not limited, in my view, to replacing the complementizer Ɂanna. It is often required by the syntax when a modifying construction is introduced or a fresh start is needed. In example (10xiii), qad is followed by an attribute and can perfectly be replaced by the relative pronoun Ɂallati. In example (10vii), what follows qad is an adverbial clause. In various other constructions qad, usually preceded by a comma, is found at the start of a new sentence that is related to a previous one as in the following example: kaana bilgar ʕimlaaqan tabluʁu qaamatu-hu Ɂal-sittata Ɂaqdaam, qad raɁaa-ni Ɂafqudu waʕyi ʕalaa haaδa-l-maʃhadi. “‘Bilgar,’ who was a six foot giant, saw me faint in front of this scene.”
Similar occurrences of qad are numerous in the corpus, and most of them seem to be there to facilitate transition from one sentence to the other. The great majority of these occurrences of qad are best translated either with a subject pronoun co-referential with an NP in the previous sentence or by a relative pronoun in a non-restrictive relative clause. Of course, qad is often intended for tense/aspect and modality purposes, and there are many instances of that usage in the corpus. A
clear example is (12i) below where the verb following qad expresses an activity that is already completed and where qad seems to be intended for both assertive and aspectual purposes. It is interesting to compare (12i) to (12ii) where the latter expresses some sort of a habitual activity or general truth and where the time of predication has no bearing on the proposition expressing the general truth. In (12ii), the speaker is trying to explain that the process of forming a word such as Ɵawb “garment” resembles the process of weaving it. Note that qad can be deleted in this sentence without affecting grammaticality, but note also that it is part of a Ɂanna +NP+qad+verb in the perfective collocation/colligation in which there is a co-selection of specific lexical items and grammatical classes. (12)
i. Ɂam yuʕaddu Ɂal-ḥafiifu min-al-ḥarfi wa-l-ḥarfu qad Ɂiktamala bi-tamaami-l-ḥabsi wal-Ɂinfiʒaari “Or should aspiration be considered part of the consonant when the consonant has (already) been completed after closure and release?”
ii. Ɂamaa taʕrifu yaa Ɂabaa biʃr Ɂanna Ɂal-kalaama Ɂismun waaqiʕun ʕalaa ɁaʃyaaɁin qad ɁiɁtalafat bi-maraatiba, wa-taquulu bi-l-maƟali: haaδa Ɵawbun… “Don’t you know, Abaa Bichr, that language (speech) is naming things that are gradually composed, you say, for example, ‘this is a garment…’”
In summary, there is no room in this paper to provide all the occurrences of bare qad, but the examples given are fairly representative and provide support for the following observations:
a) A great many of the occurrences of qad are in collocations of the type kaan(…)qad+perfective, Ɂanna +NP+qad+verb in the perfective or Ɂinna +NP+(adverb, PP)+qad+verb in the perfective. These collocations account for practically all the occurrences of qad in newspapers and scholarly essays in present-day usage. Other collocations involving qad exist as well but are less frequent.
b) The particle qad is not necessarily employed for tense, aspect or modality. It is often used as a complementizer or a subordinator to facilitate sentence embedding or for smooth transition between sentences.
3.2.2 wa-qad
The situation is a lot less complex with wa-qad, as it is used predominantly (84% of the cases) in sentence-initial position. Sentence-initial position does not necessarily mean after a comma or a period, even in texts where punctuation is provided. In newspaper articles this is practically the only context where wa-qad is used. Another context where we find wa-qad is when it functions as a subordinator, generally introducing an adverbial (circumstantial) clause (13% of the cases). The remaining few occurrences are either wa-qad kaana collocations or instances where the context provided by the concordance line does not permit elucidation of its exact function.
Some examples of wa-qad in sentence-initial position are given in (13) below. In examples (13iii), (13iv), and (13v), where wa-qad does not follow a period, its position in the English translations corresponds to the underlined conjunction “and”. The rest of the examples (13vi to 13x) illustrate collocations with wa-qad. haaδa wa-qad “besides” and xaaṣṣatan (xuṣuuṣan) wa-qad “especially that” are frequent in present-day Arabic, especially in newspapers. Ɂammaa wa-qad “now that” is relatively less frequent. Ɂillaa wa-qad, which has no obvious translation out of context, is only attested in classical literature in this corpus. (13)
i. wa-qad taɁassasat Ɂal-raabiṭatu munδu sittati Ɂaʕwaamin wa taḍummu fi ʕuḍwiyyati-haa… “The league was founded six years ago and includes in its membership…”
ii. wa-qad ɁanʃaɁat tuunis fi-l-Ɂaʃhuri Ɂal-maaḍiyyati maʒlisan Ɂaʕlaa…. “Tunisia has set up, in the last few months, a supreme council…”
iii. humaa fi-l-ḥaqiiqati buʕdun waaḥidun, wa-qad ẓahara haδa Ɂal-mafhuum maʕa naẓariyyati Ɂan-nisbiyyati Ɂal-muḥaddada “They are actually one dimension, and this concept appeared with special relativity theory.”
iv. Ɵumma ḥallalnaa-ha kaamilatan wa-qad makkanatna Ɂal-Ɂistintaaʒaatu min Ɂidraaki Ɵaʁaraatin naaʒimatin… “We then analyzed all of them and the findings allowed us to note some shortcomings resulting from…”
v. wa ʃaaʕa ṣiitu-hu fi Ɂuruuba wa fi Ɂamariika wa-l-ṣiin wa-qad ʕurifa xaaṣṣatan bi-kitaabi-hi Ɂal-ʃahiir … “He became famous in Europe, America and China and was especially known for his famous book…”
vi. haaδa wa-qad mazzaqa Ɂal-qaδδaafi nusxatan min-al-qaanuuni Ɂal-ʒadiidi “Besides, Qadhdhaafi has torn out a copy of the new legislation.”
vii. Ɂammaa wa-qad Ɂaṣrartumaa ʕalaa Ɂal-raḥiili, fa-Ɂinni sa-Ɂaamuru… “Now that you have insisted on leaving, I will order (instruct)….”
viii. wa natamanna la-haa naʃran qariiban, xaaṣṣatan wa-qad raɁayna Ɂal-baaḥiƟiina fi-l-ṣawtiyyaati Ɂal-ʕarabiyyati… “We hope it will soon be published, especially that we saw researchers in Arabic phonetics…”
ix. … manaaṭiqa raʁma Ɂittissaaʕi-ha ẓallat maḥruumatan, xuṣuuṣan wa-qad faʃalat firqatu Ɂafaaquṣ Ɂal-qaarrati fi ɁadaaɁi haaδihi Ɂal-muhimmati. “…areas which, despite the importance of their size, continued to be deprived, especially after the permanent Sfax (theatrical) group had failed in fulfilling this mission.”
x. wa maa wahaba Ɂallaahu Ɂal-ʕaqla li-Ɂaḥadin Ɂillaa wa-qad ʕaraḍa-hu li-Ɂal-naʒaati wa laa ḥalaa-hu bi-l-ʕilmi Ɂilla wa-qad daʕaa-hu Ɂila Ɂal-ʕamali bi-ʃaraaɁiṭi-hi “When God grants reason to someone then he will certainly lead him to safety and when he graces him with science, he is necessarily requiring him to obey its methods.”
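The contexts distinguished in (13) suggest a first-pass sorting of concordance lines by what immediately precedes wa-qad. The following is only a sketch under stated assumptions: the set COLLOCATION_HEADS is taken from the collocations named in this section, the function name waqad_context is hypothetical, and a line is assumed to start at a sentence boundary. Because sentence-initial position does not always coincide with punctuation, lines labeled "medial" would still have to be read individually.

```python
# Words that form collocations with a following wa-qad (from this section).
COLLOCATION_HEADS = {"haaδa", "Ɂammaa", "xaaṣṣatan", "xuṣuuṣan", "Ɂillaa", "Ɂilla"}


def waqad_context(line):
    """Rough label for the position of the first wa-qad in a line."""
    # Drop surrounding punctuation and discard tokens that become empty.
    tokens = [t.strip(",.;:…") for t in line.split()]
    # Treat the unhyphenated spelling waqad as wa-qad.
    tokens = ["wa-qad" if t == "waqad" else t for t in tokens if t]
    if "wa-qad" not in tokens:
        return "no wa-qad"
    i = tokens.index("wa-qad")
    if i == 0:
        return "line-initial"
    if tokens[i - 1] in COLLOCATION_HEADS:
        return "collocation"
    return "medial"


print(waqad_context("wa-qad taɁassasat Ɂal-raabiṭatu munδu sittati Ɂaʕwaamin"))
print(waqad_context("Ɂammaa wa-qad Ɂaṣrartumaa ʕalaa Ɂal-raḥiili"))
print(waqad_context("Ɵumma qaal li-Ɂaxii-hi wa-qad waḍaʕa yada-hu"))
```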
Let us now consider the occurrences of wa-qad in (14) below, where it is not in sentence-initial position and where the underlined words in the English sentences are possible translations for wa-qad. The first observation is that wa-qad cannot be deleted in any of these examples without reorganizing the sentences. Qad, but not wa, may perhaps be deleted in (14vii). Second, what follows wa-qad in these constructions, especially in (14i) to (14vii), is an adverbial clause (a circumstantial clause, or ḥaal in traditional Arabic grammar terminology), which is a fully formed sentence but has the characteristics of complement sentences in the sense that it can be neither interrogative nor imperative. Thus, here too wa-qad functions as some sort of complementizer that can be translated in English by “when, while, after, before” or possibly a relative pronoun ((14vi) and (14vii)). In (14viii) and (14ix), the expressions set off by commas are parenthetical, but wa-qad cannot be removed. Example (14x), Ɂammaa wa-qad raɁaytu-ki fa-, includes the collocation Ɂammaa wa-qad+perfective mentioned above, in which wa-qad is an obligatory constituent.
(14)
i. maa ɁakƟara maa kaana yastamiʕu li-l-qaariɁati wa-qad ḥamala Ɂamiinata bayna δiraaʕay-hi “He often used to listen to the reader (while) holding Amina in his arms.”
ii. Ɵumma qaal li-Ɂaxii-hi wa-qad waḍaʕa yada-hu ʕalaa katifay-hi “...then he said to his brother while putting his hand on his shoulder…”
iii. wa Ɂummi fi rukni-ha tastamiʕu Ɂilaa Ɂal-Ɂaxbaari yarwuuna-haa wa-qad tahallalat Ɂasaariiru waʒhi-haa “My mother, in her corner, is listening to the news being read with her face lines shining (bright facial expression).”
iv. wa kayfa tamʃi wa-qad ʒaʕalta fi baṭni-ka maa yaḥmilu-hu ʕiʃruuna raʒulin “How can you walk when you had put in your stomach what needs twenty men to be carried.”
v. wa-Ɂittaʒaha Ɂal-ʒamiiʕu Ɂila “Ɂat-tribunaal” wa-qad Ɂaḥaaṭat bi-hi quwwaatu Ɂal-ʒandarma wa-Ɂal-buuliis “They all headed for the court which (while it) was surrounded by national guard and police forces.”
vi. Ɂaktubu naṣṣan ʕalaa lisaani Ɂal-Ɵuʕbaani wa-qad Ɂadxala-hu Ɂal-ʕammu maḥfuuẓ ʒiraaba-hu “I am writing an essay in the name of the snake which uncle Mahfuuz has put in his bag.”
vii. wa humaa Ɂaxawaaya Ɂal-Ɂakbaru minnii diib wa-qad Ɂaṣbaḥa fiimaa baʕd Ɂadiib wa haykal “They are my two older brothers Diib, who (and he) later became Adiib, and Haykal.”
viii. Ɵumma taqaddamat Ɂilay-hi waḥda-ha, wa-qad ʒaaʕa, fa-ʁaḍiba wa qaama min makaani-hi naḥwa-ha fa-qaala la-ha “Then she approached him by herself, and he was hungry, so he became angry and got up to walk towards her and said to her:”
ix. fa-Ɂal-laahu ʕinda-hum manaḥa Ɂal-Ɂinsaana, wa-qad xalaqa-hu, ʕaqaaran yaḍmanu la-hu ʕaafiyyatan daaɁimatan “For them, God provided man, when (given that) he created him, with a drug that will guarantee him good health for life.”
x. laqad Ɂaḥbabtu Ɂal-Ɂaanisa koniikand Ɂammaa wa-qad raɁaytu-ki fa-Ɂinnii Ɂaxʃaa Ɂallaa Ɂuhibba-haa baʕdu “I loved Miss Konikand but after seeing you (now that I have seen you) I am afraid I don’t love her anymore.”
These structures then illustrate the fact that wa-qad cannot be attributed the same function in sentence-initial position and in
embedding contexts. In my opinion, it is highly unlikely in Modern Standard Arabic that sentence-initial wa-qad+perfective serves any purpose other than being a filler that a writer resorts to almost automatically in order to start a sentence. Inside a sentence, however, the particle plays a major role, namely in making sentence embedding possible.
One of my colleagues, who has been teaching Arabic lexicology and terminology at the university for many years, was asked to translate a book from French into Arabic, which he did to the great satisfaction of those who paid for the job, except for one small detail. He was kindly requested to go over his translation again with a view to removing some of the too many (wa)-qads.
There are, of course, various contexts where the function of wa-qad is mainly temporal, aspectual or assertive. In (15i) and (15ii) below, qad may be deleted without affecting grammaticality, if wa is left. In that case, sentence (15i) loses the emphasis provided by qad “did die”; (15ii) also loses emphasis and, at the same time, the message that the activity of emptying the plate had already been completed is no longer explicitly expressed. (15)
i. li-δalika ʕazama ʕalaa taḥṭiimi qafaṣi-hi Ɂal-ṣaxriyyi Ɂaw yamuut wa-qad maata. “For that reason, he decided to destroy his stone cage or die, and he did die.”
ii. wa Ɂinna-maa hiyya laḥaẓaatun laa tataʒaawazu rubʕa Ɂal-saaʕati wa-qad fariʁa ma kaana fi-Ɂal-ṭabaqi “They were a few moments, no more than a quarter of an hour, before the plate was emptied.”
4. Conclusion
When I started this investigation, I had no preset ideas on the function and use of the lexical items I decided to examine other than my high school Arabic and whatever intuition may have resulted from that training a long time ago. I have learned, using the British National Corpus,4 and later from my own work on Arabic and that of others on English, that one’s intuition and the description of a language found in dictionaries and grammar books (which are also mainly based on introspection) do not provide the whole story. Extensive corpus research can help update language descriptions, since language is continually changing, and on many occasions corrects the intuition of the unbiased researcher by providing key information on the nature, function and structure of language. As such, corpus-based analysis is not a linguistic theory, but one of its major empirical tools. Having examined all the concordance lines for the two words Ɂawʃaka and (wa)-qad, I have learned, among other things, that:
(1) Both of them occur mainly as parts of typical constructions, whether collocations or colligations, where the choice of lexical items and grammatical class is constrained.
(2) Ɂawʃaka has a preference for words with undesirable connotations and a semantic prosody implying the coming about of a situation that is not welcome or the end of a process leading to an undesirable state of affairs.
(3) Bare qad occurs mainly in two major collocations, especially in present Modern Standard Arabic, and may also be used elsewhere as a complementizer introducing complement sentences.
(4) wa-qad, if not used to start a new sentence, may also function as a complementizer introducing adverbial clauses.
(5) Although (wa)-qad is described in the literature I am aware of as having temporal, aspectual or assertive functions, the corpus data show that it is often either confined automatically to some position in the sentence, in a collocation/colligation, or required by the syntax for clause embedding.
(6) Some of the meanings and functions of these words given by dictionaries are not found in this corpus.
Most of these observations are not available in dictionaries or, to my limited knowledge, in syntactic analyses.
4 “How to Use Corpora in Lexicography”, a workshop organized by John Sinclair and others in the Tuscan Word Center (Italy) in October 2000.
REFERENCES
Bahloul, Maher. 1994. The Syntax and Semantics of Taxis, Aspect, Tense and Modality in Standard Arabic. Ithaca: Cornell University.
Ghazali, Salem & Abdelfattah Brahem. 2001. “Dictionary Definitions and Corpus-based Evidence in Modern Standard Arabic”, Arabic NLP Workshop, ACL/EACL, Toulouse.
Hanks, Patrick. 2000. “Immediate Context Analysis: Distinguishing meanings by studying usage”. The Tuscan Word Center Workshop on Lexicography.
Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
_____. 1996. “The Search for Units of Meaning”, TEXTUS 9.1, 75-106.
_____. 1998. “The Lexical Item”. Contrastive Lexical Semantics ed. by E. Weigand. Amsterdam: John Benjamins.
_____. 1999. “A Way with Common Words”. Out of Corpora ed. by H. Hasselgard & S. Oksefjell. Amsterdam and Atlanta: Rodopi.
Tognini Bonelli, Elena. 2000. “Functionally Complete Units of Meaning across English and Italian: Towards a corpus-driven approach”. Lexis in Contrast ed. by B. Granger & B. Altenberg. Amsterdam and Philadelphia: John Benjamins.
Wehr, Hans. 1976. A Dictionary of Modern Written Arabic. Ed. by J. Milton Cowan. Ithaca: Spoken Language.
LEARNING ARABIC MORPHOLOGY USING STATISTICAL CONSTRAINT-SATISFACTION MODELS1
Paul Rodrigues & Damir Ćavar
Indiana University
1 The authors would like to thank Stuart Davis for pointing us to the phonological literature mentioned in this paper, Robert F. Port and Katherine Tippetts for their comments on presenting the results, and an anonymous reviewer.
1. Introduction
The morphology of Arabic has been a difficult problem for unsupervised morphological analysis systems. Typical solutions to the analysis of Semitic morphology involve rules and grammar machines that, by necessity, are nearly as complicated as the morphology they are trying to discover. Furthermore, most of these systems incorporate fixed lexicons that would require many hours of labor, broad knowledge of Arabic, and lexicographic experience to replicate. Additionally, many of these solutions incorporate linguistic knowledge, and are of little theoretical interest.
Arabic words are constructed by a root and pattern-based morphological system, where the root represents a semantic field and the pattern represents syntactic information, such as voice, transitivity, or intensity. There are over 5,000 Arabic roots, which can be 3, 4, or 5 characters in length, with the shorter roots being the more common. The 3, 4, and 5 character roots each have different pattern systems. For example, McCarthy (1979) contains a table showing 72 patterns for triliteral roots, and 24 patterns for quadriliteral roots.
Root morphemes occur with varying degrees of regularity. Sound roots are the most perfect, with the three radicals of the root appearing in the surface form of the word. Doubly weak roots are the least perfect, in which only one radical can be found in the word (Mace 1998:26103).
Arabic also has concatenative morphology. Particles such as the definite article (al), the conjunction (wa), or pronouns can be attached to the stem, as can a morpheme representing grammatical gender. Concatenative morphemes that represent natural gender, person, and number exist. Additionally, case endings such as nominative (u), accusative (a), and genitive (i) may appear attached to nouns in the formal language.
Reduplication occurs, but is not reported to be a productive process. McCarthy (1979) discusses several examples, such as waswas “whisper” and mishmish “apricot”.
There has been a controversy over the past several years as to the status of the root as a morpheme. Aphasia studies (Prunet et al., 2000) and hypocoristic analyses (Davis & Zawaydeh 2001) have shown evidence for the root being considered a morpheme, while the work of others, such as McOmber (1995) and Ratcliffe (1997), has shown that words are derived from a stem morpheme. We would like to point out that root identification is still necessary for lexicography and information retrieval, regardless of its morphemic status in the speaker’s lexicon.
The model we propose is statistical and constraint-based. It approaches Arabic morphological parsing with a segmental approach, in accordance with current linguistic theory. Learning occurs incrementally, and we adapt our grammar with each new word. We track the accuracy of our algorithm at each word, allowing us to see how well the algorithm learns. Though bootstrapping with a dictionary would most certainly aid the algorithm, we include none.
2. Prior Work
Most of the work in Arabic morphological parsing has used the finite-state approach. Beesley & Karttunen (2000) and Beesley (1996) describe some of the solutions spearheaded by their team at XEROX. The systems described are extremely complicated machines, requiring many hand-coded rules.
Though these papers are clear in theory, with clear explanation of architecture, we are never shown quantitative evaluations on Arabic data.
There have been several successful approaches using dictionaries. One such approach, which combines a dictionary with statistics, is Sebawai (Darwish 2002). Sebawai uses a training set of word-root pairs in order to bootstrap the learning of the root, and to construct a list of prefixes and suffixes. An additional dictionary, a small list of particles, was supplied to the parser. Sebawai reached 92.7% precision; no recall was reported.
One more recent approach is the Buckwalter Arabic Morphological Analyzer (Buckwalter 2002). This is an extremely accurate morphological parser, scoring as high as 99.25% precision (no recall was reported) (Maamouri et al. 2004). This system relies on a large lexicon of 548 prefixes, 906 suffixes, and 78839 stems, as well as thousands of rules stored in morpheme compatibility tables.
There has been a statistical approach designed for Hebrew root morphology with motivations similar to ours (Daya et al. 2004). Their best results came from using a Hidden Markov Model (HMM) for each radical, trained on a corpus of manually tagged roots, scoring 80.90% precision with 88.16% recall. Their system is also a constraint-based ranking system, but our systems diverge in that our approach is entirely based upon statistical co-occurrence of the root radicals, and not any sort of machine (such as an HMM), and we do not manually mark the roots in the learning phase. Additionally, they included a database of suffixes in the experiment that produced the best results. Once the root of the word has been discovered, the suffix is checked for validity; if it is valid, a higher ranking is awarded. Though their results are impressive, we view this suffix-checking approach to be supervision. Without this suffix-dictionary, their system only reaches 59.83% precision and 57.98% recall. These numbers are still the results of a supervised approach, however, as they still use a root-dictionary for learning.
Though Buckwalter’s system has high precision, it requires a massive lexicon. Darwish’s and Daya et al.’s approaches pride themselves on the ability to bootstrap from small dictionaries, but these are not realistic models of natural linguistic acquisition. Unsupervised approaches have had much greater success in the domain of concatenative morphology acquisition.
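| Supervision | Approach | Results |
|---|---|---|
| With dictionary (supervised) | Darwish | P=93% |
| With dictionary (supervised) | Daya et al. (Suffix + Root Dict.) | P=81%, R=88% |
| With dictionary (supervised) | Daya et al. (Root Dict. Only) | P=60%, R=58% |
| With dictionary (supervised) | Buckwalter | P=99% |
| Without dictionary (unsupervised) | Rodrigues, Ćavar | P, R=75% |
| Without dictionary (unsupervised) | Elghamry | P, R=92% |

Table 1. Summary of different computational approaches to Arabic root parsing (P = Precision, R = Recall).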
John Goldsmith’s Linguistica approach (2001) showed very good results using a Minimum Description Length (MDL) analysis, reaching 85.9% precision and 90.4% recall averaged across English and French corpora. Goldsmith is quick to point out, however, that “some of the assumptions made in the implementation restrict the useful application of the algorithms to languages in which the average number of affixes per word is less than what is found in such languages as Finnish, Hungarian, and Swahili, and we restrict our testing in the present report to more widely studied European languages.” Restricting the test set to languages that are not only purely concatenative but also have a low average number of morphemes per word greatly simplifies the problem. The Linguistica approach postulates that there is only one morphological split point, and that all words have a stem and an affix. Essentially, each character offers a Boolean decision as to whether or not it is the terminal character of the stem morpheme. A word eight characters long, with purely concatenative morphology, offers only eight possible split points, so a random split has a 12.5% chance of being correct. A random selection of a three-character Semitic root from an eight-character string, however, has only about a 1.8% chance of being correct, since there are 56 ways to choose three of the eight character positions.

The experiments performed by Creutz & Lagus (2002) on Finnish text demonstrate how difficult it is for an unsupervised morphological system to make multiple splits within a word. In those experiments, Linguistica parsed only 43.1% of the words correctly and a further 24.1% partially. Creutz & Lagus’ own MDL algorithm performed only moderately better.
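The chance-level figures quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are invented here, and it assumes, as the text does, that an eight-character word offers eight candidate split points and that a root is an ordered selection of three character positions.

```python
from math import comb


def chance_single_split(word_length: int) -> float:
    """Probability that one randomly chosen split point is the right one."""
    return 1 / word_length


def chance_random_root(word_length: int, radicals: int = 3) -> float:
    """Probability that a random in-order choice of `radicals` positions is the root."""
    return 1 / comb(word_length, radicals)


if __name__ == "__main__":
    print(f"single split: {chance_single_split(8):.1%}")
    print(f"triliteral root: {chance_random_root(8):.1%} (1 in {comb(8, 3)})")
```

With eight characters there are 56 possible position triples, so the root-guessing baseline is 1 in 56, just under 1.8%, against 1 in 8 for a single split point.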
3. A Statistical Approach

3.1 Root morphology
In the algorithm we present here, the root system is learned by comparing frequency statistics. Evidence is weighed for and against the hypothesis that a given set of three characters is the root morpheme. Positive evidence includes: the summation of the ratios between a letter being a root and being an affix, the summation of the frequency with which a letter has shown up as a possible root, and the summation of the probabilities that the letter belongs to the root. This is then divided by the negative evidence: the summation of the probabilities that the letter is an affix and the summation of the frequency with which a letter has shown up as a possible affix. This ratio yields a “score” for the triliteral. The triliteral with the highest score is taken to be the root of the word. If there is a tie, the most frequent triliteral is chosen (Elghamry 2004). Elghamry’s 2004 paper described this as a two-pass algorithm, not an online-learning algorithm.

Constraints were added to the triliteral root algorithm to reduce the search time. We considered only triliteral roots. By requiring a linear distance of five characters between the first and last letter of the root, and a distance of no more than three characters between the radicals, numerous unlikely character combinations were rejected. While incorporating these rules adds supervised knowledge to the system, removing the constraints does not significantly affect the results.

All characters between the first and last radicals of the root, except for the middle radical, are considered the template. All characters outside the end radicals are considered concatenative morphology. For example, in the word kitaabi “my book”, ktb would be labeled the root, the interior i_aa would be our template, and the final i would be labeled as possible concatenative morphology. The output of the program represents this as Xi.
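The candidate filtering and the root/template/affix split described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are invented, the frequency-based scoring is left out, and the two distance constraints are interpreted (as an assumption) as differences between character indices, with at most five between the end radicals and at most three between neighbouring radicals.

```python
from itertools import combinations

MAX_SPAN = 5  # assumed reading: index distance between first and last radical
MAX_GAP = 3   # assumed reading: index distance between adjacent radicals


def candidate_roots(word):
    """Yield index triples (i, j, k) that satisfy both distance constraints."""
    for i, j, k in combinations(range(len(word)), 3):
        if k - i <= MAX_SPAN and j - i <= MAX_GAP and k - j <= MAX_GAP:
            yield (i, j, k)


def decompose(word, triple):
    """Split a word into root, vowel template and concatenative residue."""
    i, j, k = triple
    root = word[i] + word[j] + word[k]
    # Template: material between the end radicals, middle radical shown as "_".
    template = word[i + 1:j] + "_" + word[j + 1:k]
    # Residue: material outside the end radicals; root plus template become X.
    residue = word[:i] + "X" + word[k + 1:]
    return root, template, residue


if __name__ == "__main__":
    word = "kitaabi"
    print(len(list(candidate_roots(word))), "candidate triples survive")
    print(decompose(word, (0, 2, 5)))
```

For kitaabi, the triple at positions 0, 2 and 5 passes both constraints and decomposes into the root ktb, the template i_aa and the residue Xi, matching the example in the text. Choosing among the surviving candidates is the job of the scoring ratio, which is not modelled here.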
This definition of the template does not perfectly match the one used by McCarthy (1979), as his analysis introduced several templates that contain phonemes outside the end root radicals. In our system, such material would be learned as concatenative morphology.

3.2 Concatenative morphology
The concatenative morphology algorithm is broken up into two sub-systems, GEN and EVAL. GEN generates possible morphological segmentations for a word. EVAL then takes these hypotheses and evaluates them according to several metrics and constraints (Ćavar et al. 2005).

GEN uses Alignment-Based Learning to generate morpheme hypotheses. At every new word, we generate only the hypotheses based upon alignment with previously learned morphemes.

EVAL uses a constraint-based voting architecture to determine the optimum segmentation over various memory and processing constraints. The metrics we include are:

MINIMUM DESCRIPTION LENGTH
MINIMIZE KULLBACK-LEIBLER DIVERGENCE
MINIMIZE RELATIVE ENTROPY
MAXIMIZE MUTUAL INFORMATION
MAXIMIZE FREQUENCY
MAXIMIZE MORPHEME LENGTH
The Minimum Description Length Principle (Grünwald 1996) allows us to constrain our grammar hypotheses to those that increase our grammar size the least. This allows for a smaller memory footprint and faster recall speed. For each word, we calculate the Kullback-Leibler Divergence (KLD) (MacKay 2003). KLD tells us how much our grammar will increase in size if we add our hypothesis; the result is measured in bits. The function q represents the probability mass function (pmf) of the original grammar, and the function p represents the pmf of the new grammar. The variable x represents the currently processed token.
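As a rough illustration of the quantity involved, the standard textbook form of the divergence, D(p || q) = Σ p(x) log2(p(x) / q(x)), can be computed as below. The direction of the divergence, the function name and the toy distributions are assumptions made for this sketch; the exact formula used in the system is not reproduced in this text.

```python
from math import log2


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    p and q map tokens to probabilities. Here p stands for the new grammar and
    q for the original one; q must give non-zero mass to every token in p.
    """
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)


if __name__ == "__main__":
    q = {"a": 0.5, "b": 0.5}    # toy pmf of the original grammar (invented)
    p = {"a": 0.75, "b": 0.25}  # toy pmf after adding a hypothesis (invented)
    print(round(kl_divergence(p, q), 4), "bits")
```

A hypothesis that leaves the distribution unchanged costs zero bits, so under a minimization constraint it would be preferred over one that shifts probability mass.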
We have also included a variant of KLD called Relative Entropy (RE), which calculates the conditional probability between uni- and bigram sequences. Variables y and x are our string tokens.
Mutual Information (MI) tells us the dependence between one morpheme hypothesis and another. We weight the result by the probability of co-occurrence. The frequency-weighted MI of al+kitAb is computed in the following example.
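A frequency-weighted pointwise mutual information of this kind can be sketched as p(x, y) · log2(p(x, y) / (p(x) p(y))). The formula below is the usual textbook form and an assumption about the system's exact definition; the counts for al and kitAb are invented purely for illustration and are not results from the paper.

```python
from math import log2


def weighted_mi(count_xy, count_x, count_y, total):
    """Pointwise MI of two morphemes, weighted by their co-occurrence probability."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return p_xy * log2(p_xy / (p_x * p_y))


if __name__ == "__main__":
    # Hypothetical counts: 1000 tokens, al seen 200 times, kitAb 10 times,
    # and the sequence al+kitAb 8 times.
    print(weighted_mi(count_xy=8, count_x=200, count_y=10, total=1000))
```

With these made-up counts the pair occurs four times more often than independence would predict, giving 2 bits of pointwise MI, which the weighting scales down by the co-occurrence probability of 0.008.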
Each constraint has a vote, and the hypothesis with the most votes wins. Constraints can be adjusted in importance by weighting parameters. For integration with the Semitic root parser, these weights have been adjusted to account for the frequency of the X root placeholder: MAXIMIZE LENGTH was set to 1.5 and MAXIMIZE FREQUENCY to 2.5, while the remaining constraints were set to 1.0. These weights were set on the basis of limited empirical testing; future work should include the ability to learn them.

This system for concatenative morphology has been shown to work well for languages with simpler morphology, such as English, but poorly for agglutinative languages, such as Uzbek. Languages such as Arabic that use interdigitation also perform poorly without the tiered approach discussed in this paper (Ćavar et al. 2005). This is due to the algorithm’s preference for a fixed substring root over which statistical dependence can be calculated. The tiered solution discussed in this paper reduces the unpredictable data, yielding excellent results on the concatenative analysis.

The reduplication module performs simple compression, looping through a string and searching for every substring match. It generates all ABC-pattern grammars of the word, and the shortest grammar string is chosen. For example, waswas “whisper” would return ABCABC, ABAB, and AA. AA would be chosen, as it represents the largest repeated pattern in the string. In the case of two reduplication grammar strings of equal length, the most frequent one is returned.
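The weighted voting among constraints described above can be sketched as follows. Only the weights (1.5, 2.5 and 1.0) come from the text; the function name, the tie-breaking rule and the two competing segmentations with their votes are hypothetical, and each constraint is assumed to have already picked its preferred hypothesis.

```python
from collections import defaultdict

# Weights as reported in the text; MAXIMIZE MORPHEME LENGTH is the constraint
# the text abbreviates as MAXIMIZE LENGTH.
WEIGHTS = {
    "MINIMUM DESCRIPTION LENGTH": 1.0,
    "MINIMIZE KULLBACK-LEIBLER DIVERGENCE": 1.0,
    "MINIMIZE RELATIVE ENTROPY": 1.0,
    "MAXIMIZE MUTUAL INFORMATION": 1.0,
    "MAXIMIZE FREQUENCY": 2.5,
    "MAXIMIZE MORPHEME LENGTH": 1.5,
}


def elect(votes, weights=WEIGHTS):
    """Return the winning hypothesis and the weighted tally.

    `votes` maps each constraint name to the hypothesis it prefers.
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    tally = defaultdict(float)
    for constraint, hypothesis in votes.items():
        tally[hypothesis] += weights[constraint]
    winner = max(sorted(tally), key=lambda h: tally[h])
    return winner, dict(tally)


if __name__ == "__main__":
    # Invented votes over two segmentations of a word with root placeholder X.
    votes = {
        "MINIMUM DESCRIPTION LENGTH": "wali+X",
        "MINIMIZE KULLBACK-LEIBLER DIVERGENCE": "wali+X",
        "MINIMIZE RELATIVE ENTROPY": "wa+li+X",
        "MAXIMIZE MUTUAL INFORMATION": "wa+li+X",
        "MAXIMIZE FREQUENCY": "wa+li+X",
        "MAXIMIZE MORPHEME LENGTH": "wali+X",
    }
    print(elect(votes))
```

In this toy case the unweighted vote is tied three to three, and the extra weight on MAXIMIZE FREQUENCY decides the outcome, which is the kind of effect the weighting parameters are meant to have.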
4. Learning
These experiments were performed on morphosyntactically correct words generated from the Buckwalter Arabic Morphological Analyzer database, a database of roots, prefixes, suffixes and combination rules (Buckwalter 2002). This database uses the Buckwalter transliteration system, a lossless Latin-based orthographic system for Arabic.
Two random numbers between 0 and 1 were generated for each word. When the first number was above 0.15, prefixation was allowed to occur; when the same held for the second number, suffixation occurred. A random prefix, root, and suffix were selected. If the prefix was allowed to combine with the root, the root was allowed to combine with the suffix, and the root was triliteral, the vowelized word was returned. The generated words had an average length of 8.1319 characters, which is slightly higher than the average in the Arabic Treebank (Maamouri et al. 2004). This method, though it does not ensure the word category distribution of Arabic, is necessary both to guarantee that our root is triliteral and to perform the online evaluation of the 10,000 words. Verifying the roots by running them through another morphological analyzer would introduce undesired error into the evaluations.

Preprocessing was done on the corpus to allow a better comparison with the other parsers. Alef maqsura and ya were conflated, as were hamza, alef madda, alef with hamza above, and alef with hamza below. These two rules were used by Darwish (2002). Alef wasla and alef were conflated as well. The shadda, the symbol representing gemination, was replaced by the letter immediately prior.

The learning charts displayed here track the progress of root learning over the 10,000 words. The lighter lines represent moving averages of precision over 50 words, and the darker lines represent a moving average of 50 data points on that curve. We find that this algorithm predicts fairly consistently over the course of the dataset, reaching approximately 75% precision after 10,000 words. Since we do not incorporate supervised knowledge of weak radicals, our score can never be perfect.

Our algorithm has a high preference towards negative evidence. Because of this, unvoweled text does not perform as well as fully pointed text. While requiring vowels is not desirable for information retrieval of text, the voweled words are a closer correlate to speech. The results for the corpus without short vowels appear below.

The concatenative morphology module was then fed the results of the root parse of the voweled corpus. Example output is shown below.

    (#wa _@0) [39 (((#liX$) 6) ((#biX$) 7)((#biAX$) 2) ((#biAlX$) 2) ((#yaX$) 4)... )]
    (#lilX$) [5 (((#AlX$) 1) ((#taX$) 2) ((#kaAlX$) 3) ((#Xi$) 1) ... )]
    (_@0 al$) [10 ( ((#liX$) 2) ((#wakaX$) 1) ((#X$) 5) ((#saX$) 1) ((#Xm$) 1))]
    (_@0 #taX$) [6 (((#fa) 1) ((#sa) 3) ((#wa) 2))]
    (#waliX$ _@0 n$) [1 (((u) 1))]
    (_@0 #AlX$) [10 (((#fa) 8) ((#ka) 1) ((#wa) 1))]
    (_@0 #liyuX$) [3 (((#fa) 3))]
    (_@0 #lituX$) [1 (((#fa) 1))]
    (@0 #taX$) [6 (((#fa) 1) ((#sa) 3) ((#wa) 2))]
Discovered morpheme signatures are within the left set of parentheses. Within the bracketed set, we find the morphemes that allow combination with the signature on the left, together with the number of times each morpheme was observed. For example, wa was discovered 39 times, and this morpheme co-occurred with liX six times, biX seven times, etc. X represents the stem discovered during the root stage.

Quantitative results and learning charts are difficult to produce for the concatenative morphology. Reduplication, for example, is not noted in the Buckwalter database. Additional allomorphy occurs, which makes string verification of our splits inaccurate.

5. Conclusions
An unsupervised statistical approach towards Arabic morphological learning is a viable one. We have shown that without a dictionary, and using only dependency statistics, Semitic root morphology can be predicted with 75% precision. Additionally, we have introduced an algorithm for Arabic concatenative morphology that shows usable results.

Our use of negative evidence in the root identification algorithm throws out the vowel template as possible root characters. This allows natural separation of a triliteral root and a vowel template. This also correlates with one of the most fundamental ideas in computational linguistics, “Zipf’s Law.” Zipf’s laws state that longer words contain more semantic information, and shorter words are more frequent. Additionally, clustering by length and frequency reveals distinct categories of open- and closed-class words. This is essentially how our Arabic root identification algorithm works at a morphemic level. Promiscuity separates the word into two tiers, one being a root template and the other being a more promiscuous vowel template. This frequency effect strongly supports the notion that the root template, and not the stem, is analogous to an open-class morpheme, while the vowel template is analogous to a separate closed-class and functional morpheme.

Future work should include a notion of confidence. This will allow a separation of precision and recall scores, as well as the ability to extend the algorithm to four- and five-radical roots. A quantitative analysis of the concatenative morphology must be performed as well.

An open source (free to download, modify, and distribute) implementation of the root algorithm described in this paper is available online at http://jones.ling.indiana.edu/~prrodrig/. The concatenative morphology system is available to researchers by contacting the authors.
REFERENCES

Beesley, Kenneth R. 1996. “Arabic Finite State Morphological Analysis and Generation”. Proceedings of the 16th International Conference on Computational Linguistics 1.89-94. Copenhagen.
Beesley, Kenneth R. & Lauri Karttunen. 2000. “Finite-State Non-Concatenative Morphotactics”. Proceedings of the 5th Workshop of the ACL Special Interest Group in Computational Phonology, 1-12. Luxembourg.
Buckwalter, Tim. 2002. Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium. LDC2002L49. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49.
Ćavar, Damir, Paul Rodrigues & Giancarlo Schrementi. 2005. “Unsupervised Morphology Induction for Part-of-Speech Tagging”. Philadelphia: U. Penn Working Papers in Linguistics 10.1.
Creutz, Mathias & Krista Lagus. 2002. “Unsupervised Discovery of Morphemes”. Proceedings of the 6th Meeting of the ACL Special Interest Group in Computational Phonology, 21-30. Philadelphia.
Darwish, Kareem. 2002. “Building a Shallow Morphological Analyzer in One Day”. ACL Workshop on Computational Approaches to Semitic Languages, 47-54. Philadelphia.
Davis, Stuart & Bushra Adnan Zawaydeh. 2001. “Arabic Hypocoristics and the Status of the Consonantal Root”. Linguistic Inquiry 32:3.512-530.
Daya, Ezra, Dan Roth & Shuly Wintner. 2004. “Learning Hebrew Roots: Machine learning with linguistic constraints”. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona.
Elghamry, Khaled. 2004. “A Constraint-based Algorithm for the Identification of Arabic Roots”. Proceedings of the 1st Midwest Computational Linguistics Colloquium. Bloomington, IN.
Goldsmith, John. 2001. “Unsupervised Acquisition of the Morphology of a Natural Language”. Computational Linguistics 27:2.153-198.
Grünwald, Peter. 1996. “A Minimum Description Length Approach to Grammar Inference”. Symbolic, Connectionist and Statistical Approaches to Learning for Natural Language Processing (Lecture Notes in Artificial Intelligence 1040) ed. by S. Wermter, E. Riloff & G. Scheler, 203-216. Berlin: Springer Verlag.
Maamouri, Mohamed, Ann Bies, Tim Buckwalter & Wigdan Mekki. 2004. “The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus”. The NEMLAR International Conference on Arabic Language Resources and Tools, 102-109. Cairo.
Mace, John. 1998. Arabic Grammar: A reference guide. Edinburgh: Edinburgh University Press.
MacKay, David J. C. 2003. Information Theory, Inference, and Learning Algorithms, Version 6.0. Cambridge: Cambridge University Press.
McCarthy, John. 1979. Formal Problems in Semitic Phonology and Morphology. Ph.D. dissertation. MIT. Distributed 1981 by Indiana University Linguistics Club.
McOmber, Michael. 1995. “Morpheme Edges and Arabic Infixation”. Perspectives on Arabic Linguistics VII ed. by Mushira Eid, 173-189. Amsterdam & Philadelphia: John Benjamins.
Prunet, Jean-Francois, Renee Beland & Ali Idrissi. 2000. “The Mental Representation of Semitic Words”. Linguistic Inquiry 31.609-648.
Ratcliffe, Robert. 1997. “Prosodic Templates in a Word-Based Morphological Analysis of Arabic”. Perspectives on Arabic Linguistics X ed. by Mushira Eid & Robert Ratcliffe, 147-171. Amsterdam & Philadelphia: John Benjamins.
van Zaanen, Menno M. 2001. Bootstrapping Structure into Language: Alignment-based learning. Ph.D. dissertation. The University of Leeds.
LEARNING TO USE THE PRAGUE ARABIC DEPENDENCY TREEBANK

Otakar Smrž, Petr Pajas, Zdeněk Žabokrtský, Jan Hajič, Jiří Mírovský, Petr Němec
Charles University in Prague, Institute of Formal and Applied Linguistics
1. Introduction
The Prague Arabic Dependency Treebank (PADT), recently published in its first version (Hajič et al. 2004a) by the Linguistic Data Consortium, is both a collection of multi-level linguistic annotations over Modern Standard Arabic and a suite of unique software implementations designed for general use in Natural Language Processing. The underlying theory of this resource is reviewed in Hajič et al. (2004b). In the present paper, we focus instead on the practical aspects of using the PADT data and the computational tools in original research.

1.1 Data survey
The corpus of PADT 1.0 consists of morphologically and analytically annotated newswire texts of Modern Standard Arabic, which originate from Arabic Gigaword (Graff 2003) and partly overlap with the plain data of Penn Arabic Treebank, Part 1 (Maamouri et al. 2003) and Penn Arabic Treebank, Part 2 (Maamouri et al. 2004). A rough survey of the annotations is given in Table 1.

Data sets AFP, UMH and XIN come from the earlier period of the project, when morphological annotations were not based on the MorphoTrees technology (cf. subsection 2.1). Therefore, the files recording the process of morphological disambiguation of these data could not be distributed. Still, the resulting morphological information is available in the analytical files, along with the analytical annotations.

The other data sets, namely ALH, ANN and XIA, are already complete and provide files of three different types: non-annotated text, MorphoTrees annotations, and analytical annotations. Information from the morphological level is also, as a prerequisite, propagated into the analytical level. Not all the data are processed on both levels, though.

| Data | Tokens [A] | Tokens [M] | Original Data Provider | News Period |
|---|---|---|---|---|
| AFP | 13 000 | - | Agence France Presse | 2000 / VII |
| UMH | 38 500 | - | Ummah Press Service | 2002 / I–III |
| XIN | 13 500 | - | Xinhua News Agency | 2003 / V |
| ALH | 10 000 | 73 500 | Al-Hayat News Agency | 2001 / IX |
| ANN | 12 500 | 25 500 | An-Nahar News Agency | 2002 / XI |
| XIA | 26 500 | 49 500 | Xinhua News Agency | 2003 / V |

Totals: 113 500 tokens on the analytical level, 148 000 tokens in MorphoTrees. Software and documentation: TrEd, Netgraph, Oraculum, Encode::Arabic.

Table 1. Survey of the contents of the Prague Arabic Dependency Treebank 1.0. Columns [A] and [M] represent the number of syntactic units, i.e. tokens, for the analytical level and MorphoTrees, respectively.
1.2 Annotation environment
The indispensable annotation environment for this and various other treebanking projects is the TrEd tree editor (Hajič et al. 2001), written in Perl/Tk. It is not only a fully programmable and customizable graphical user interface, but also an excellent suite of utilities for automated, optionally parallel, processing of the data (consistency checks and revising, batch conversions, search, difference evaluation, etc.). TrEd is documented on http://ufal.mff.cuni.cz/~pajas/tred/. We will explore some of its features in 4.2.

1.3 Treebank search engines
Netgraph (Mírovský & Ondruška 2002) is a client–server application for efficient searching in treebanks. Unlike TrEd, it provides the user with an easy-to-learn graphical query language that does not presume any programming skills. The client application is implemented in Java and is available on http://quest.ms.mff.cuni.cz/netgraph/.

Oraculum (Ljubopytnov et al. 2002) supports linguistically even more expressive queries, and operates through a sophisticated web browser interface, which is now being ported to Arabic.

1.4 Other tools
Next to several other linguistically significant solutions (cf. section 5), there is the Encode::Arabic module (Smrž 2003) for Perl, which supports miscellaneous modes of processing of the non-trivial, yet ingenious ArabTeX encoding notation of the Arabic script and/or its phonetic transcription (Lagally 2004). Encode::Arabic covers the Buckwalter transliteration as well.

2. Data Structures
The PADT annotations are distributed as UTF-8 encoded files in the FS format, which is documented on TrEd’s website. TrEd and the array of associated tools and libraries provide options for converting these data into several XML-compliant formats, and vice versa. TrEd’s graphical renderings can be printed as PostScript, PDF, or image files.

If independent data processing is desired, the files can best be accessed using the Fslib module for Perl, which is available in the distribution along with many other modules and scripts serving for data flow management, migration of annotations, updating and quality checking, difference evaluation or execution of systematic revisions.

The non-annotated textual data are provided in the original XML format of the Arabic Gigaword corpus.

2.1 Functional morphology & MorphoTrees
The morphological annotations of PADT used to directly employ the information produced by the Buckwalter Arabic Morphological Analyzer (Buckwalter 2002). With the introduction of Functional Arabic Morphology (Smrž in prep., Smrž et al. 2005), all morphological tags were mapped as closely as possible into the current positional notation, which represents individual grammatical categories in separate columns.
The new type of annotations required a different disambiguation tool. The flexibility of TrEd made it possible to design and implement MorphoTrees in it as a special annotation context (Smrž & Pajas 2004).
Figure 1. The hierarchy of MorphoTrees and their annotation using restrictions (cf. Smrž & Pajas 2004).
Figure 2. View of annotated paragraph. Note the levels of distinct information.
MorphoTrees is the idea of building effective and intuitive hierarchies over and among the input and output strings of morphological systems. It is especially interesting for Arabic and the functional morphology, but it is not limited to either of these.

Figure 1 illustrates how MorphoTrees organize the morphological information/analyses into a multi-level hierarchy. The leaves of these trees are the imaginable tokens with their tags as the atomic units, and the root is the input string being analyzed, or generally an entity (some tree of discourse elements). Rising from the leaves up, there is the level of lemmas of the lexical units, the level of non-vocalized standard orthographic forms, and the level of decomposition of the entity into a sequence of such forms, implying the number of tokens and their spelling.

As a convenient extension, the overall solutions of the annotations can also be viewed in a similar hierarchical structure. An example of such a paragraph tree is given in Figure 2.

2.2 Analytical dependency trees
Analytical annotations represent the surface syntax of the language in the dependency formalism outlined in Hajič et al. (2004b). They provide a link from morphology to tectogrammatics, the level of linguistic meaning in the Functional Generative Description theory (cf. Sgall et al. 2004).

The analytical level is modeled with dependency trees whose nodes map, one to one, to the tokens resulting from the morphological analysis and tokenization, and whose roots group the nodes according to the division of the discourse into sentences or paragraphs. Edges in the trees establish/reconstruct syntactic relations between the governor and the dependent, or rather, the whole subtree under and including the dependent. The nature of the government is expressed by the analytical functions of the nodes being linked.

In addition to this strict dependency structure, information of other kinds and character can be captured in the trees, while computational procedures for inferring any complementary information can be implemented independently of data.
In TrEd, resolution of grammatical coreference is automated in this manner. Identifying resumptive pronouns and deverbal inner objects by themselves is enough for some algorithm to find their grammatical counterparts and render these pairs.
In Figure 3, the instances of such non-dependency relations are shown with dashed arcs. Nonetheless, one might begin with Figure 4 for a more elementary example of an analytical tree.
Figure 3. Analytical tree featuring advanced phenomena like ellipsis of another predicate, deverbal inner objects in adverbial function, or composite auxiliary elements. Note the labels [ExD] (on otherwise coordinative expression), [Adv_Msd], [AuxY] / [AuxP] (compound preposition) or [AuxY] / [ExD], respectively.
3. Installation and General Setup
PADT 1.0 is distributed by the Linguistic Data Consortium, University of Pennsylvania, http://www.ldc.upenn.edu/. The PADT project has its own website, http://ufal.mff.cuni.cz/padt/, where the data and the tools are documented in detail, and from where updates and extensions to the distribution are available.
A user’s installation should start with TrEd / Perl, and might proceed with downloading the Netgraph client / Java. The software applications are platform independent, and it is relatively easy to set things up. Installation of the data management scripts and modules or the CVS repositories for the FS annotation files is optional. In order to search PADT with Netgraph, the client application must connect to a server accessing the data. Users are welcome to register with our Netgraph server, even though servers can also be run locally.
Figure 4. Analytical representation of the sentence of Figure 2, with displayed morphological tags. Note the topology and functions of the predicate and its participants (subject, direct and indirect objects), and consider differences among the distinct attributive modifications.
4. The Quest for Improper Annexation
Let us have a look into the annotated data. Linguists need to search for a particular phenomenon in the language, evaluate it, contrast it with some other phenomena, consider the contexts of usage, etc. The example case that we will explore in this section is improper annexation in Arabic.

A condensed definition of this phenomenon might not be precise, and we will not attempt one. Instead, we will pronounce and eventually refine our intuition that improper annexation is a genitive construction whose first term is an adjective, and whose second term is a [definite] noun (cf. for instance Schulz 2004:131–133, 140, 149).

We will, of course, use the treebank in order to test and improve the description of this notion. More importantly, we will learn about the applicability of PADT and its tools, and about some limitations.

4.1 Querying PADT with Netgraph
A query in Netgraph is a generalized subtree having the properties of the desired treebank structures specified as attributes of its individual nodes or edges. Queries can be created interactively through a graphical interface, or equivalently, they can be linearized in a bracketing-style notation, which we will use here.

    [tag=A?????????] ( [tag=N?????????,afun=Atr])

Figure 5. Netgraph query for the analytical level: a simple relation.
The example query in Figure 5 will return all occurrences of adjectives that have an attributive noun as one of their children. Such a relation is weaker than what improper annexation requires. In particular, the query ignores any constraints on word order, mutual distance, grammatical case and definiteness that we expect from a genitive construction. Still, it is perfectly fine to ask Netgraph again and more specifically, adding some attributes to the nodes and listing the acceptable combinations of morphological categories in the tags. This gradual ruling out of irrelevant solutions is a helpful practice.
Netgraph queries need not concern the analytical level only. The structures in MorphoTrees can be investigated as well. Consider the query of Figure 6, which says: look for the paragraph trees, i.e. those whose root (_depth=0) is of type ‘paragraph’, in which we are interested in two immediately succeeding token nodes on the lowest level (_depth=3) such that the first one is a non-indefinite adjective and the second one is a non-indefinite noun either certainly in genitive, or with the value for case unset. Recall Figure 2 for better visualization.

    [type=paragraph,_depth=0]
    ( [_transitive=true,_depth=3, _name=N1, type=token_node,
       tag=A????????C|A????????D|A????????R|A????????-] ,
      [_transitive=true,_depth=3, ord={N1.ord}+3, type=token_node,
       tag=N???????2C|N???????2D|N???????2R|N???????2-|
           N???????-C|N???????-D|N???????-R|N???????--])

Figure 6. Netgraph query for improper annexation in MorphoTrees.
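To make the tag alternatives in such a query easier to read, the sketch below translates them into regular expressions and applies them to a toy list of tokens. It is not part of Netgraph or TrEd: the function names are invented, the reading of ? as a single-character wildcard is an assumption, and the ten-character tags in the example are placeholders rather than real PADT annotations.

```python
import re


def tag_pattern_to_regex(pattern):
    """Compile a 'A????????C|A????????D'-style pattern, reading ? as any one character."""
    alternatives = []
    for alt in pattern.split("|"):
        alternatives.append("".join("." if ch == "?" else re.escape(ch) for ch in alt))
    return re.compile("(?:" + "|".join(alternatives) + ")")


ADJ = tag_pattern_to_regex("A????????C|A????????D|A????????R|A????????-")
NOUN = tag_pattern_to_regex(
    "N???????2C|N???????2D|N???????2R|N???????2-|"
    "N???????-C|N???????-D|N???????-R|N???????--"
)


def matches(regex, tag):
    """True if the whole tag is covered by one of the alternatives."""
    return regex.fullmatch(tag) is not None


def find_pairs(tokens):
    """Return (ord, ord) pairs of neighbouring tokens that satisfy the query.

    The offset of 3 between the ord values mirrors the condition in Figure 6.
    """
    hits = []
    for prev, cur in zip(tokens, tokens[1:]):
        if (matches(ADJ, prev["tag"]) and matches(NOUN, cur["tag"])
                and cur["ord"] == prev["ord"] + 3):
            hits.append((prev["ord"], cur["ord"]))
    return hits


if __name__ == "__main__":
    # Placeholder tokens; the tags are invented to exercise the patterns.
    tokens = [
        {"ord": 1, "tag": "A--------R"},
        {"ord": 4, "tag": "N-------2-"},
        {"ord": 7, "tag": "N-------1I"},
    ]
    print(find_pairs(tokens))
```

Spelling the alternatives out this way shows that the adjective pattern constrains only the last tag position, while the noun pattern constrains the last two.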
Upon submitting this query to the server, we receive much more precise hints of what improper annexation could be. But when browsing through the results in Netgraph and trying to determine which of them are appropriate cases and which are not, one usually cannot see enough context from the surrounding paragraphs, and cannot export the information flexibly enough to process it further. Nor can the data be edited directly, should one want to make corrections based on the search. How do we meet such requirements, then?

4.2 Searching and viewing in TrEd
TrEd, even in its graphical annotation mode, can work with filelists, by which we define the extent of the corpus where search operations are to take place. Besides the obligatory menu item ‘Node > Find ...’ for searching nodes by their attributes, there is the function ‘User-defined > PerlEval’, which executes a given piece of Perl code in the current environment of TrEd’s data structures.
The program in Figure 7 keeps iterating over the MorphoTrees data until the configuration of nodes discussed with Figure 6 is encountered. Then, the control returns to TrEd, which sets the cursor to the newly found occurrence of the hypothesized improper annexation.

    ChangingFile(0);

    ## $this represents the current node
    do {
        if ($this->root()->{'type'} eq 'paragraph') {
            $prev = undef;
            while ($this = $this->following()) {
                if ($this->{'type'} eq 'token_node') {
                    if (defined $prev
                        and $prev->{'tag'} =~ /^A........[CDR-]$/
                        and $this->{'tag'} =~ /^N.......[2-][CDR-]$/
                        and $this->{'ord'} == $prev->{'ord'} + 3) {
                        return;
                    }
                    $prev = $this;
                }
            }
        }
    } while NextTree() || NextFile();

Figure 7. TrEd evaluation code in Perl, equal to the query of Figure 6.
The program in Figure 8 is designed for the analytical level, where the dependency information, rather than immediate adjacency, can be exploited. The algorithm carefully finds the head of the genitive construction even if its tail actually consists of multiple genitives in (hierarchical) coordination or apposition (cf. Figure 9, ex. E). Plus, there are constraints on the morphological tags of the nodes in question, relaxed a little with respect to the tagset of the former disambiguation. It might be clear by now that this powerful mechanism of computing with trees can be abstracted from, and that the return instruction can be replaced with, say, printing out the current node’s address and some significant attributes of its neighbors, or with code for complex restructuring, or with simple counting. In fact, there are two important modifications of TrEd, named btred and ntred, with which almost every automatic processing, including searching, is done very quickly and conveniently.
LEARNING TO USE THE PADT
    ChangingFile(0);
    ## $this represents the current node
    do {
        while ($this = $this->following()) {
            if ($this->{'afun'} eq 'Atr'
                and $this->{'tag'} =~ /^N.......[23-][CDRX-]$/) {
                $head = $this;
                $head = $head->parent() while $head->{'parallel'} =~ /^(?:Co|Ap)$/;
                $head = $head->parent();
                return if $head->{'tag'} =~ /^A........[CDRX-]$/;
            }
        }
    } while NextTree() || NextFile();
Figure 8. TrEd evaluation code for finding improper annexation on the analytical level. Note how coordination/apposition nodes between the two parts of the genitive construction are treated. Values 3 and X in the tags reflect some systematic ambiguity present in the old data sets.
4.3 Improper annexation

Having applied the criteria of Figure 7 and Figure 8 to our treebank data, we certainly did not obtain only improper annexations! How can we tell? And why have we not come up with the right kind of queries? For the answer to the first question, we can refer to Schulz (2004) or Badawi et al. (2004). There are crucial semantic distinctions to make as to whether the adjectival head of the genitive construction logically qualifies the dependent noun, or whether this relation is reversed. Such information is present neither in morphology nor in analytical syntax. On the other hand, our queries do include some looseness. Ideally, the values of the relevant morphological categories should all be set. Then, the definiteness values for the head of a genitive construction could only be R (reduced) or C (complex), as we exemplify in Hajič et al. (2004b), and there would emerge other regularities that we could try to capture, or patterns that we could try to exclude. In Figure 9, we give several examples of true improper annexation that we have found, and compare them with another phenomenon that partly invades the set of search results due to the unset case information of the nominatives therein.
Figure 9: Contrasting improper annexation (examples A–F) with naʿt sababī (examples O–R). Note the patterns of definiteness or agreement in both of these phenomena (cf. e.g. Badawi et al. 2004:110–116).
Needless to say, preferring the recall of a query to its precision helps discover more inconsistencies or mistakes in annotation. The way we process the results in order to filter out false positives, like printing additional information, sorting and uniq-ing it, etc., is also important. In our current situation, roughly one out of six tips provided by the queries happened to be correctly classified as improper annexation.
Figure 10 summarizes the most interesting of these as observed in PADT, in its development version, which is still growing in size. Some of the phrases are rather idiomatic (cf. Wehr 1980), but what we notice is the actual freedom of expression and productivity of this linguistic construct. In the list, the heads of the annexations are lexicographically normalized, and the numbers in the rightmost column indicate the counts of occurrences within the treebank.

5. Applications and Prospects

The applicability of treebanks is very diverse. The annotated structures can be studied in the educational or purely linguistic framework, as we have just illustrated. The other prominent motivation is to use the data for machine-learning purposes, possibly aspiring to machine translation (cf. Čmejrek et al. 2004) or computational modeling of meaning. In the course of the PADT project, we have developed systems for automatic morphological and analytical disambiguation, a.k.a. tagging and parsing (cf. Hajič et al. 2004b, Hajič et al. 2005). This technology is going to be employed in the processing of the Arabic English Parallel News Part 1 (Ma 2004). Alternative automated annotation methods also come into question, like the parallel-corpus-based syntactic projection (Hwa et al. 2005) or the conversion of constituency annotations into dependencies (Žabokrtský & Smrž 2003; cf. Habash & Rambow 2004). We would also like to implement algorithms for detection of inconsistencies and errors in the annotations (cf. Dickinson & Meurers 2003). The PADT website will offer any eventual updates. The current distribution already includes scripts for safe and maximally efficient migration of annotations if some data need to be synchronized and the changes propagated across the levels of description.
Figure 10. Selected occurrences of improper annexation found on either level of the treebank.
6. Conclusion

We have tried to give a practical introduction to the Prague Arabic Dependency Treebank project, with emphasis on PADT 1.0 available to researchers worldwide. Having described the essential data structures in the treebank, we chose to search for and explore a particular linguistic phenomenon. We demonstrated the methodology for posing queries, and outlined how the information in the treebank might be processed in the general case. We have presented and discussed the most noteworthy instances of improper annexation in Arabic that we found in the treebank using this methodology. This is a significant result by itself, and would be extremely hard to achieve without the kind of annotations the treebank provides. We would like to invite others to try their own queries. Treebanking entails many challenging tasks, and we continue to approach them, as well as to improve the existing solutions.

7. Acknowledgements

The research described herein was supported by the Ministry of Education of the Czech Republic through projects LN00A063 and MSM113200006, and continues with the support from the Grant Agency of Charles University in Prague, project 207-10/203333. At the time of writing this paper, one of the authors was a grantee of the Fulbright-Masaryk Fellowship awarded by the Fulbright Commission in the Czech Republic. The ‘quest for improper annexation’ was first suggested by Tim Buckwalter, while Iveta Kouřilová helped us with understanding and presenting the topic. We would like to thank them very much as well.
REFERENCES

Badawi, Elsaid, Mike G. Carter & Adrian Gully. 2004. Modern Written Arabic: A comprehensive grammar. London: Routledge.
Buckwalter, Tim. 2002. Buckwalter Arabic Morphological Analyzer Version 1.0. LDC catalog number LDC2002L49, ISBN 1-58563-257-0. Linguistic Data Consortium, University of Pennsylvania.
Čmejrek, Martin, Jan Cuřín & Jiří Havelka. 2004. “Prague Czech-English Dependency Treebank: Any hopes for a common annotation scheme?”. HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotation, 47–54. Boston.
Dickinson, Markus & W. Detmar Meurers. 2003. “Detecting Inconsistencies in Treebanks”. Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003). Växjö.
Graff, David. 2003. Arabic Gigaword. LDC catalog number LDC2003T12, ISBN 1-58563-271-6. Linguistic Data Consortium, University of Pennsylvania.
Habash, Nizar & Owen Rambow. 2004. “Extracting a Tree Adjoining Grammar from the Penn Arabic Treebank”. Proceedings of Traitement Automatique du Langage Naturel (TALN-04). Fez.
Hajič, Jan, Barbora Hladká & Petr Pajas. 2001. “The Prague Dependency Treebank: Annotation structure and support”. Proceedings of the IRCS Workshop on Linguistic Databases, 105–114. University of Pennsylvania.
_____, Otakar Smrž, Petr Zemánek, Petr Pajas, Jan Šnaidauf, Emanuel Beška, Jakub Kráčmar & Kamila Hassanová. 2004a. Prague Arabic Dependency Treebank 1.0. LDC catalog number LDC2004T23, ISBN 1-58563-319-4. Linguistic Data Consortium, University of Pennsylvania.
_____, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, Emanuel Beška. 2004b. “Prague Arabic Dependency Treebank: Development in data and tools”. Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, 110–117. Cairo.
_____, Otakar Smrž, Tim Buckwalter & Hubert Jin. 2005. “Feature-Based Tagger of Approximations of Functional Arabic Morphology”. Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), 53–64. Barcelona.
Hwa, Rebecca, Philip Resnik, Amy Weinberg, Clara Cabezas & Okan Kolak. 2005. “Bootstrapping Parsers via Syntactic Projection across Parallel Texts”. Natural Language Engineering, June 2005.
Lagally, Klaus. 2004. ArabTeX: Typesetting Arabic and Hebrew. User Manual Version 4.00, Fakultät Informatik, Universität Stuttgart.
Ljubopytnov, Vladimír, Petr Němec, Michaela Pilátová, Jakub Reschke & Jan Stuchl. 2002. “Oraculum, A System for Complex Linguistic Queries”. SOFSEM 2002 Student Research Forum, 27–34.
Ma, Xiaoyi. 2004. Arabic English Parallel News Part 1. LDC catalog number LDC2004T18, ISBN 1-58563-310-0. Linguistic Data Consortium, University of Pennsylvania.
Maamouri, Mohamed, Ann Bies, Hubert Jin & Tim Buckwalter. 2003. Arabic Treebank: Part 1 v 2.0. LDC catalog number LDC2003T06, ISBN 1-58563-261-9. Linguistic Data Consortium, University of Pennsylvania.
Maamouri, Mohamed, Ann Bies, Tim Buckwalter & Hubert Jin. 2004. Arabic Treebank: Part 2 v 2.0. LDC catalog number LDC2004T02, ISBN 1-58563-282-1. Linguistic Data Consortium, University of Pennsylvania.
Mírovský, Jiří & Roman Ondruška. 2002. “Netgraph System: Searching through the Prague Dependency Treebank”. Prague Bulletin of Mathematical Linguistics 77.101–104.
Schulz, Eckehard. 2004. A Student Grammar of Modern Standard Arabic. Cambridge: Cambridge University Press.
Sgall, Petr, Jarmila Panevová & Eva Hajičová. 2004. “Deep Syntactic Annotation: Tectogrammatical representation and beyond”. HLT-NAACL 2004 Workshop: Frontiers in Corpus Annotation, 32–38. Boston.
Smrž, Otakar. In prep. Functional Arabic Morphology. Formal System and Implementation. Ph.D. thesis, Charles University in Prague.
_____. 2003. Encode::Arabic. Programming module. Comprehensive Perl Archive Network, http://search.cpan.org/dist/Encode-Arabic/.
_____ & Petr Pajas. 2004. “MorphoTrees of Arabic and Their Annotation in the TrEd Environment”. Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, 38–41. Cairo.
Wehr, Hans. 1980. A Dictionary of Modern Written Arabic. Arabic–English. New York: Spoken Language Service.
Žabokrtský, Zdeněk & Otakar Smrž. 2003. “Arabic Syntactic Trees: From constituency to dependency”. EACL 2003 Conference Companion, 183–186. Budapest.
Section II
Phonology, Morphology, and Syntax
INTONATIONAL AND RHYTHMIC PATTERNS ACROSS THE ARABIC DIALECT CONTINUUM Salem Ghazali*, Rym Hamdi*§ and Khouloud Knis* *Institut Supérieur des langues de Tunis § Université Lumière Lyon 2
1. Introduction

If a non-native speaker of Arabic lived in Morocco for a period of time and, after learning MA¹, tried to communicate in this newly acquired language in Egypt, he or she would be in for a major disappointment. What he or she had thought was Arabic would be just Jabberwocky to an Egyptian. In a comparable situation, illiterate Arabs, too, are very likely to face the same ordeal due to the various lexical, syntactic, morphological and sound pattern differences between the Arabic dialects. There are practically as many distinct words for the possessive pronoun “mine” or “yours”, for example, as there are national anthems in the various Arab countries, and probably not many fewer expressions meaning “there is” or “I want”. Syntactically, yes/no questions in MA are phrased like wh-questions in other dialects. The morphosyntax of negation is another case in point. A sentence is negated with a free morpheme maa before a verb in the East, a disjoint bound morpheme maa…ʃ or maa...ʃi surrounding the verb in Egypt, Libya and Tunisia, and this circumfix is even extended to adjectives further West. At the phonetic and phonological level, the variations cover all the aspects of speech manifestation from the realization of segments and their temporal organization to syllable structure and supra-segmental features such as stress, rhythm and intonation. Drawing upon several previous studies, some of which are unpublished, this paper aims at comparing different aspects of the supra-segmental or prosodic variations across various Arabic dialects in the light of phonetic and phonological factors underlying their structure.

¹ The Arabic dialects will be referred to with the following abbreviations: MA = Moroccan Arabic; AA = Algerian Arabic; TA = Tunisian Arabic; EA = Egyptian Arabic; LA = Lebanese Arabic; SA = Syrian Arabic; JO = Jordanian Arabic; IA = Iraqi Arabic; MSA = Modern Standard Arabic.

For historical and other reasons, variations exist even within the same country, but speakers of other dialects will still manage to identify someone as roughly coming from a specific Arab country or region rather than another. A perceptual study (Barkat 2000) shows that subjects from both North Africa (hereafter NA) and the Middle East (hereafter ME) were able to correctly identify speech stimuli produced by speakers from Morocco, Algeria, Tunisia, Lebanon, Syria and Jordan as belonging to NA or the ME 97% of the time. When asked to be precise and identify the country to which the speaker belongs, subjects from NA reached correct identification rates of only 78% when the speakers were from NA, i.e., from a neighboring country, and 32% when they were from the ME. For subjects from the ME, correct identification rates were 90% for the same region (ME) and 59% for speakers from NA. Similar stimuli were presented to French-speaking subjects, who were asked to perform the same task as in the previous experiment. These subjects were successful in distinguishing stimuli as being from NA or the ME in only 56% of the cases. The results are statistically significant at the .05 level but only slightly above chance. Thus, while speech differences within the Arabic dialect continuum are barely perceptible to the foreign ear, speakers of Arabic are sensitive to inter-regional, and to a lesser extent intra-regional, variations.
According to the Arab subjects in the above-mentioned experiment, the determining factors in deciding whether a stimulus was from NA or the ME were the fast speaking rate of the NA dialects and their jerky nature, especially with respect to MA and AA. Linguistically, we can interpret this impression as relating to the suprasegmental feature of speech rhythm and perhaps to intonation structure. These prosodic features, however, are not independent of the nature and organization of segmental material. Languages have been classified as belonging to roughly three classes of rhythm: stress-timed, syllable-timed and mora-timed. Roach (1982), Dauer (1983, 1987), Laver (1994) and Ghazali et al. (2002),
among others, include descriptions of these types of rhythm and other pertinent matters, namely evidence against the isochrony hypothesis. With regard to the rhythm type to which Arabic belongs, all the investigations, regardless of the dialect, have classified Arabic as stress-timed (Abercrombie 1967, Miller 1984, Benguerel 1999, Tajima et al. 1999, Cheikhrouhou 2005, Ben Abada 2004). Now, if all Arabic dialects are stress-timed, as these investigations suggest, then either rhythm is not an important factor in discriminating between them, or rhythm is a cline, i.e., there are subclasses of rhythm within the stress-timed group. These sub-classes, should they exist, must be distinct enough to allow for discrimination, yet not so different as to fall into another rhythm category such as syllable timing. In a study that will be described below, Ghazali et al. (2002) attempted to investigate these speech rhythm variations within the Arabic dialect continuum. Before discussing the results of that investigation, it would be useful to present an overview of what might be causing these differences in speech rhythm.

2. Vowel Duration and Syllable Structure along the Continuum

2.1 Vowel duration

Segmental duration is phonemic in Arabic (for both vowels and consonants), but both short and long vowels are longer in the dialects of the ME than their counterparts in NA. In fact, as one moves towards the Western pole of the Arab region, not only do all vowels shorten but also the difference between long and short vowels decreases to the point where the opposition may no longer be based on duration, as has been suggested for Moroccan dialects. Table 1 is a summary of short/long vowel ratios in different Arabic dialects and MSA obtained from various studies. These results are based on measurements of vowel durations reported in Jomaa (1991), Barkat (2000) and other investigations.²

² We have sometimes computed the ratios ourselves following inspection of pertinent elements in the data available.

In comparing these percentages, one has to keep in mind that the measurements come from different investigators who did not necessarily carry out their analyses under similar experimental conditions. While the values in Table 1 generally represent the durations of the vowels [a]-[aa], other factors may vary from one data set to another. We do not always have information on the features of the consonant following the test vowel (voicing, manner and place of articulation), the characteristics of the syllable hosting the experimental vowel (open or closed, stressed or unstressed), the number of syllables forming the test word, the number of subjects from whom the measurements were obtained, etc. Similarly, not all the data summarized in Table 1 are very useful for comparison. For example, the duration percentages from the Iraqi dialect were obtained from measurements of short and long vowels produced in isolation and maintained artificially for 300 and 600 ms, respectively. Note also that for MSA, the ratio increases for some speakers when the test words are produced by subjects from NA. This may be in line with the higher ratios in the dialects of the Western region, where subjects may be extending the vowel system of their dialect to that of the Standard.

    Dialect            Short/Long Vowel Ratio    Source
    Standard Arabic
      a. Eastern       39%                       Port et al. 1980
      b. Western       39-50%                    Ghazali & Braham 1992
    Syrian             46%                       Irikoussi 1981
    Kuwaiti            48%                       Al-Dossari 1989
    Jordanian          62%                       Mitalb 1984
                       37-50%                    Zawaydah & de Jong 1999
    Iraqi              50%                       Al-Ani 1970
    Lebanese           50%                       Obrecht 1968; Sayah 1979
    Saudi              52%                       Al-ghamdi 1992
    Egyptian           59%                       Norlin 1987
    Tunisian           59%                       Jomaa 1991
                       63%                       Ghazali unpublished
    Moroccan           77%                       Rhardisse et al. 1990

Table 1. Durations of Short Vowels Expressed as Percentages of Long Vowels
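Each percentage in Table 1 is the short-vowel duration divided by the long-vowel duration. As a minimal sketch, the hypothetical helper below (its name is not taken from any of the cited studies) reproduces the Iraqi figure from the 300 ms and 600 ms durations mentioned above.

```python
def short_long_ratio(short_ms, long_ms):
    """Return the short vowel duration as a percentage of the long one."""
    if long_ms <= 0:
        raise ValueError("long vowel duration must be positive")
    return 100.0 * short_ms / long_ms


# Iraqi vowels held for 300 ms (short) and 600 ms (long), as in the text.
print(short_long_ratio(300, 600))  # 50.0
```

A ratio close to 100% means that duration hardly separates the two vowel classes, which is the direction in which the Moroccan value in Table 1 points.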
On the whole then, the data in Table 1 show a trend towards a decrease in duration differences between short and long vowels as we move into NA. Concerning MA, there has been an ongoing debate on whether this dialect distinguishes between short and long vowels. While studies, such as the one on which the data in Table 1 are based (Rhardisse et al. 1990), argue for the presence of quantity, others (Benkirane 1982, 2002; Embarki 1997) provide a host of evidence
(phonological, phonetic and perceptual) against a short vs. long vowel opposition in MA. They propose instead a system with three full vowels [i, a, u] and one reduced schwa [ǝ] in closed syllables. The phenomenon of vowel reduction is not, of course, limited to MA. Short high vowels are produced as centralized [ɪ] and [ʊ] in closed syllables in all dialects (Ghazali 1979), and in the Western dialects, where short vowels cannot occur in open syllables, [ɪ] and [ʊ] are the only short high vowels available. In this case, quantity opposition is accompanied by quality distinction among the high vowels, and phonological contrast preserving vowel quality can only be obtained with the low vowel. Thus, while a long high vowel can occur in both closed and open syllables, CVV$(C)CV, its short counterpart can only occur in a closed one, CVCC…, as in the examples from TA below.

    High vowels
    Long                              Short
    ʕiid     “religious holiday”      ʕɪdd     “count!”
    ʒiibu    “his pocket”             ʒɪbnæ    “cheese”
    ʒiibhæ   “her pocket”             ʒɪbbæ    “a man’s garment”
    ʃuufu    “look at him”            ʃʊftu    “I saw him”
    ʃuufhæ   “look at her”            mʊddæ    “a period of time”

    The low vowel
    mææʃi    “going”                  mæʃjæ    “pace, gait”
    xamsæ    “five”                   xaamsæ   “the fifth” (fem. sing.)
    kælbæ    “bitch”                  kæælbæ   “restless, turbulent”
Vowel reduction seems to be gaining more ground in other Western dialects, especially among younger generations (Hammi 2004). In TA, for example, both [i] and [æ] tend to approximate the formant structure of [ǝ] in final open syllables. In a study based on spontaneous speech, Barkat (2000) found that while vowels cluster more around the center of the vowel space in the dialects of NA, they remain more peripheral in the Eastern dialects. She observed that, in her corpus, the most frequent vowel in NA is the central vowel [ǝ], which represents 30% of the occurrences of all the short vowels. She also noted that TA is something of a transitional dialect, intermediate between NA and the ME, where vowel reduction is
more marked than in the Eastern dialects but less important than in the rest of NA. With regard to vowel duration obtained from spontaneous speech, her results show the same trend as in the studies included in Table 1. Adding up the durations of all vowels produced by all the subjects regardless of vowel quality, she obtained a 41% short/long vowel ratio for the ME region and 51% for NA.

2.2 Syllable structure³

It is a well-known fact that Western dialects do not in general permit short vowels in open syllables. There are, however, exceptions at least in some southern dialects of Tunisia, where very short vowels are preserved [CǝCV]/[CICV]. Sometimes only high vowels are deleted but the low vowel is kept in that position, as in the dialect of Sfax in Tunisia where, for example, “tomato” is pronounced [ṭamaaṭim] and “waste” is [xaṣaara] (CVCVVCV(C)) as opposed to [ṭmaatim] and [xṣaara] (CCVVCV(C)) elsewhere in NA. There are also occurrences of short vowels in open syllables in words that have been borrowed from MSA or from foreign languages but have become part of the native lexical inventory. The vowels in the first syllable of the words [mudiir] “director”, [muʕællɪm] “teacher” from MSA, and [mækiinæ] “machine”, [biduun] “liquid container” and [babuur] “ship” (from French “machine, bidon and vapeur”) are phonetically short. They are, however, preserved perhaps for one or both of the following reasons. The first is socio-linguistic: the vowel helps to distinguish the vernacular from the Standard. For example, the word for “teacher” is used in the dialect to denote “mason” or “foreman”, and in this meaning it is pronounced [mʕallɪm] with the first vowel being deleted. The second reason, which in my view accounts for all the cases, is that these vowels have been reanalyzed as long vowels, by analogy to other words. Let us illustrate this with examples from TA. These data from Ghazali (1979) are also valid for other Western dialects that have preserved vowel quantity.
The vowel in the monosyllabic word [ziit] “oil” is long and of course stressed. In the word [zituun], from underlying /ziituun/ “olives”, the vowel in the first syllable shortens because stress has shifted to the second syllable containing the long vowel, but does not drop. In the word [zitunææt] “olive trees”, stress shifts to the last syllable which includes a long vowel and, consequently, the vowels of the first two syllables shorten but do not drop. Thus, in a borrowed word such as [biduun], the first vowel is most likely reanalyzed as long by analogy to [zituun]. Note that an underlying short vowel will drop in a similar environment; the last vowel [ɪ] in the verb [jɪktɪb] “he writes” will drop when the last consonant becomes the onset of the following syllable as a result of affixation (or cliticization) that adds a morpheme beginning with a vowel. Thus /jɪktɪb+u/ is phonetically [jɪktbu] “they write, or he writes it”. The deletion of high vowels in open syllables has resulted in a great many lexical items comprising syllables with complex onsets; thus, instead of the typical CVCV… and CVCCVCV… in the ME, one finds a predominance of CCV… and CCVCCCV… in NA. If we consider the large number of reduced vowels and the fact that, contrary to what happens in many other languages, consonants in Arabic lengthen in clusters instead of being compressed (Rejili & Ghazali 2003), then there is a great deal of closure and little space left for vowels in the speech chain of Western dialects. This is most likely what gave the impression of fast and jerky speech as reported by Barkat’s subjects.

³ Although they ultimately underlie rhythmic patterns, foot and moraic structure will not be dealt with in this paper. For a discussion of these phonological matters, the reader is referred to Benkirane (1982), Benhallam (1990) and Imouzaz (2002) for MA, to Kiparsky for a comparison of the different dialects, and to McCarthy & Prince (1990a), as well as many other scholars.

3. Speech Rhythm

3.1 Production tasks

3.1.1 Variations within the Arabic continuum

In an attempt to relate these vowel duration and syllable structure differences in the bipolar Arabic dialect domain to speech rhythm, data were obtained from six Arabic dialects (Ghazali et al. 2002, Hamdi 2001).
The dialects investigated were: Moroccan (four speakers), Algerian (two speakers) and Tunisian (two speakers) representing Western Arabic, then Jordanian (two speakers), Syrian (three speakers) and Egyptian (one speaker) representing the Middle East. The speech data were taken from the recordings used in Barkat (2000), where each subject listened sentence by sentence to the story “The North Wind and the Sun” in French and translated each sentence spontaneously into his or her own dialect. The language corpus used consisted of 140 Arabic sentences (ten sentences per subject) with an average duration of 2.5 seconds for each
sentence. Following the experimental procedures proposed by Ramus et al. (1999), each segment was classified as vowel or consonant. The next step was to measure the duration of (i) each sentence, (ii) each string of consecutive vowels (vocalic intervals), and (iii) each string of consecutive consonants (consonantal intervals). For example, the following sentence from MA comprises ten vocalic and ten consonantal intervals.

    mbʕdha    bdtǝriḥ    ṭṣuṭ    bkul    quwwǝtha
    CCVCCCV   CCVCVCVC   CCVC    CCVC    CVCCVCCV
    “And then the wind began to blow with all its force”
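The interval count for the example sentence can be checked mechanically. The sketch below is only an illustration (the function name is hypothetical): it works on the C/V skeleton rather than on measured durations, and it assumes that intervals run across word boundaries, which is what the count of ten consonantal intervals implies.

```python
from itertools import groupby


def intervals(skeleton):
    """Split a CV skeleton into maximal runs of C's and V's.

    Word boundaries (spaces) are ignored, so consonants at the end of one
    word and the start of the next form a single consonantal interval.
    """
    symbols = [s for s in skeleton if s in "CV"]
    return ["".join(run) for _, run in groupby(symbols)]


skeleton = "CCVCCCV CCVCVCVC CCVC CCVC CVCCVCCV"
runs = intervals(skeleton)
vocalic = [r for r in runs if r[0] == "V"]
consonantal = [r for r in runs if r[0] == "C"]

print(len(vocalic), len(consonantal))  # 10 10
```

In a real analysis each run would carry its measured duration, and those durations, not the symbol counts, feed the rhythm measures.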
The following step consisted in computing the proportion of vocalic and consonantal intervals (%V and %C, respectively) in each sentence and the standard deviation of vocalic and consonantal intervals within each sentence (∆V and ∆C, respectively). Details of how these variables are computed are explained in Ramus & Mehler (1999). Table 2 shows the average proportions of vocalic intervals (%V) and the average standard deviation of consonantal intervals (∆C) for the subjects in each of the six dialects investigated.

    Western Area    %V       ∆C       Eastern Area    %V       ∆C
    Morocco         32.38    8.52     Egypt           36.98    3.89
    Algeria         33.84    6.75     Jordan          41.09    4.76
    Tunisia         34.97    5.25     Syria           43.66    4.54

Table 2. Computed values for %V and ∆C
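As a minimal sketch of how these measures can be obtained for one sentence, the function below takes the durations of the vocalic and consonantal intervals and returns %V, ∆V and ∆C. The function name and the sample durations are invented for illustration, and the use of the population standard deviation is an assumption; the cited procedure may normalize differently or use other units.

```python
from statistics import pstdev


def rhythm_metrics(vocalic, consonantal):
    """Return (%V, delta V, delta C) for one sentence.

    Both arguments are lists of interval durations in the same unit.
    The standard deviations are population values (an assumption).
    """
    total = sum(vocalic) + sum(consonantal)
    percent_v = 100.0 * sum(vocalic) / total
    return percent_v, pstdev(vocalic), pstdev(consonantal)


# Made-up interval durations in milliseconds, not measured data.
vocalic_ms = [60, 90, 120, 70]
consonantal_ms = [80, 150, 100, 210]

percent_v, delta_v, delta_c = rhythm_metrics(vocalic_ms, consonantal_ms)
print(round(percent_v, 2), round(delta_v, 2), round(delta_c, 2))
```

Because %V is a proportion, it does not depend on the unit of measurement, whereas ∆V and ∆C scale with it, so values are comparable only when the same unit is used throughout.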
These results show that while vocalic intervals account for less than 50% of the total duration of a sentence in all the dialects, their proportion is higher in the Eastern dialects than in the Western ones. In fact, there is a gradual increase of %V as one moves from West to East. Conversely, ∆C decreases from West to East. Figure 1 illustrates the negative correlation between %V and ∆C across dialect locations (r = −0.75): as one moves from West to East, ∆C decreases and %V increases.
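As a rough check on the direction of this relationship, Pearson's r can be computed on the six dialect averages of Table 2. This is only a sketch: it uses the averaged values rather than the underlying per-speaker or per-sentence data, so it is not expected to reproduce the reported coefficient exactly.

```python
from math import sqrt

# Dialect averages from Table 2, ordered Morocco, Algeria, Tunisia,
# Egypt, Jordan, Syria.
percent_v = [32.38, 33.84, 34.97, 36.98, 41.09, 43.66]
delta_c = [8.52, 6.75, 5.25, 3.89, 4.76, 4.54]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson_r(percent_v, delta_c)
print(round(r, 2))
```

The coefficient comes out clearly negative, in line with the trend described above: dialects with a higher %V tend to have a lower ∆C.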
Using a t-test for the difference between pairs of dialects, we found that the level of significance increases with the distance between the two countries. For example, while these results (for both %V and ∆C) are not significant when Syria is compared to Jordan or Morocco to Algeria, that is, when two dialects belong to the same region, they are highly significant for pairs of dialects located at opposite ends of the continuum, such as Syrian and Moroccan (p < 0.001). Note, however, that both Tunisia and Egypt, which are located near the center of the continuum, show significant results only when compared to Morocco, a fact that we will try to account for later. Figure 2 is an illustration of the average values of the proportion of vocalic intervals and the standard deviation of the consonantal intervals when the three dialects of each region are grouped together. It clearly shows that %V is higher in the dialects of the Middle East than in the dialects of North Africa (p < 0.0001), while we obtain the opposite results for ∆C. Figure 3 is a three-way comparison: Morocco and Algeria representing the far end of the western pole, Syria and Jordan the eastern end, and Tunisia and Egypt an intermediate zone. This comparison confirms the gradual decrease of %V from East to West with Tunisia and Egypt exhibiting intermediate values, but shows that with respect to ∆C, Tunisia and Egypt are closer to the dialects of the Middle East than to North Africa.

3.1.2 Discussion

Languages with the highest ∆C and low %V, such as English, were those traditionally classified as stress-timed. In Ghazali et al. (2002), the dialects that exhibit these characteristics are those spoken in North Africa. Since Eastern dialects such as those of Iraq (Benguerel 1999) and Jordan (Tajima et al. 1999) have also been classified as stress-timed, we should perhaps allow for a great deal of variation within the class of stress-timing.
To maintain a discrete category ‘stress-timing’ as distinct from other kinds of timing, there should exist one or more key factors whose presence consistently induces the perception of stress-timing. Such a conditioning factor could be the tendency in all Arabic dialects for long or heavy syllables to attract stress. Since syllabic weight in these dialects is a cline, we may get the
Figure 1. Distribution of the dialects along the %V and ∆C dimensions based on Ghazali et al. (2002)
Figure 2. Comparison of %V and ∆C in NA and the ME based on Ghazali et al. (2002)

[Bar chart: %V and ∆C on a vertical scale from 0 to 100 for the groups Jord-Syria, Tun-Egypt and Moro-Alg]
Figure 3. Comparison of %V and ∆C in three groups of dialects (Algeria + Morocco; Tunisia + Egypt; and Jordan + Syria) based on Ghazali et al. (2002)
impression of different subclasses of rhythm. Note also how dialects geographically located between the two poles are also intermediate with respect to phonetic facts (Figure 1). Barkat (2000) reported that most of the discrimination errors made by her subjects were the result of not being able to correctly classify Tunisian speakers. In fact, Tunisian speakers have a %V similar to North Africa but a ∆C closer to the Middle East. In other words, their vowels are slightly longer and less reduced than those of Moroccans and Algerians, but significantly shorter than Syrians and Jordanians. They don’t, however, exhibit the same syllabic complexity as the other North African subjects. 3.1.3 Comparing Arabic to other languages Using the same experimental procedures, Hamdi (in preparation) compared the six Arabic dialects to three other languages: English and French categorized in the literature on speech rhythm as stress-timed and syllable-timed respectively, and Catalan, an intermediate category between stress-timing and syllable-timing. Regarding the Arabic dialects in this investigation, Lebanese replaced Syrian and the number of subjects who generated the speech data was increased to five for Egyptian and Algerian and to ten for the other dialects. For English, French and Catalan there were five subjects for each language. Figure 4 shows that there are three overlapping sub-classes of Arabic dialects: Moroccan and Algerian with high ∆C values and low %V, Jordanian and Lebanese with high %V and Low ∆C, and a somewhat intermediate area occupied by Tunisian and Egyptian. These results confirm the previous findings from a smaller sample. Figure 5, which is a comparison of the various Arabic dialects with English, French and Catalan, shows that: a) Moroccan and Algerian (Western dialects) are on one end and French on the other. They are most distinct, with Moroccan and Algerian having the highest ∆C and the lowest %V values and exactly the opposite for French. 
b) English is somewhere in the middle, although it has higher ∆C values than all other languages and dialects except Moroccan and Algerian. English, however, is not very low in terms of %V.
c) Lebanese and Jordanian (the Eastern dialects) are closer to French with respect to %V, but closer to English in terms of ∆C.
d) Tunisian and Egyptian are both closer to Catalan and have
SALEM GHAZALI, RYM HAMDI & KHOULOUD KNIS
lower values for both %V and ∆C than English. They also have higher ∆C values than the Middle Eastern (ME) dialects but lower %V values.

Remember that ∆C is a measure of the complexity of consonant sequencing, i.e., it is directly proportional to syllable complexity. The %V parameter is an indicator of vowel reduction: its value is low when short or reduced vowels predominate. Thus, if these parameters are taken as indicators of rhythm types (Ramus et al. 1999), then Moroccan and Algerian are more stress-timed than English, a language that is already considered strongly stress-timed. It may also be the case that a new category of rhythm is needed to account for these two dialects, but we do not know at this stage what that would be.

Before discussing the results of other investigations using different experimental techniques, note that Hamdi et al. (2004) compared results from ∆C and %V measurements with results obtained from Pair-wise Variability Indices (PVI). This technique, described in Grabe & Low (2002), is more sensitive to stress patterns, as it measures the mean difference in duration between two successive vowels in an utterance. Results from this method, which is supposed to provide a global representation of speech rhythm, are basically the same as those obtained from ∆C and %V measurements (r² = 0.83).

3.2 Speech cycling tasks
3.2.1 Comparing Jordanian Arabic (ME) to English and Japanese
Tajima et al. (1999) used the speech cycling method (Tajima 1998) to compare English, Arabic (JA) and Japanese, the last of which has been categorized as mora-timed. They reported that Arabic and English exhibited similar rhythmic patterns that were different from those of Japanese. They concluded that “Arabic and English speakers seem to pay close attention to the stressed syllables, producing them at simple harmonic phases” (288).
Comparing English and Arabic, they noted that “stressed syllables within a phrase deviated from a strictly isochronous sequence to a greater extent in Arabic than in English.” These remarks seem to be in line with the differences between English and JA observed in the ∆C and %V values above, which suggested that English was more stress-timed than JA.
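The interval measures and the PVI referred to above lend themselves to a short computational sketch. The Python fragment below is only an illustration, not the procedure used in the studies cited: it assumes that an utterance has already been segmented into vocalic ("V") and consonantal ("C") intervals with durations in milliseconds, takes %V as the vocalic share of total duration and ∆C as the standard deviation of consonantal interval durations, following the definitions in Ramus et al. (1999), and computes the raw and the normalized PVI in the sense of Grabe & Low (2002). The function names, the use of the population standard deviation, and the sample durations are assumptions made for the example.

```python
from statistics import pstdev


def percent_v(intervals):
    """%V: share of total utterance duration taken up by vocalic intervals."""
    vocalic = sum(d for label, d in intervals if label == "V")
    total = sum(d for _, d in intervals)
    return 100.0 * vocalic / total


def delta_c(intervals):
    """Delta-C: standard deviation of consonantal interval durations.

    Assumption: the population standard deviation is used here.
    """
    return pstdev(d for label, d in intervals if label == "C")


def raw_pvi(durations):
    """Mean absolute difference in duration between successive intervals."""
    if len(durations) < 2:
        raise ValueError("at least two intervals are needed")
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)


def normalized_pvi(durations):
    """Like raw_pvi, but each difference is divided by the pair's mean duration."""
    if len(durations) < 2:
        raise ValueError("at least two intervals are needed")
    terms = [abs(a - b) / ((a + b) / 2) for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical segmentation of one short utterance (durations in ms).
sample = [("C", 100), ("V", 100), ("C", 300), ("V", 100)]
vowel_durations = [d for label, d in sample if label == "V"]

print(percent_v(sample), delta_c(sample))
print(raw_pvi(vowel_durations), normalized_pvi(vowel_durations))
```

On this reading, a dialect with many reduced vowels and heavy consonant clusters would yield a low `percent_v` and a high `delta_c`, which is the corner of the %V/∆C plane where Moroccan and Algerian are reported to fall.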
Figure 4. %V as a function of ∆C for all subjects in Arabic dialects, from Hamdi (in preparation)
Figure 5. Distribution of the Arabic dialects, English, French and Catalan with respect to %V and ∆C, from Hamdi (in preparation)
3.3 Perceptual tasks
3.3.1 Comparing Tunisian Arabic (Intermediate) to English and French
Ben Abda (2004) set up a perceptual experiment using reversed speech stimuli in an attempt to find out if subjects could discriminate
between spectrally inverted sentences from English, French and TA. The sentences came from the story “Little Red Riding Hood” and included statements, different types of questions, exclamations, etc. With reversed speech, subjects have access only to syllabic structure (although inverted) and supra-segmental information. The speech samples were produced by one native speaker of American English and four native speakers of TA; the latter produced both the TA and the French stimuli. The listeners were ten native speakers of TA and ten native speakers of English. They were all university graduates and were trained by the experimenter to distinguish between stress timing and syllable timing.

The first task consisted of identifying the inverted stimuli as English or Arabic. The results, although statistically significant, show some degree of confusion (116 correct identifications and 84 misses). Looking at the distribution of these answers, one notes that most of the correct answers (82/116) came from the native speakers of English and most of the errors (66/84) from the Tunisian listeners. When the task was to discriminate between French and English, the overwhelming majority of answers were correct (177) and only 23 were wrong, with Tunisian subjects doing better this time. In the third listening task, stimuli from the three languages were presented at the same time. In this case, French was clearly distinguished from Arabic and English, but English and Arabic were confused, with correct identification occurring only 53% of the time, again with Tunisians accounting for most of the incorrect answers.

What caused the difference between the behavior of the English and Tunisian listeners is not clear. It might be that the English group took the experiment more seriously and were thus more attentive, or that they were particularly sensitive to some cues in the acoustic signal.
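The counts reported for the identification tasks can be unpacked with a little arithmetic. The sketch below recovers the per-group scores implied by the figures above and evaluates the pooled scores against a 50% chance level with an exact one-sided binomial test. The choice of test is an assumption made for illustration, since the source does not say which test was used, and the test treats the 200 responses as independent, which is a simplification when each listener contributes several answers. The function name is hypothetical.

```python
from math import comb


def binomial_tail(successes, trials, p=0.5):
    """Exact probability of at least `successes` hits in `trials` attempts."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Counts reported for the English/Arabic identification task.
correct_total, wrong_total = 116, 84
correct_english, wrong_tunisian = 82, 66

# Counts implied by the reported figures.
correct_tunisian = correct_total - correct_english  # 34
wrong_english = wrong_total - wrong_tunisian         # 18

english_accuracy = correct_english / (correct_english + wrong_english)
tunisian_accuracy = correct_tunisian / (correct_tunisian + wrong_tunisian)

# Pooled scores against chance (assumed chance level: 0.5).
p_english_arabic = binomial_tail(116, 200)
p_french_english = binomial_tail(177, 200)

print(english_accuracy, tunisian_accuracy)
print(p_english_arabic, p_french_english)
```

Seen this way, the pooled English/Arabic score is only modestly above chance, and the split by listener group shows that the English listeners carry the effect while the Tunisian listeners score below 50%.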
Note, though, that care was taken to use Arabic sentences that did not contain back consonants that do not occur in English, such as uvulars and pharyngeals. Perceptual experiments using reversed speech thus confirm the findings from production experiments with regard to the similarity between English and, at least, TA and JA. They also confirm the differences between French on the one hand, and English and Arabic on the other. This experiment also included stimuli that consisted of just F0 and amplitude, with no segmental material. F0 was extracted
from the same test sentences but before spectral inversion. Surprisingly, here, too, the subjects were able to discriminate between French on the one hand, and English and Arabic on the other, but not between Arabic and English in a significant manner. No attempt was made in this experiment to produce stimuli in which segmental material is preserved but F0 is kept constant. Controlling those variables could be useful in determining the relative importance of each as a discrimination factor. In the next section we will examine pitch variations in the Arabic dialects.

4. Intonation
Investigating intonation patterns across Arabic dialects was motivated by the same reasons as those behind our work on rhythm, namely to understand the supra-segmental features underlying the perceived speech differences. Knis (2004) recorded data that consisted of both read sentences and spontaneous speech from six Arabic dialects. The sentences included statements, yes/no questions, wh-questions, and sentences in contexts expressing doubt or surprise. The subjects were also asked to tell the story of “Little Red Riding Hood”. There were two subjects from each of the following countries: Morocco, Tunisia, Egypt, Syria and Iraq. The subjects were intended to be representative of the Western, the Eastern and the intermediate regions. In addition to speech data from the dialects, the subjects were also asked to produce the same speech material in MSA, the aim being to find out whether the dialects had an effect on the production of the standard. Only the results pertaining to the intonation patterns of statements in the dialects will be discussed in this paper. The analysis of pitch patterns was carried out using a version of the ToBI method (Silverman et al. 1992).

4.1 Results⁴
4.1.1 Egyptian Arabic
The statements produced by the two Egyptian speakers show clear cases of the declination phenomenon.
The intonation contour tends to go down from beginning to end, with peaks corresponding to stressed syllables that are generally lower than the initial pitch accent.

⁴ Figures 6 to 17 are based on data from Knis (2004).
Figures 6 and 7 illustrate the intonation contours of the sentences Ɂinnaharda elgawwi baarid “it’s cold today” and Ɂibna ʕammak ʕajjaan “your cousin is ill”, respectively. In general, the patterns comprise two pitch accents, on the first and the last stressed syllables, with two peaks that may sometimes have the same level. The boundary tone in statements is almost always a low tone (L%). In summary, the most typical patterns for statements in EA are either HL*L!H*LL% (57% of the sentences) or LH*LL% (30% of the sentences), where HL* = falling, !H* = down-stepped high, and L% = low boundary tone.

4.1.2 Syrian Arabic
Statements in SA exhibit a rising-falling pattern with the high tone extending over a considerable stretch of segments. This results in plateau tunes or hat patterns, where the maximum pitch does not form a peak but stretches over the whole stressed syllable, sometimes extending to the next one, as in the sentence maama bilbeet “my mother is at home” in Figure 8. There are also bi-tonal pitch contours where pitch rises to the same level over the two stressed syllables and the last pitch accent falls on the final or penultimate syllable. The intonation of statements in this sample is also characterized by declination: there is an initial high peak, then pitch starts to fall, with rises corresponding to stressed syllables, as in the sentence ṭaɁṣ beerid eljoom “it’s cold today” illustrated in Figure 9. Another important observation regarding SA is that pitch sometimes falls on stressed syllables. The first syllable of the initial word kaanit in the sentence kaanit elɁasɁilǝ ṣaʕbǝ “the questions were difficult” (Figure 10) is stressed and exhibits an HL* pattern. This pattern is rare in Western dialects. On the whole, the major patterns observed are LH*HLL% and LH*L!H*L%, which account for 53% and 30% of all the patterns, respectively.
Figure 6. Ɂinnaharda elgawwi baarid
Figure 7. Ɂibna ʕammak ʕajjaan
Figure 8. maama bilbeet
Figure 9. ṭaɁṣ beerid eljoom
Figure 10. kaanit elɁasɁilǝ ṣaʕbǝ
4.1.3 Iraqi Arabic
What characterizes statements in the Iraqi sample is the frequency of the HL* pattern, as illustrated in the sentence eldʒaw baarid eljoom “it’s cold today” (Figure 11). This recurrent falling pattern is very rare in the other dialects. One-pitch-accent contours are seldom found here, and the typical patterns are intonation contours with continuous pitch variations on the syllables that bear lexical stress. Syllable prominence is achieved through either a pitch rise or a pitch fall. In general, the salient feature of this dialect is the predominance of peaks and valleys within the contour, which leads to a continually changing melody, as in the sentence ṣabaaḥ elxeer jaa ʒaddati “good morning grandmother”. It seems as if there were a pitch accent on each lexical stress even when
the highest pitch falls on one particular syllable, which in turn varies from one position to another in the sentence. The final boundary tones are mostly low; the non-low tone at the end of the sentence in Figure 12 could be explained by the fact that it was extracted from a story where the speaker had more to say in this particular situation. The most recurrent pitch patterns in statements for this dialect are: HL*H*L% or LH*L!H*L% and to a lesser extent LH*HL%. These three patterns account for 92% of all the sentences examined.
Figure 11. eldʒaw baarid eljoom
Figure 12. ṣabaaḥ elxeer jaa ʒaddati
4.1.4 Moroccan Arabic
The declination which characterizes Eastern dialects is not present in the intonation contours of MA. There is usually one rising pitch that corresponds to a stressed syllable, and in this sample the typical pattern is a rising-falling one, with the peak on the penultimate stressed syllable, especially in a neutral context with no particular focus, as in the sentence wǝld ʕammɪk mriδ “your cousin is ill” (Figure 13).
Sentences with more constituents in their phrase structure may have two or more pitch accents. The sentence in Figure 13 is made up of one NP (a genitive construction) and an AP and exhibits one pitch accent. The sentence lʒaw barid ljum “it’s cold today” (Figure 14) comprises an NP, an AP, and an ADV and is rendered with two pitch accents separated by a short pause. The first accent rises and falls, then rises slightly again because it is non-terminal, and the second pitch accent is rising-falling, indicating the end of the sentence. These patterns are similar to those described by Benkirane (2002), who observed this same rising-falling contour for statements in MA. He noted that in the sentence amina mreḍa “Amina is ill”, the peak was on the penultimate stressed syllable, and that on the whole there was no evidence for declination in the intonation contour and no alternation of peaks and valleys. The most prominent intonation patterns for statements in MA are LH*LL% (40% of the cases) and LH*HLL% (30% of the cases).
Figure 13. wǝld ʕammɪk mriδ
Figure 14. lʒaw barid ljum
4.1.5 Tunisian Arabic
For statements, the sentences from the TA sample show a simple rising-falling pitch pattern with the peak at the beginning of the contour, on the first stressed syllable, as in the sentence baarda ljuum “it’s cold today” in Figure 15. This LH*LL% pattern is very common in short statements of no more than two or three words. In the sentence wɪld ʕammɪk mriiδ “your cousin is ill” (Figure 16), the pitch rise is on the second syllable, that is, the first syllable of the second word in the compound noun wɪld ʕammɪk, literally “son-uncle (uncle’s son)”. This is the typical stress pattern in endocentric compounds of this type, where stress falls on the non-head noun. The final boundary tone was falling in most cases, but both subjects produced the sentence ommi fɪddaar “my mother is at home” with a final rise (Figure 17). The major pitch patterns for statements in TA are thus LH*LL% (46%), H*LL% (20%) and LH*HLL% (13%).
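The percentages reported in Sections 4.1.1 to 4.1.5 can be gathered into a small table to make cross-dialect comparison easier. The sketch below is merely a convenience for the reader: the figures are those given in the text, Iraqi Arabic is left out because only an aggregate figure (92% for three patterns) is reported, and the dictionary layout and helper names are assumptions made for the example.

```python
# Share (%) of statements showing each tonal pattern, as reported in the text.
PATTERNS = {
    "Egyptian": {"HL*L!H*LL%": 57, "LH*LL%": 30},
    "Syrian": {"LH*HLL%": 53, "LH*L!H*L%": 30},
    "Moroccan": {"LH*LL%": 40, "LH*HLL%": 30},
    "Tunisian": {"LH*LL%": 46, "H*LL%": 20, "LH*HLL%": 13},
}


def dominant_pattern(dialect):
    """Return the most frequent reported pattern for a dialect."""
    shares = PATTERNS[dialect]
    return max(shares, key=shares.get)


def dialects_with(pattern):
    """Return the dialects (alphabetically) for which a pattern is reported."""
    return sorted(d for d, shares in PATTERNS.items() if pattern in shares)


def reported_coverage(dialect):
    """Return the total share of statements covered by the reported patterns."""
    return sum(PATTERNS[dialect].values())


for dialect in PATTERNS:
    print(dialect, dominant_pattern(dialect), reported_coverage(dialect))
print(dialects_with("LH*LL%"))
```

Laid out this way, the simple rising-falling LH*LL% pattern is the leading pattern in the Moroccan and Tunisian samples and the second most frequent one in the Egyptian sample.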
Figure 15. baarda ljuum
Figure 16. wɪld ʕammɪk mriiδ
Figure 17. ommi fɪddaar
4.2 Summary
We may conclude from comparing the intonation patterns of statements in the samples representing the six dialects above that:
a. They generally end with a low boundary tone, as is the case for the great majority of world languages. Most of the exceptions to this pattern can be explained by utterances that remain incomplete.
b. In Western dialects, and to a lesser extent in EA, nuclear accents and accompanying pitches are predominantly rising-falling (LH*LL%).
c. High pitches in SA are rarely represented by rapid changes in the contour, but are maintained over an extended period of time, forming a plateau. The pattern LH*HL% represents more than 50% of the statements in this dialect.
d. IA is characterized by the predominance of the HL* pattern, which is encountered to a limited extent in SA but is almost absent from the other dialects.
e. Declinations are frequent in the Eastern dialects, giving them a greater melodic variation compared to Western dialects.

5. Conclusion
When we set out to obtain the speech data that would allow us to compare the different Arabic dialects, we were hopeful our informants would produce equivalent if not identical utterances. Although we tried various techniques to elicit the desired responses, there was not much we could do about the lexical item a speaker chose to designate something in his/her dialect. In the intonation study, for example, the
expression “it’s cold today” came out as baarda ljuum in TA, lʒaw barid ljum in MA, ṭaɁṣ beerid eljoom in SA, and Ɂinnaharda elgawwi baarid in EA, with two different words for “weather” (if we ignore some phonetic variations). Note that we could have ended up with three separate words had the TA subject produced the usual expression iddinja baarda ljuum, literally, “the world is cold today”. Lexical differences in themselves may thus be sufficient cues for discriminating between the various dialects, especially those at opposite ends of the Arab region. Beyond lexical diversity, however, there is an array of inter-related and inter-dependent segmental and supra-segmental variables that contributes to the coloring of this dialectal panorama. In addition to specific intonational patterns distinguishing the Western from the Eastern varieties, there are salient features characterizing different dialects or sub-regions. Short and reduced vowels coupled with complex syllabic structure, as in the Western pole of the continuum, seem to be correlates of stress-timing. But some Eastern dialects were also shown to be closer to stress-timed languages such as English than to mora-timed languages like Japanese (Tajima et al. 1999, Benguerel 1999). However, when samples representing all the Arabic varieties were compared at the same time to other syllable-timed and stress-timed languages, we no longer obtained a homogeneous group of Arabic dialects with respect to rhythm types. Some Western dialects (Moroccan and Algerian), with very high ∆C and very low %V, seemed to be more stress-timed than English. The Eastern dialects (Jordanian and Lebanese), exhibiting high %V and very low ∆C, were closer to syllable-timed languages such as French than to Moroccan or Algerian. A third, intermediate group (Tunisian and Egyptian) appeared somewhere in between, although there was a slight difference between them with respect to ∆C.
Since rhythm categorization is mainly a matter of perception, controlled perceptual experiments could serve to indicate whether these differences in the acoustic signal really correspond to distinct rhythm types or to subclasses within a major rhythm category.
REFERENCES
Abercrombie, David. 1967. Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Al-Ani, Salman. 1970. Arabic Phonology: An acoustic and physiological investigation. The Hague: Mouton.
Al-Dossari, A. 1989. Le phasage des gestes mandibulaires vocaliques et consonantiques en arabe koweïtien. Mémoire de DEA, Université de Grenoble III.
Barkat, Melissa. 2000. Détermination des indices acoustiques robustes pour l’identification automatique des parlers arabes. Thèse de Doctorat, Université Lumière Lyon.
Ben Abda, Imen. 2004. The Perception of Rhythm in English and Tunisian Arabic: A comparative study. M.A. thesis, Institut Supérieur des Langues de Tunis, Tunis.
Benguerel, A. 1999. “Stress-timing vs. Syllable-timing vs. Mora-timing: The perception of speech rhythm by native speakers of different languages”. VARIA, Etudes & Travaux 3.
Benhallam, A. 1990. “Moroccan Arabic Syllable Structure”. In Langues et Littératures VIII, Publications de la Faculté des Lettres et des Sciences Humaines, Rabat.
Benkirane, Thami. 1982. “Durée, prosodie et syllabation en arabe marocain”. Travaux de l’Institut de Phonétique d’Aix 8. 49-83.
_____. 2002. Codage Prosodique de l’énoncé en arabe marocain. Thèse de doctorat d’état.
Cheikhrouhou, Maha. 2005. Tunisian Arabic and English Speech Rhythm: A comparative analysis. Doctoral thesis, University of Manouba, Tunis.
Cruttenden, Alan. 1986. Intonation. New York: Cambridge University Press.
Dauer, R. M. 1983. “Stress-timing and Syllable-timing Reanalyzed”. Journal of Phonetics 51-69.
_____. 1987. “Phonetic and Phonological Components of Language Rhythm”. Proceedings of the XIth ICPhS, Tallinn, Estonia 5.447-450.
Embarki, M. 1997. “La quantité vocalique en arabe marocain: entre l’apparentement historique et la réalité acoustique”. Actes des Journées d’Etude Linguistique: La Voyelle dans tous ses Etats, Nantes 44-49.
Ghazali, Salem. 1979. “Du Statut des voyelles en Arabe”. Etudes Arabes; Analyses Théorie 2/3:8.199-219.
Ghazali, Salem & Abdelfattah Braham. 1992. “Voyelles longues et voyelles brèves en arabe standard: Organisation temporelle”. 19ème JEP du GCP de la SFA. Bruxelles.
Ghazali, Salem, Rym Hamdi & Melissa Barkat. 2002. “Speech Rhythm Variation in Arabic Dialects”. Proceedings of the First International Conference on Speech Prosody. Aix-en-Provence, pp. 331-334.
Grabe, E. & E. L. Low. 2002. “Durational Variability in Speech and the Rhythm Class Hypothesis”. Papers in Laboratory Phonology 7. The Hague: Mouton.
Hamdi, Rym. 2001. ?al-?iqaaʕ fi-llahajaat ?al-ʕarabiyya: diraasasamaaiyya. M.A. thesis, Institut Supérieur des Langues de Tunis, Tunis.
Hamdi, Rym, Melissa Barkat-Defradas, E. Ferragne & François Pellegrino. 2004. “Speech Timing and Rhythmic Structure in Arabic Dialects: A comparison of two approaches”. ICSLP.
Hamdi, Rym. In preparation. Détermination d’indices prosodiques robustes en vue de l’identification automatique des parlers arabes. Ph.D. dissertation, Institut Supérieur des Langues de Tunis, Tunis.
Hammi, Rihab. 2004. English Vowel Reduction to Schwa by EFL Tunisian Students. M.A. thesis, Institut Supérieur des Langues de Tunis, Tunis.
Hirst, Daniel & Albert Di Cristo, eds. 1998. Intonation Systems: A survey of twenty languages. Cambridge: Cambridge University Press.
Imouzaz, Said. 2002. Intéraction des contraintes dans la morphologie non-gabaritique de l’arabe marocain de Casablanca: témoignage pour la théorie de l’optimalité. Thèse de doctorat, Université Hassen II, Mohammedia.
Jomaa, Mounir. 1991. Organisation temporelle acoustique et articulatoire de la quantité en Arabe tunisien. Thèse de doctorat, Université Stendhal, Grenoble III.
Kiparsky, Paul. “Syllables and Moras in Arabic”. Available from the Internet.
Knis, Khouloud. 2004. ?atharu ?allahajaat ?alʕarabiyya fi tanghiim ?al-fuSHaa. M.A. thesis, Institut Supérieur des Langues de Tunis, Tunis.
Laver, John. 1994. Principles of Phonetics. Cambridge: Cambridge University Press.
McCarthy, John & A. Prince. 1990a. “Foot and Word in Prosodic Morphology: The Arabic Broken Plural”. Natural Language and Linguistic Theory 8.209-283.
Miller, M. 1984. “On the Perception of Rhythm”. Journal of Phonetics 12.75-83.
Mitleb, F. M. 1984. “Vowel Length in Arabic and English: A spectrographic test”. Journal of Phonetics 12.229-235.
Norlin, K. 1987. “A Phonetic Study of Emphasis and Vowels in Egyptian Arabic”. Working Papers 30, Lund University, Department of Linguistics, 1-119.
Obrecht, Dean. 1968. Effects of the Second Formant on the Perception of Velarization Consonants in Arabic. The Hague: Mouton.
Ramus, F. & J. Mehler. 1999. “Language Identification with Supra-segmental Cues: A study based on speech re-synthesis”. Journal of the Acoustical Society of America 105.1: 512-521.
Ramus, F., M. Nespor & J. Mehler. 1999. “Correlates of Linguistic Rhythm in the Speech Signal”. Cognition 73:3.265-292.
Rejili, Choukri & Salem Ghazali. 2003. “Consonant-cluster Duration in Standard Arabic”. 15th International Congress of Phonetic Sciences, Barcelona, 1105-1108.
Rhardisse, N., R. Sock & C. Abry. 1990. “L’efficacité des cycles acoustiques dans la distinction des quantités vocaliques et consonantiques en arabe marocain”. 18ème JEP du GCP de la SFA, pp. 108-112.
Roach, Peter. 1982. “On the Distinction between ‘Stress-timed’ and ‘Syllable-timed’ Languages”. Linguistic Controversies, ed. by D. Crystal, pp. 73-79. London: Edward Arnold.
Silverman, Kim, Mary E. Beckman, John Pitrelli, Mari Ostendorf, Colin Wightman, Patti Price, Janet Pierrehumbert & Julia Hirschberg. 1992. “A Standard for Labeling English Prosody”. Proceedings, Second International Conference on Spoken Language Processing 2: 867-70. Banff, Canada.
Tajima, K. 1998. Speech Rhythm in English and Japanese: Experiments in speech cycling. Ph.D. dissertation, Indiana University, Bloomington.
Tajima, K., B. Zawaydeh & M. Kitahara. 1999. “A Comparative Study of Speech Rhythm in Arabic, English and Japanese”. Proceedings of the XIV ICPhS, San Francisco.
Roots and patterns in Arabic Lexical Processing*
Abdessatar Mahfoudhi King Saud University
* This research was partially funded by a grant from the Tunisian Ministry of Higher Education and a grant from the Faculty of Graduate Studies at the University of Ottawa. I would like to thank all the participants in this study, as well as all the people who helped in their recruitment. I am also grateful to Eta Schneiderman for valuable comments on an earlier version of this work and to Sami Boudelaa and Kenneth Forster for helpful comments on the design and suggestions on the DMDX software. I am, of course, the only person responsible for any possible errors of fact or interpretation.

1. Introduction
The status of roots and patterns remains controversial in theories of both Arabic morphology and Arabic lexical processing. There are two major views on Arabic morphology. Some argue for roots and patterns as the basis of word formation (e.g., McCarthy 1981), but others defend a stem- or word-based approach (e.g., Benmamoun 1999). The findings of lexical processing studies are also far from homogeneous. The present study was designed to provide further external evidence for the cognitive validity of these two morphemes. It included three experiments that used a lexical decision task with masked visual priming to examine the priming effect of sound roots and of patterns with a sound or a weak root in word recognition. The results will be discussed in light of theories of Arabic morphology and lexical processing. The paper is structured as follows. Section 2 gives a brief description of the two major theories of Arabic morphology. Section 3 reviews previous psycholinguistic studies on roots and patterns in Arabic. Section 4 reports on the experiment on the role of roots in Arabic lexical processing. Section 5 reports on the
experiments on the role of patterns in Arabic lexical processing. The last section discusses the results of the two studies.

2. Theories of Arabic Morphology: Patterns and Roots or Stems and Affixes?
There are two major opposing theories of Arabic morphology. On one side, there is the morpheme-based theory, whose proponents (e.g., Cantineau 1950a, 1950b; McCarthy 1981) argue that derivations are based on the process of mapping roots onto patterns. For instance, the word rakib “ride” is made of the root {r, k, b} and the pattern {CaCiC}. The root carries the core meaning of the word, “riding”, and the pattern has the syntactic meaning ‘perfective, active’. While the classical theory,¹ as adopted in Cantineau’s work, relies on roots and patterns, McCarthy’s Prosodic Morphology proposes that the pattern should be divided into three morphemes represented on separate tiers à la autosegmental phonology (Goldsmith 1976): (i) the skeleton made of vocalic and consonantal slots, (ii) affixal consonants, if any, and (iii) vowels. On the opposite side, there is the stem-based theory (e.g., Ratcliffe 1997, Benmamoun 1999), which maintains that derivations are stem-based. This controversy has implications for lexical representation and lexical processing. The stem/word-based theory is in line with the tenets of the full-listing hypothesis of lexical processing (e.g., Butterworth 1983), which assumes that words are represented and accessed as whole units. The morpheme-based theory is congruent with both the decompositional hypothesis (e.g., Taft 1981) and the dual-access hypothesis (e.g., Caramazza et al. 1988) of lexical processing, both of which assume that (at least some) complex words are accessed and represented as separate morphemes.

¹ For an overview of the classical theory of Arabic morphology, the reader is referred to Bohas & Guillaume (1984). Bohas & Guillaume emphasize that, unlike the modern structuralist Semitists (e.g., Cantineau 1950a, 1950b), who suggest that all derivations are a mapping of a root to a template, the old Arab grammarians propose word-to-word derivations in many cases. It is, however, possible to propose root-to-template mapping while still proposing that words are derived from others with some additions and deletions of suffixes as well as a change in vowels, as done by Watson (2002).

3. Previous Studies
While there is evidence for the role of the root in lexical processing, evidence for the pattern is still inconclusive.² Boudelaa & Marslen-Wilson (2005), who used a visual lexical processing task, found a priming effect
of roots in Modern Standard Arabic deverbal nouns and verbs at prime display times of 32, 48, 64 and 80 ms. Abu-Rabia & Awwad (2004), on the other hand, did not find any priming effect at a display time of 50 ms using both masked priming and naming tasks. As for the pattern, Boudelaa & Marslen-Wilson (2005) found a priming effect of this construct at display times of 48 and 64 ms in deverbal nouns and only at an SOA (stimulus onset asynchrony) of 48 ms in verbs. Mimouni et al. (1998) tested both unimpaired and aphasic speakers of Algerian Arabic at an SOA of 250 ms using a cross-modal priming lexical decision task and found no effect of the pattern in nouns. Abu-Rabia & Awwad (2004) also found no priming of patterns in derived Modern Standard Arabic nouns at an SOA of 50 ms using masked priming and naming tasks.

4. Study 1: Sound Roots
4.1 Objectives
The goal of this experiment was to validate previous findings related to the role of roots in Arabic lexical processing. The experiment tested whether a masked word prime would facilitate the recognition of a target word having the same root. Given some evidence for the importance of the root in lexical processing in Arabic (Boudelaa & Marslen-Wilson 2005) and Hebrew (e.g., Frost et al. 1997), I expected to find a priming effect of the root. The assumption behind priming is that a significant facilitation effect is evidence that the shared morpheme is being decomposed from a complex word and used to activate the target word. To ascertain that any potential facilitatory effect is morphological, a semantic condition and an orthographic/phonological condition, as well as an unrelated condition, were included. The data were divided into two sets. The first set included 24 targets and was paired with primes that belonged to one of these three conditions: (i) +Root +Semantics, (ii) +Orthography/+Phonology, and (iii) Unrelated. In the first condition, primes and targets had the same root as well as a transparent semantic relationship.
² Other evidence for the cognitive relevance of the root morpheme in Arabic comes from speech errors (Abd-el-Jawad & Abu-Salim 1987 and Berg & Abd-el-Jawad 1996), speech of aphasic patients (Prunet et al. 2000), and well-formedness judgments (Frisch & Zawaydeh 2001).

In the second condition, primes shared with targets roughly the same number of letters/phonemes in the same order as the related conditions, but not the
same root. The third condition included primes that had no semantic or formal relation with their targets. The second set of targets was paired with primes that belonged to one of these conditions: (i) +Root -Semantics, (ii) +Orthography/+Phonology, and (iii) Unrelated (see Figure 1, below). In the first condition, primes and targets shared the same root but had an opaque semantic relationship. In the second condition, the primes shared the same number of letters/phonemes with the targets as did the morphological primes. In the third condition, the primes shared no morphological or orthographic/phonological relationship with the targets. The unrelated condition in both target sets served as the baseline against which the two other conditions were measured.

4.2 Participants
The participants were 36 Arabic-speaking students from Tunisia, where all the experiments were conducted. They were aged between 22 and 27 and all had at least 12 years of formal education in Arabic. They had normal or corrected-to-normal vision. The participants in all experiments were volunteers.

4.3 Stimuli and design
The targets were 48 triliteral Arabic verbs in the third person singular perfective past, a rather neutral form. They had a mean length of 4.42 letters and 3.65 syllables. As indicated above, the targets were divided into two sets, each containing 24 words. Each target of the first set was paired with three primes, one from each of the first three experimental conditions mentioned above: (i) the morphologically and semantically related, (ii) the orthographically/phonologically related, and (iii) the unrelated conditions. The second set of targets was also paired with three types of primes: (i) the morphologically related (+Root -Semantics), (ii) the orthographically/phonologically related, and (iii) the unrelated. The letter and syllable lengths of the primes were kept very similar.
The primes in the morphologically and semantically related condition had an average of 3.83 letters and 3.38 syllables. The mean length of letters and syllables in the morphological condition with opaque semantic relationship was 4.17 and 3.55, respectively. In the orthographically/phonologically related condition, the mean length of letters and syllables was 4.19 and 3.56, respectively. The primes in the unrelated condition had a mean length of letters of 3.81 and a mean
roots and patterns in arabic lexical processing
length of syllables of 3.37. For the list of experimental items in the three conditions, see Appendix 1. The number, position, order, and continuity of the overlapping letters in the orthographic/phonological control condition mimicked as much as possible those in the related conditions. The average amount of prime-target overlap was 3.13 letters and 3.13 phonemes in the morphologically and semantically related condition; 3.13 letters and 3.13 phonemes in the morphologically related condition; and 2.87 letters and 3.02 phonemes in the orthographically/phonologically related condition.
The semantic relatedness between both types of morphologically related primes and their targets was based on the judgment of twenty native speakers of Arabic on a seven-point relatedness scale, with 1 being ‘unrelated’ and 7 being ‘very much related’. The semantically related set included items whose mean rating was 4 or more, with an overall mean of 5.03. The semantically unrelated set included items whose mean rating was less than 3.5, with an overall mean of 2.47.
The final selection of the 48 target words and all primes was based on a judgment of familiarity, in which 30 native speakers rated words on a seven-point familiarity scale, 1 being ‘unfamiliar’ and 7 ‘very familiar’. This procedure was followed in all the following experiments. Only words that had a mean familiarity score between 4 and 6 were finally included. The targets had an overall mean familiarity score of 5.16. The unrelated primes were given an average score of 4.87. The overall mean was 4.88 in the orthographically related condition and 5.09 in the morphologically related condition (5.20 in the [+Semantic +Root] condition and 4.97 in the [-Semantic +Root] condition). In addition to the 48 words and their corresponding primes in every condition, 48 unrelated word-word fillers were selected. 
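The selection criteria above (mean relatedness of 4 or more for the related set, below 3.5 for the unrelated set, and mean familiarity between 4 and 6) can be sketched as a simple filter. This is only an illustration: the function names and the sample ratings are hypothetical and are not taken from the study's materials.

```python
from statistics import mean

def classify_relatedness(ratings):
    """Classify a prime-target pair from its 1-7 relatedness ratings.

    Thresholds follow the text: mean >= 4 is 'related', mean < 3.5 is
    'unrelated'; anything in between is excluded (None).
    """
    m = mean(ratings)
    if m >= 4:
        return "related"
    if m < 3.5:
        return "unrelated"
    return None

def is_familiar_enough(ratings, low=4.0, high=6.0):
    """Keep a word only if its mean 1-7 familiarity lies between low and high."""
    return low <= mean(ratings) <= high

# Invented ratings, for illustration only.
print(classify_relatedness([5, 6, 4, 5]))   # related
print(classify_relatedness([2, 3, 2, 1]))   # unrelated
print(classify_relatedness([4, 3, 4, 4]))   # None (mean 3.75, excluded)
print(is_familiar_enough([5, 5, 6, 4]))     # True (mean 5.0)
```

Pairs whose mean rating falls in the gap between 3.5 and 4 belong to neither set, which keeps the two semantic conditions clearly separated.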
Another 96 word-nonword filler pairs were added, 48 of which were formally related, while the other 48 pairs were unrelated. To familiarize the participants with the task, 34 practice trials were also included. The nonwords in all experiments were created by mixing legal non-existing roots with existing word patterns. All the stimuli were divided into three lists, each containing a total of 226 pairs, half of which were words and half nonwords. The stimuli were rotated within the four conditions in a Latin-square design in such a way that each participant was assigned the same number and type of prime-target pairs. The stimuli in this and other experiments were presented in the unvowelled version of Arabic orthography, but care was taken to include only words that had a single reading.
Set 1
Condition            Prime                                                      Target
1. +Root+Sem         تقسّم [taqassama] “was divided”                            قاسم [qaasama] “shared”
2. +Orthog/+Phono    تقاعس [taqaaʕasa] “was uninterested”                       قاسم [qaasama] “shared”
3. Unrelated         تصدّر [taṣaddara] “occupied the leading position”          قاسم [qaasama] “shared”
Set 2
Condition            Prime                                                      Target
1. +Root–Sem         احترم [ʔiħtarama] “respected”                              حرّم [ħarrama] “forbid”
2. +Orthog/+Phono    تكرّم [takarrama] “showed one’s generosity”                حرّم [ħarrama] “forbid”
3. Unrelated         توطّد [tawaṭṭada] “was strengthened”                       حرّم [ħarrama] “forbid”
Figure 1. Examples of Prime-target Pairs Used in Study/Experiment 1, with Arabic Script, Phonetic Transcription, and Gloss
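The Latin-square rotation described in Section 4.3 can be illustrated with a minimal sketch. It assumes that each set of 24 targets is rotated across its three prime conditions and three lists, so that every participant sees each target once and an equal number of targets per condition. The function name and the cyclic-shift scheme are assumptions made for illustration; the text does not specify how the rotation was implemented.

```python
# Condition labels for one target set (Set 1 in Figure 1).
CONDITIONS = ["+Root+Sem", "+Orthog/+Phono", "Unrelated"]

def latin_square_lists(n_targets, conditions):
    """Return one {target_index: condition} mapping per list.

    List number `shift` pairs target t with condition (t + shift) mod k,
    which is a standard cyclic Latin-square rotation.
    """
    k = len(conditions)
    return [
        {t: conditions[(t + shift) % k] for t in range(n_targets)}
        for shift in range(k)
    ]

lists = latin_square_lists(24, CONDITIONS)

# Within a list: every target appears once, with 8 targets per condition.
for assignment in lists:
    counts = {c: 0 for c in CONDITIONS}
    for cond in assignment.values():
        counts[cond] += 1
    assert set(counts.values()) == {8}

# Across lists: each target is paired with each condition exactly once.
for t in range(24):
    assert {assignment[t] for assignment in lists} == set(CONDITIONS)

# The counts given in the text add up to the reported list size:
# experimental pairs + word-word fillers + word-nonword fillers + practice.
pairs_per_list = 48 + 48 + 96 + 34
print(pairs_per_list)  # 226
```

With 36 participants split evenly over three lists, each prime-target pairing is then seen by twelve participants.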
4.4 Procedure and apparatus
One third of the 36 participants were arbitrarily assigned to each of the three lists. They were tested individually in a quiet room. The participants were instructed to respond as quickly and as accurately as possible by pressing the Yes key for a word response and the No key for a nonword response. The dominant hand was used for word (Yes) responses and the non-dominant hand for nonword (No) responses. The experiment lasted about 15 minutes.
The experiment was conducted on an HP portable computer running the display system DMDX³. Each trial consisted of three events. The first event was a mask of 28 vertical lines (following Boudelaa & Marslen-Wilson 2001) that was displayed for 500 ms. The second event, which immediately followed, was a prime word that appeared for 50.25 ms. The last event, which immediately followed the prime, was a target word, which remained on the screen for 2000 ms or until a response was provided. The mask was presented in 30-point Traditional Arabic font, the prime in 24-point font and the target in 34-point font.
³ The DMDX software was developed by J. C. Forster at the University of Arizona.
4.5 Results
Mean correct response times (RTs) and mean error frequencies were obtained for both participants and items. Both types of data were analyzed using separate analyses of variance (ANOVAs). For correct responses, outliers that were more than two standard deviations above or below the mean were eliminated without being replaced. Participants who had an error rate above 20% on the experimental words were excluded and replaced. The effect of priming in the related conditions was evaluated against the orthographic condition. The means, standard deviations, and error rates for all conditions are presented in Table 1.
Three sets of two-way ANOVAs were run for subjects (F1) and items (F2). The two independent variables were prime condition and list, each with three levels. However, the effect of list will not be reported because this between-subjects factor was introduced only to reduce variance. To check whether the root had a special priming effect, I ran a set of ANOVAs on the first three conditions: (1) +Root (with either a transparent or an opaque semantic relationship), (2) +Orthography/+Phonology and (3) Unrelated. Prime condition was significant only in the subject analysis, F1(2, 66)=5.82, p.05).
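The data-cleaning and aggregation steps described in the Results can be sketched as follows. The sketch assumes that the two-standard-deviation cutoff is applied to a single pool of correct RTs; the text does not say whether trimming was done per participant, per condition, or over all data. The field names in the trial records are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def trim_outliers(rts, n_sd=2.0):
    """Drop RTs more than n_sd sample standard deviations from the mean."""
    m, sd = mean(rts), stdev(rts)
    lo, hi = m - n_sd * sd, m + n_sd * sd
    return [rt for rt in rts if lo <= rt <= hi]

def cell_means(trials, unit):
    """Mean correct RT per (unit, condition) cell.

    unit='subject' gives the input to a by-subjects analysis (F1);
    unit='item' gives the input to a by-items analysis (F2).
    """
    cells = defaultdict(list)
    for t in trials:
        if t["correct"]:
            cells[(t[unit], t["condition"])].append(t["rt"])
    return {key: mean(vals) for key, vals in cells.items()}

# Invented data, for illustration only.
rts = [600, 620, 610, 630, 590, 605, 615, 1500]
print(trim_outliers(rts))  # the 1500 ms response is removed

trials = [
    {"subject": "s1", "item": "i1", "condition": "+Root", "rt": 700, "correct": True},
    {"subject": "s1", "item": "i2", "condition": "+Root", "rt": 720, "correct": True},
    {"subject": "s1", "item": "i3", "condition": "+Root", "rt": 900, "correct": False},
]
print(cell_means(trials, "subject"))  # {('s1', '+Root'): 710}
```

The same trial records feed both analyses; only the grouping key changes between the subject (F1) and item (F2) means.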
Condition            RT (ms)   SD   % error
1. +Sound Pattern    727       90   7.2
2. +Weak Pattern     730       90   5.3
3. +Orthography      729       89   5.9
Table 2. Lexical Decision Reaction Times (RTs), Standard Deviations (SD), and Percentage Error Rates (% error) in Experiment 2A
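As a small worked example, the net priming effect in each related condition can be computed as the control RT minus the related RT, using the means reported in Table 2. The dictionary and function names are illustrative only.

```python
# Mean RTs (ms) as reported in Table 2.
TABLE_2_RT_MS = {
    "+Sound Pattern": 727,
    "+Weak Pattern": 730,
    "+Orthography": 729,  # control condition
}

def priming_effects(rts, control):
    """Return control RT minus related RT for each non-control condition.

    Positive values indicate facilitation relative to the control.
    """
    baseline = rts[control]
    return {cond: baseline - rt for cond, rt in rts.items() if cond != control}

effects = priming_effects(TABLE_2_RT_MS, "+Orthography")
print(effects)  # {'+Sound Pattern': 2, '+Weak Pattern': -1}
```

Both differences are within a couple of milliseconds of zero, which is the pattern the discussion below describes as an absence of priming.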
5.1.6 Discussion
The results show no difference between the two related conditions and the control condition. The lack of priming with a sound pattern (a facilitation of only 2 ms relative to the control) suggests that, unlike the root, the pattern does not play a role at this stage of lexical processing (50 ms). One possible explanation for the lack of priming with patterns could be the fact that the short vowels of the pattern are not orthographically represented. This result does not support the findings on Arabic reported by Boudelaa & Marslen-Wilson (2005). They found priming effects of the pattern at prime display times of 48 and 64 ms in deverbal nouns and only at an SOA of 48 ms in verbs. As Boudelaa & Marslen-Wilson’s stimuli were not published, I cannot compare their stimuli to mine. The results in Hebrew are equally intriguing. Studies on Hebrew showed that patterns had a priming effect at an SOA of 42-43 ms in verbs but not in nouns (Frost et al. 1997; Deutsch, Frost & Forster 1998). The lack of priming with Arabic patterns could be due to the fact that the overlap is minimal.
5.2 Experiment 2B
5.2.1 Objectives
Although I found no priming with sound patterns, I wanted to test whether priming would also be absent with exactly matching weak patterns, where the overlap is maximal. That is, I tested the condition in which both the prime and the target were weak and shared orthographic vowels and consonants. I also included a condition where primes and targets shared a weak pattern with a different site of weakness. Two slightly different morphological conditions were, therefore, included in addition to the control condition (see Figure 3, below). In the first related condition, both primes and targets shared the same patterns with the same site of weakness and therefore had very similar prosodic templates. In the second condition, primes
and targets shared the same patterns but differed as to the site of the weakness. The dissimilarity in the site of the weakness also affected the orthography. The site of the weakness is a long/orthographic vowel, particularly a long aa, which is written in two different ways: as an ʔalif ا in the middle of the word and as a yaaʔ ى at the end. This discrepancy in orthography was controlled in the three conditions by selecting half of the primes with an ʔalif and the other half with a yaaʔ. In the +orthography/phonology condition, primes shared the same number of letters with targets. The overlap was both in consonants and long vowels (24 verbs overlapped in long vowels and consonants and 24 in consonants).
5.2.2 Participants
Another 36 Arabic-speaking students from the same population as in the previous experiments volunteered to take part in this experiment. None of them participated in the other experiments.
5.2.3 Stimuli and design
The target words were 48 verbs that were derived from the following patterns: faʕala (24), ʔafʕala (11), ʔiftaʕala (9), ʔistafʕala (2), and ʔinfaʕala (2). They had a mean letter length of 3.81 and a mean syllable length of 2.5. Each target word was paired with three primes, one from each of the three conditions: (i) Same Pattern, with a weak root, (ii) Slightly Different Pattern, with a weak root, and (iii) +Orthography/Phonology. The primes that shared the same patterns with targets were, on average, 3.81 letters long and 2.50 syllables long. The mean letter and syllable lengths of the primes that shared slightly different patterns with targets were 3.81 and 2.35, respectively. In the +Orthography/Phonology condition, primes had a mean letter length of 3.92 and a mean syllable length of 2.94. The number, position, order, and continuity of the overlapping letters in the control condition mimicked as much as possible those in the morphological conditions. 
The average amount of prime-target overlap was 1.32 letters and 1.81 phonemes in the Same Pattern condition; 1.31 letters and 1.81 phonemes in the Slightly Different Pattern condition; and 1.35 letters and 1.82 phonemes in the Orthography/Phonology condition (see Figure 3, below). The familiarity score of the selected items ranged between 3.75 and 5.75 on a seven-point scale. The overall means of the targets and the
primes in the different conditions were as follows: 4.16 for the targets; 4.59 for the orthographically/phonologically related primes; 4.50 for the primes that shared a slightly different pattern with the targets; and 4.36 for the primes that shared the same pattern with the targets. As in the previous experiments, 48 unrelated word-word fillers were selected. Another 96 word-nonword filler pairs were added, 48 of which were formally related while the other 48 pairs were unrelated. There were also 34 practice pairs. The overlap in the related word-nonword filler pairs was either morphological or phonological. As in the experimental word-word pairs, the morphological overlap in these word-nonword fillers was in the shared word patterns. The phonological overlap was in some of the root consonants and affix consonants. The stimuli were finally divided into three lists. Each list was presented to a different group of twelve participants.
Condition                                        Prime                                   Target
1. Same pattern with weak root                   ارتمى [ʔirtamaa] “threw oneself”        انتقى [ʔintaqaa] “selected”
2. Slightly different pattern with weak root     احتاط [ʔiħtaaṭa] “was cautious”         انتقى [ʔintaqaa] “selected”
3. Control: +orthog/phono                        أتقن [ʔatqana] “mastered”               انتقى [ʔintaqaa] “selected”
Figure 3. Examples of Prime-target Pairs Used in Experiment 2B, with Arabic Script, Phonetic Transcription, and Gloss
5.2.4 Procedure and apparatus
The procedure was the same as in the previous experiments.
5.2.5 Results
The data cleaning led to the elimination of 4% of the data. The effect of priming in the related conditions was compared to the orthographic/phonological condition. The means, standard deviations, and error rates for all experimental conditions are presented in Table 3. The analysis of RT data showed a significant effect of the prime condition variable, F1 (2, 66)=4.20, p