Approaches to Phonological Complexity

edited by
François Pellegrino, Egidio Marsico, Ioana Chitoran and Christophe Coupé

Phonology and Phonetics 16
Editor: Aditi Lahiri

Mouton de Gruyter · Berlin · New York
Mouton de Gruyter (formerly Mouton, The Hague) is a Division of Walter de Gruyter GmbH & Co. KG, Berlin.
Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.
Library of Congress Cataloging-in-Publication Data

Approaches to phonological complexity / edited by François Pellegrino … [et al.].
p. cm. - (Phonology and phonetics; 16)
Includes bibliographical references and index.
ISBN 978-3-11-022394-1 (hardcover : alk. paper)
1. Complexity (Linguistics) 2. Phonetics. 3. Grammar, Comparative and general - Phonology. I. Pellegrino, François, 1971-
P128.C664A77 2009
414-dc22
2009043030
Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de.
ISBN 978-3-11-022394-1
ISSN 1861-4191

© Copyright 2009 by Walter de Gruyter GmbH & Co. KG, D-10785 Berlin.
All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Cover design: Christopher Schneider, Laufen.
Printed in Germany.
Table of contents

Introduction
François Pellegrino, Egidio Marsico, Ioana Chitoran and Christophe Coupé   1

Part 1: Complexity and phonological primitives

Complexity in phonetics and phonology: gradience, categoriality, and naturalness
Ioana Chitoran and Abigail C. Cohn   21

Languages' sound inventories: the devil in the details
John J. Ohala   47

Signal dynamics in the production and perception of vowels
René Carré   59

Part 2: Typological approaches to measuring complexity

Calculating phonological complexity
Ian Maddieson   85

Favoured syllabic patterns in the world's languages and sensorimotor constraints
Nathalie Vallée, Solange Rossato and Isabelle Rousset   111

Structural complexity of phonological systems
Christophe Coupé, Egidio Marsico and François Pellegrino   141

Scale-free networks in phonological and orthographic wordform lexicons
Christopher T. Kello and Brandon C. Beltz   171

Part 3: Phonological representations in the light of complex adaptive systems

The dynamical approach to speech perception: From fine phonetic detail to abstract phonological categories
Noël Nguyen, Sophie Wauquier and Betty Tuller   193

A dynamical model of change in phonological representations: The case of lenition
Adamantios Gafos and Christo Kirov   219

Cross-linguistic trends in the perception of place of articulation in stop consonants: A comparison between Hungarian and French
Willy Serniclaes and Christian Geng   241

The complexity of phonetic features' organisation in reading
Nathalie Bedoin and Sonia Krifi   267

Part 4: Complexity in the course of language acquisition

Self-organization of syllable structure: a coupled oscillator model
Hosung Nam, Louis Goldstein and Elliot Saltzman   299

Internal and external influences on child language productions
Yvan Rose   329

Emergent complexity in early vocal acquisition: Cross-linguistic comparisons of canonical babbling
Sophie Kern and Barbara L. Davis   353

Index   377

List of Contributors   383
Introduction

François Pellegrino, Egidio Marsico, Ioana Chitoran and Christophe Coupé

1. The study of complexity in phonology and phonetics

What is complex? What is not complex, or simple? Is there a gap between simple and complex? Or is complexity gradient? While universal answers to these questions are probably of limited relevance, their resolution in specific fields of research may be crucial, especially in biology or the social sciences, where complexity factors may play a highly significant role in the emergence and the evolution of systems, whatever they are (Edmonds, 1999).

In phonetics and phonology, these questions have been present for more than a century. For example, according to Zipf (1935:49), "there exists an equilibrium between the magnitude or degree of complexity of a phoneme and the relative frequency of its occurrence". In this controversial work, he thus tried to evaluate the magnitude of complexity of phonemes from articulatory effort (Zipf, 1935:66; but see also Joos, 1936), under the assumption that it plays a major role in phonetic changes as well as in the structure of phonological systems. Soon afterwards, Trubetzkoy reanalysed this interaction in terms of markedness (1938:282), leading the way to a long-lasting tradition of intricate relationships between the notions of markedness, frequency, complexity and functional load, well exemplified by this quotation from Greenberg, forty years later:

"Are there any properties which distinguish favored articulations as a group from their alternatives? There do, as a matter of fact, appear to be several principles at work. [There is one] which accounts for a considerable number of clusters of phonological universals (…) This is the principle that of two sounds that one is favored which is the less complex. The nature of this complexity can be stated in quite precise terms. The more complex sound involves an additional articulatory feature and, correspondingly, an additional acoustic feature which is not present in the less complex sound. This additional feature is often called a "mark" and hence the more complex, less favored alternative is called marked and the less complex, more favored alternative the unmarked. (…) It may be noted that the approach outlined here avoids the circularity for which earlier formulations, such as those of Zipf, were attacked. (…) In the present instance, panhuman preferences were investigated by formulating universals based in the occurrence or nonoccurrence of certain types, by text frequency and other evidence, none of which referred to the physical or acoustic nature of the sounds. Afterward, a common physical and acoustic property of the favored alternatives was noted employing evidence independent of that used to establish the universals" (Greenberg, 1969:476-477).
Indeed, the notion of phonological complexity is implicitly present in numerous works dealing with linguistic typology and universals (as in Greenberg's quotation), language acquisition (e.g. Demuth, 1995) and historical linguistics. Articulatory cost, perceptual distinctiveness and systemic constraints have thus been proposed as driving forces explaining sound changes (Lindblom & Maddieson, 1988; Lindblom, 1998:245), beside an undisputed social dimension. The role of such mechanisms has also been extended to the structure of language systems, leading some linguists to postulate a balance of complexity within language grammar, a lack of complexity in one component being compensated for by another, more complex component (e.g. Hockett, 1958:180-181). This assumption, however, is highly debated and still unresolved (Fenk-Oczlon & Fenk, 1999, 2005; Shosted, 2006).

One must nevertheless acknowledge that the word complexity itself has rarely been referred to explicitly, even though it underlies several salient advances in phonetics and phonology. For example, when Ohala pointed out that an 'exotic consonant inventory' such as { ɗ k' ts ɬ m r | } is not observed in languages with few consonants, he suggested that a principle of economy is at work at the systemic level (Ohala, 1980; but see also Ohala, this volume). Consequently, one can infer that the above system is too complex to be viable; but too complex with respect to what? And how is this complexity to be measured: is it a matter of the global number of articulatory features, of intrinsic phonemic complexity, of the overall size of the phonetic space used in the language? Contrary to what Greenberg suggested, measuring complexity is not straightforward, even when the problem is narrowed, for instance to articulatory complexity (e.g. Ohala, 1990:260), and we still lack relevant tools.

Lindblom and Maddieson (1988) began to address this question and proposed to divide consonants into three sets (simple, elaborated and complex) according to their articulatory complexity. They analysed the distribution of these segments across the UPSID database (Maddieson, 1984) and suggested that languages tend to pick their consonants and vowels from an adaptive phonetic space according to the number of elements in their inventories. This influential paper combined a typological survey and a theoretical attempt to decipher the mechanisms responsible for the observed patterns (see also Lindblom, 1998; Lindblom, 1999). In this sense, it built a bridge between the basic issue of complexity measurement and the use of methods fostered by physics and cybernetics to account for the general behaviour of languages, viewed as dynamical systems. In a late work, Jakobson judged that:

"Like any other social modelling system tending to maintain its dynamic equilibrium, language ostensively displays its self-regulating and self-steering properties. Those implicational laws which build the bulk of phonological and grammatical universals and underlie the typology of languages are embedded to a great extent in the internal logic of linguistic structures, and do not necessarily presuppose special 'genetic' instructions" (Jakobson, 1973:48).

Phenomena such as self-organisation, evoked above, and emergence, which also comes to mind in this view, are commonly found in the study of complex adaptive systems, a subfield of the science of complexity. These approaches connect the microscopic level (the components and their interactions) to the macroscopic level (the system and its dynamic behaviour), and they aim at explaining complex patterns with general mechanisms, without any teleological considerations1. As far as phonetics and phonology are concerned, these perspectives have already generated a noteworthy literature (e.g. Kelso, Saltzman & Tuller, 1986; Lindblom, 1999), and several recent developments are described in this book (mostly in Part 3). The next section provides some landmarks necessary to grasp the aims and content of this book.
2. Complex adaptive systems and the science of complexity

Since the middle of the twentieth century, scientists from numerous fields of research, ranging from physics to graph theory, and from biology to economics and linguistics, have built a web of theories, models and notions known today as the Science of Complexity. This paradigm pertains to our everyday experience, and has provided us with insights into phenomena as distinct – at first glance – as the properties of ferromagnetic materials with respect to temperature, the motion patterns of persons on crowded sidewalks or of fish schools, the social behaviour of ants or termites, and the fluctuations of financial markets (e.g. Markose, 2005; Theraulaz et al., 2002; Gazi and Passino, 2004). The strength of this approach probably lies in its protean capacity, an adaptability that has been described by Lass (1997:294) as "a syntax without a semantics" preventing any "ontological commitment". The exact scope of disciplines and methodologies which can potentially benefit from this new science is therefore not restricted, and a reanalysis of long-standing open issues in the light of complexity leads to exciting connections in most areas of research; linguistics is no exception.

The main focus of the Science of Complexity is the study of complex systems. A system is said to be complex when its overall behaviour exhibits properties that are not easily predicted from the individual description of its parts. Hence, a car is not really complex but just complicated: it consists of many interacting parts, but the behaviour of the car is predictable from its components (and that is why we can safely drive it). By contrast, when the same car is caught in a traffic jam, it becomes very difficult to predict the evolution of the blockage, and even the individual trajectory of this particular car: the interaction of the cars (and of their respective drivers) generates a complex collective pattern. An essential element of complex systems lies in the interaction between each component and its environment. Systems may differ in terms of the reactivity of their components – an ant cannot match a human being when it comes to analyzing and reacting to environmental input – but a minimal threshold has to be exceeded for complex behaviours to appear.

Besides, complex systems are generally explained by recourse to the notions of nonlinearity and emergence. Nonlinearity refers to phenomena for which the effect of a perturbation is not proportional to its initial cause, due to the complex network of interactions in which it is entangled. The famous butterfly effect, popularized in chaos theory, illustrates the sensitivity to initial conditions that derives from this property. Emergence refers to the appearance of structures at the overall level, from the interactions of the components of a dynamic complex system. Such structures can apply to relevant dimensions of the system – like its spatial organization – but can also unfold in time with the consistent occurrence of transient or stable states. These emergent properties often result from trade-offs between conflicting constraints and from self-organizing processes that can stabilize the system enough for such regularities to appear.
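The sensitivity to initial conditions just mentioned is easy to demonstrate numerically. The sketch below is our own minimal illustration (it is not drawn from any chapter of this volume): two trajectories of the classic logistic map, started from nearly identical values, become uncorrelated within a few dozen iterations.

```python
# Minimal illustration of nonlinearity and sensitivity to initial
# conditions: two trajectories of the logistic map x -> r*x*(1-x)
# (r = 4.0, the chaotic regime) diverge from a tiny perturbation.

def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map from x0 and return the full trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)
b = logistic_trajectory(0.200001)   # perturbation of one part in a million

for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:2d}: {a[step]:.6f} vs {b[step]:.6f}")
# After roughly thirty steps the two runs are no longer correlated: the
# effect is wildly disproportionate to the 1e-6 perturbation that caused it.
```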
Most complex systems do not follow deterministic paths, because the existence of degrees of freedom leads to a wide range of possible states in answer to internal and external constraints. Thus, various evolutionary trajectories may be observed, and from a given initial state, these systems may reach various final configurations, whose likelihood is a function of the self-organizing forces at hand. In other words, it is impossible to predict the evolution of a single system, but it is possible to draw reliable conclusions for a large enough set of them: a collection of complex systems may indeed exhibit a diversity of states, with some more frequent than others, and some very unlikely but still explainable in terms of probability laws for "rare events", etc2.

The human language faculty is a complex system, both as an outcome of interacting linguistic components within each individual and as a collective set of conventions resulting from the interactions among individuals. On the one hand, linguistic products themselves – words, sentences, sets of sentences – are the outputs of a cognitive system composed of linguistic components, as well as of a set of complex relationships between them. Competing pressures over lexicon and grammar (such as the articulatory/auditory constraints mentioned above) widely influence language production and understanding by human beings, as do dynamical processes (e.g. activation propagation and decay in the mental lexicon, or the interactive construction of sentence meaning from lexicon and grammar). On the other hand, language seen as a dynamical, distributed system of conventions in a community can also be analysed as a complex system, given the intricacy of the linguistic interactions taking place between speakers.

Indeed, the science of complexity has successfully addressed tremendous challenges in our understanding of the human language faculty. Theoretical approaches that integrate self-organization, emergence, nonlinearity, adaptive systems, information theory, etc., have already shed new light on the duality between the observed linguistic diversity and the human cognitive faculty of language. Most of the recent literature written in this framework focuses either on the syntactic level, addressed through computational complexity (Barton et al., 1987; Ristad, 1993; among others) or performance optimization (e.g. Hawkins, 2004), or explicitly on the emergence and evolution of language as a communication convention (e.g. Galantucci, 2005; Steels, 2005, 2006; Ke, Gong and Wang, 2008). Other linguistic components have been less thoroughly investigated, with Dahl (2004) and Oudeyer (2006) providing noteworthy exceptions that offer stimulating approaches to long-standing questions. However, no unified framework has yet come into sight, and the field is characterized by a wide variety of approaches.
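As an illustration of how shared conventions can emerge from purely local interactions, the sketch below is written in the spirit of the naming-game models associated with Steels (2005, 2006), cited above; it is our own toy, not code from this volume, and the agent count, game count and word format are invented. Agents repeatedly negotiate a name for a single shared object, and the population typically converges on one word with no central control.

```python
import random

# A minimal naming game (after Steels-style models): in each interaction
# a speaker utters a name for one shared object. On success both agents
# discard their other names; on failure the hearer adopts the name.

random.seed(1)
N_AGENTS, N_GAMES = 20, 3000
agents = [set() for _ in range(N_AGENTS)]   # each agent's known names

def new_name():
    return "w%04d" % random.randrange(10000)

for _ in range(N_GAMES):
    speaker, hearer = random.sample(range(N_AGENTS), 2)
    if not agents[speaker]:
        agents[speaker].add(new_name())     # invent if nothing is known yet
    name = random.choice(sorted(agents[speaker]))
    if name in agents[hearer]:              # success: both agents align
        agents[speaker] = {name}
        agents[hearer] = {name}
    else:                                   # failure: hearer learns the name
        agents[hearer].add(name)

vocabularies = {frozenset(a) for a in agents}
print("distinct vocabularies after %d games: %d" % (N_GAMES, len(vocabularies)))
# A single shared convention typically emerges, although no agent ever
# has a global view of the population.
```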
3. Goal and contribution of the present volume

This book is the first to propose an outline of this multi-faceted field of research within the general framework of phonetics and phonology. It is organized in four parts and covers a large spectrum of issues addressed by the community of specialists, in two directions shaped by the concepts of complexity and complex systems. The first branch ranges from the measurement of complexity itself to the assessment of its relevance as an explanation in typological phonology and in phylogenetic or ontogenetic trajectories. The second branch ranges from the quest for phonetic/phonological primitives to the dynamical modelling of speech communication (perception and/or production) as a complex system, in an attempt to explain phonetic and phonological processes through emergence and self-organization. Beyond this diversity, all the contributors to this book consider that the notions of complexity and complex adaptive systems today offer great potential for developing groundbreaking research on language and languages, to the extent that they may partially reveal the "invisible hand" behind the organization and evolution of speech communication – a metaphor borrowed from Adam Smith's work in economics and already developed in Keller (1994) in a diachronic perspective. As noted above, however, no unified framework exists yet, and the contributions gathered here bring together different pieces of the puzzle, investigated from several points of view and with several methodologies. A reflection on phonological complexity is consequently present in all chapters to some degree, and the analyses are always based on experimental data or cross-linguistic comparison.

In Part I, the question of the nature of the relevant primitives of sound systems is addressed in the light of complexity at the phonetics/phonology interface. In chapter 1, Ioana Chitoran and Abigail C. Cohn bring together a number of different notions that correspond to interpretations of phonological complexity (e.g., markedness, naturalness), building on them a clear and comprehensive overview of the main points of debate in phonetics and phonology. These debates revolve around: (i) the interaction between phonetics and phonology; (ii) their gradient vs. categorical nature; (iii) the role of phonetic naturalness in phonology; (iv) the nature of units of representation. Chitoran and Cohn argue that a clear and complete understanding of what complexity is in phonetics and phonology must necessarily engage these four points, and must take into account phenomena that have generally been interpreted as lying at the interface between phonetics and phonology. As such, it must crucially take variability into account.
Chapters 2 (John J. Ohala) and 3 (René Carré) in this section both address the issue of variability and challenge traditional representations of phonetic primitives. Ohala further develops the idea that the degree of complexity of a sound system should not be limited to the number and combination of distinctive features. Rather, one has to consider the balance between symmetry and economy, as described in phonology, and asymmetry and the absence of categorical boundaries, as found in phonetics. Starting from the idea that phonetic features which are distinctive in language X can be present non-distinctively in language Y, Ohala argues that phonetic variation must be included in a measure of phonological complexity, because it is part of a speaker's knowledge of the language. The concept of coarticulation, for example, is not entirely relevant for a phonological system, but the systematic variation it introduces in the speech signal can, over time, affect the composition of segmental inventories.

Carré (chapter 3) presents results from production and perception experiments suggesting that the identification of vowels in V1V2 sequences is possible based exclusively on dynamic stimuli, in the absence of static targets. Carré proposes that reliable information on vowel identities in V1V2 sequences lies in the direction and rate of transitions. He connects this finding to the known importance of transition rate in the identification of consonants. The implication of this connection is a possible unified theory of consonant and vowel representation based on the parameter of transition rate: consonants are characterized by fast transitions and vowels by slow transitions. Carré's dynamic approach thus presents an intriguing challenge to more traditional views of phonetic specification, based primarily on static primitives.

Part II starts with a contribution by Ian Maddieson, who proposes several factors contributing to phonological complexity, departing from the traditional counts of consonant and vowel inventories, tone systems or syllable canons. The approach benefits from tests on a large representative sample of the world's languages and from a thorough analysis of the literature. The first factor deals with "inherent phonetic complexity". The author proposes various ways of establishing a complexity scale for segments, on which one can then base a measure of system complexity by summing the complexities of its particular components. The second factor assesses the combinatorial possibilities of the elements (segments, tones, stress) present in a given phonological system; one possibility suggested by the author is to calculate the number of possible distinct syllables per language. The third factor focuses on the type frequencies of the different phonological elements of a system. The idea put forward by the author is that the complexity of a language with regard to a particular element is the inherent complexity of that element weighted by its frequency of occurrence in the lexicon. In other words, the more a language uses a complex element, the more complex the language is. The major concern then lies in how one calculates type frequencies for a large sample of languages: should they be based on lexicons or on texts? The last potential complexity factor is called "variability and transparency". It has to do with phonological processes rather than with inventories. The author suggests evaluating the motivations behind phonological alternations; these variations can be ranked from conditioned ones (highly motivated, thus less complex) to free variation (no motivation, thus more complex). This complexity value can be weighted against the number of resulting variants in the alternation, giving a combined score of variability and transparency. The author concludes by acknowledging that even if all the proposed complexity factors prove relevant, the main problem will remain how to combine them into one overall measure of complexity.
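To make the arithmetic of the first and third factors concrete, here is a toy illustration of our own: the segment scores and usage frequencies below are invented placeholders, and Maddieson's actual scales are considerably more articulated. System complexity is computed as the sum of per-segment complexity scores, optionally weighted by each segment's frequency of use.

```python
# Toy version of two of the proposed factors (illustrative values only):
# inherent segment complexity summed over the inventory, and the same
# scores weighted by how often each segment is actually used.

inherent = {"p": 1, "t": 1, "k": 1, "ts": 2, "kp": 3}   # hypothetical scores
usage    = {"p": 0.30, "t": 0.40, "k": 0.20, "ts": 0.07, "kp": 0.03}

raw_complexity = sum(inherent.values())
weighted_complexity = sum(inherent[s] * usage[s] for s in inherent)

print("unweighted inventory complexity:", raw_complexity)
print("frequency-weighted complexity:  %.2f" % weighted_complexity)
# A language that rarely uses its most complex segments scores lower on
# the weighted measure than one that uses them constantly, which is the
# intuition behind weighting inherent complexity by frequency.
```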
The second contribution of Part II is by Nathalie Vallée, Solange Rossato and Isabelle Rousset. It echoes the second factor proposed by Ian Maddieson, regarding the combinatorial possibilities of segments. The authors analyze languages' preferred sound sequences (syllabic or not) using a 17-language syllabified lexicon database (ULSID), in the light of the frame/content theory (MacNeilage, 1998). They focus on alternations of consonants and vowels, looking at their places of articulation. They confirm previously reported preferred associations, such as coronal consonants with front vowels, bilabial consonants with central vowels, and velar consonants with back vowels. They also examine the so-called "labial-coronal effect", according to which CV.CV words are predominantly composed of a labial first consonant and a coronal second one. Their data extend this result by showing the existence of the labial-coronal effect in other syllabic patterns as well. Finally, they look at sequences of plosive and nasal consonants, revealing preferred associations that question the validity of the sonority scale. To account for all their typological findings, the authors put forward convincing arguments from the articulatory, acoustic and perceptual domains; they conclude that the patterns of sound association encountered in the world's languages find their source partly outside of phonology, in the sensorimotor capacities that underlie them.
The third contribution, by Christophe Coupé, Egidio Marsico and François Pellegrino, departs from the two previous papers in that it does not aim at proposing a measure or scale of phonological complexity for phonological segments or sound patterns. The contributors instead consider phonological systems to be complex adaptive systems per se, and consequently propose to characterize their structure in the light of several approaches borrowed from this framework. The main rationale is that, by applying models designed outside phonology and linguistics to a typological database of phonological systems (namely UPSID), the influence of theoretical a prioris is limited, which allows data-driven patterns of organisation for the phonological systems to emerge. They propose two different approaches. The first one, inspired by graph theory, consists in analysing the structure of phonological systems by building graphs in which phonemes are nodes and connections are weighted according to the phonetic distance between these phonemes. Using a topological measure of complexity, this approach is used to compare the distribution of structural complexities among broad areal groups of languages. In the second approach, they model the content of phonological inventories by considering the distribution of co-occurrences of phonemes, in order to define attraction and repulsion relations between them. These relations are then used to propose a synchronic measure of coherence for phonological systems, which is then extended diachronically to a measure of stability. Emergent patterns of stability among phonological systems are demonstrated, supporting the claim that this approach extracts part of the intrinsic information present in the UPSID database while avoiding as much as possible the use of any linguistic a priori.
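The graph construction just described can be sketched in a few lines. The code below is our own illustration, not the authors' implementation: the mini-inventory, the feature vectors, and the use of mean edge weight as a stand-in for their topological complexity measure are all invented for the example.

```python
from itertools import combinations

# Sketch of a phoneme graph: nodes are phonemes, edges are weighted by a
# phonetic distance (here, the number of differing values over toy
# feature triples). The features and the summary statistic are
# illustrative placeholders, not UPSID's coding or the chapter's measure.

features = {
    "p": ("labial",  "stop",      "voiceless"),
    "b": ("labial",  "stop",      "voiced"),
    "t": ("coronal", "stop",      "voiceless"),
    "s": ("coronal", "fricative", "voiceless"),
}

def distance(a, b):
    """Count the feature values on which two phonemes differ."""
    return sum(x != y for x, y in zip(features[a], features[b]))

edges = {(a, b): distance(a, b) for a, b in combinations(features, 2)}
for pair, w in sorted(edges.items(), key=lambda kv: kv[1]):
    print(pair, "weight", w)

# One crude structural index: the mean pairwise distance of the system.
print("mean distance: %.2f" % (sum(edges.values()) / len(edges)))
```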
The last contribution of this second part is by Christopher T. Kello and Brandon C. Beltz, who propose an exciting hypothesis about the dynamical equilibrium relating phonological systems and phonotactics, on the one hand, to the process of word formation in the lexicon, on the other. Their approach, like that of Coupé, Marsico and Pellegrino, imports mathematical graph theory into linguistics. Phenomena exhibiting behaviour described by power laws are widespread in physics, biology and social systems. When observed, these laws generally signify that a principle of least effort is operating, and that a dynamical equilibrium results from the interaction between several competing constraints. Kello and Beltz observe power-law behaviours in the word forms and phonological networks of American English, built according to inclusion relations between lexical items (in contrast with semantic or purely morphological rules). As the contributors argue, this result may stem from a trade-off between distinctiveness and efficiency pressures. In other words, a "valid" language deals both with the need to maintain sufficient distance between the words of its lexicon and with a constraint of parsimony leading to the reuse of existing phonotactic or orthographic sequences. Their hypothesis is extended to the lexical networks of four other languages, and then assessed by comparison with artificial networks. While power laws were first shown in lexical data in Zipf's seminal work (Zipf, 1949), Kello and Beltz's work goes further by demonstrating that several kinds of constraints interact and generate the same type of behaviour in word formation mechanisms. In a sense, this study fills part of the gap between the lexicon and the phonology of a language, and provides a convincing link that will be essential for developing a systemic view of languages able to take all linguistic components into account.
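A quick way to see whether a network is a candidate for this kind of scale-free structure is to examine its degree distribution on log-log axes, where a power law appears as a straight line. The sketch below runs on synthetic data of our own making – it is not Kello and Beltz's procedure or data – and estimates the exponent by least squares on the log-log histogram, a rough first diagnostic rather than a rigorous fit.

```python
import math, random, collections

# Rough power-law diagnostic: draw degrees from P(k) ~ k^-2.5 by
# inverse-transform sampling, then estimate the exponent from a
# least-squares line through the log-log degree histogram.

random.seed(0)
gamma = 2.5
degrees = [int(random.random() ** (-1.0 / (gamma - 1.0)))
           for _ in range(20000)]

counts = collections.Counter(d for d in degrees if 1 <= d <= 1000)
xs = [math.log(k) for k in counts]           # log degree
ys = [math.log(n) for n in counts.values()]  # log frequency

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
print("estimated exponent: %.2f (true value %.1f)" % (-slope, gamma))
# Binned least squares is biased for heavy tails; maximum-likelihood
# estimators are preferred in careful work, but the straight line on
# log-log axes remains the classic first look.
```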
Part III is specifically dedicated to approaches that aim at revealing the nature and organisation of human phonological representations in a multidisciplinary framework and in the light of complexity.

Noël Nguyen, Sophie Wauquier and Betty Tuller's contribution develops a dynamical approach to explore the nature of the representations activated during speech perception. In the first section, they set out the debate between abstractionist and exemplar-based models of speech perception. Since arguments exist in favour of both of these antagonistic hypotheses, they argue that this situation results from the dual nature of speech perception. In this view, phonetic details are retained, not as exemplars, but as a dynamical tuning of a complex and continuous "shape", while abstractionist-like behaviour is also possible, based on the existence of several stable attractors. A dynamical model is developed, and a review of several speech categorization tasks is proposed. The existence of a hysteresis cycle in the behavioural performances observed during these tasks indicates that perception does not operate in a simple deterministic manner, since it is sensitive to the previous state of the system in a way typical of nonlinear dynamical systems. These results strongly support the proposal of a hybrid, dynamical model of speech perception bringing together the properties of both exemplar and abstractionist models.

In connection with John Ohala's contribution in the first part of this book, and with the dynamical model of speech perception detailed in the previous chapter (Nguyen, Wauquier and Tuller), Adamantios Gafos and Christo Kirov implement a nonlinear dynamical model of phonetic change, illustrated with the case of lenition. They assume that phonological representations consist of feature-like components that can be modelled using activation fields borrowed from dynamic field theory and governed by differential equations. In their view, production/perception loops self-generate the well-known word-frequency effect reported for lenition. More specifically, the interaction between field activation (biased toward the inputs of the perception stage) and memory decay is the backbone that enables gradual phonetic change. Production/perception loops are thus responsible both for the potential shift of the phonetic realization and for the positive feedback that leads to the emergence of a new stable variant of the phonetic parameters.

The third contribution of this part, from Willy Serniclaes and Christian Geng, investigates the bases of categorical boundaries in the perception of the place of articulation of stop consonants. It compares the perceptual boundaries of Hungarian and French, using artificial stimuli differing in terms of formant transitions and generated with the DRM model (see René Carré's contribution in the first part of this volume). Four places of articulation are distinctive in Hungarian, while only three are phonologically relevant in French. Consequently, comparing the positions of their boundaries is informative about the influence of universal phonetic predispositions on the organisation of phonological categories. Results show that the perceptual boundaries are similar for the two languages, dividing the formant transition space into three salient areas. As it happens, the palatal-alveolar boundary is not as salient as the other boundaries, and an additional feature (besides burst and formant transition) probably plays a role. These results are discussed from the perspective of the emergence of distinctive boundaries through coupling between natural phonetic boundaries; they also echo John Ohala's contribution on the importance of so-called secondary features in language evolution (see the first part of this book).

Nathalie Bedoin and Sonia Krifi's contribution deals with the fundamental issue of the organisation of phonetic features, as revealed in the context of reading tasks. They provide a thorough review of this literature, and a series of visual priming and metalinguistic experiments. These experiments explore the temporal course of reading by manipulating not only phonetic feature similarity between primes and targets, but also the nature of these features.
Taken as a whole, their results suggest that voicing, manner and place are processed at different rates, and that a complex pattern of activation propagation and lateral inhibition is involved. More specifically, voicing seems to be processed first but, depending on the experimental conditions, a prominent impact of manner over place and voicing may also be evidenced when processing time is no longer a relevant factor. Nathalie Bedoin and Sonia Krifi's contribution thus highlights the complexity of the organisation of phonetic features in both the temporal and the hierarchical dimensions. Additional information is provided by the replication of these experiments with second- and third-grade children, revealing the gradual settling of the underlying processes during language acquisition and development.

The relevance of the approaches to phonology borrowed from the science of complexity can only be assessed by evaluating whether such models succeed in tackling some of the challenges that limit our knowledge and understanding of the human language capacity and of linguistic diversity. If granting a significant role to complexity is correct, one of the most salient fields in which it will radically change our comprehension is the domain of language acquisition, especially along two directions. First, computational dynamical models of the emergence of linguistic patterns may assess hypotheses related to the mechanisms of linguistic bootstrapping (e.g. Morgan & Demuth, 1996; Pierrehumbert, 2003). Second, cross-linguistic comparison of courses of language acquisition may reveal universal tendencies, not necessarily in terms of phonological units (gestures, features, segments or syllables) but in terms of their intrinsic complexity and of their interactions in the communication system. In the longer run, these two approaches will probably give rise to unified models of phonological acquisition; they have already reached significant results on the balance between universal and language-specific constraints in acquisition, as shown in Part IV.

In the first paper, Hosung Nam, Louis Goldstein and Elliot Saltzman promote a dynamical model of the acquisition of syllable structures, compatible with what is attested in the world's languages. More specifically, the emergence of asymmetries between the frequencies of syllables with onsets (CV structure) versus syllables with codas (VC structure) is observed with their model, which avoids the partially circular notion of the unmarkedness of the CV structure. These effects emerge as a consequence of the interaction between the ambient language and the intrinsic characteristics of the oscillators that control the phasing of the articulatory gestures in the "child model".
By implementing a nonlinear coupling between these oscillators, multiple stable modes can emerge as attractors, given the generic assumption that in-phase and anti-phase coordination of gestures are preferred. Without additional hypotheses, differences in the duration of the acquisition of CV and VC structures also emerge, regardless of the target ambient distribution. The computational model is then shown to efficiently model the faster acquisition of VCC over CCV. Hence, the model successfully reproduces two seemingly contradictory phenomena regarding the course of acquisition of CV vs. VC structures on the one hand, and of CCV vs. VCC on the other. In this framework, the contributors demonstrate that data from linguistic typology and from longitudinal studies of language acquisition can foster methodologies inspired by complex adaptive systems in an extraordinarily fruitful approach.
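The in-phase preference mentioned above is conventionally modelled with coupled phase oscillators. The sketch below is our own one-mode toy of that generic idea, not Nam, Goldstein and Saltzman's implementation (their model also stabilizes anti-phase coordination, and the coupling constant and step size here are arbitrary): with a simple sinusoidal coupling, the relative phase of two oscillators relaxes to zero from almost any starting value, showing how a stable coordination mode acts as an attractor.

```python
import math

# Two coupled phase oscillators (a Kuramoto-style toy): the relative
# phase evolves as d(phi)/dt = -K * sin(phi), so phi = 0 (in-phase
# coordination) is a stable attractor of the dynamics.

K, DT = 1.0, 0.01   # arbitrary coupling strength and integration step

def settle(phi0, steps=2000):
    """Integrate the relative-phase dynamics and return the final phase."""
    phi = phi0
    for _ in range(steps):
        phi -= K * math.sin(phi) * DT
    return phi

for start in (0.5, 1.5, 2.5, -2.0):
    print("start %+.1f rad -> settles at %+.3f rad" % (start, settle(start)))
# Every initial condition (except the unstable point at pi) relaxes to 0:
# in-phase coordination emerges without being stipulated trajectory by
# trajectory, which is the sense in which stable modes are attractors.
```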
The two last chapters of this book (contributed by Yvan Rose, and by Sophie Kern and Barbara L. Davis) do not implement computational models. However, they thoroughly explore the driving forces underlying phonological acquisition in a multi-language framework. Yvan Rose argues for a necessary synthesis between diverging approaches, and he urges the development of a multi-faceted approach in order to overcome some failures of current approaches in accounting for patterns observed during early phonological acquisition. Reanalysing several papers from the literature on early acquisition, he suggests that the role of the statistical patterns of the ambient language has been overestimated, and he introduces an alternative explanation based on structural complexity. In the rest of the chapter, the contributor discusses a series of phonological patterns taken from published data on early acquisition in terms of interactions of driving forces grounded in several potentially relevant "facets" (articulation, perception, statistics of the ambient language, child grammar as a cognitive system). Yvan Rose's contribution thus offers a strong argument in favour of the multi-faceted approach, and a rich and stimulating interpretation of existing data built upon factors of phonological complexity.

Sophie Kern and Barbara L. Davis's contribution tackles the issue of cross-linguistic variability in canonical babbling, thanks to an unprecedented amount of empirical data from five languages. The contributors take advantage of this unique material to investigate the universality and/or language-specificity of canonical babbling. The theoretical bases supporting the existence of universal driving forces are introduced and developed in the vein of the Frame/Content perspective, and the impact of the ambient language is discussed on the basis of a review of the literature. More specifically, Sophie Kern and Barbara L. Davis highlight that the lack of common ground strongly limits the cross-linguistic relevance of these studies (because of the different procedures applied in different languages). They then analyse the similarities and differences observed in their data, at the segmental level and in terms of subphonemic and phonemic co-occurrences in the babbling structures. The discussion of these results draws a coherent picture that emphasizes the role of speech-like prelinguistic babbling as a first step into language complexity, but one dominated by universal characteristics of the production system.

The editors would like to warmly thank all the authors of this volume. We are also greatly indebted to the colleagues who contributed as reviewers for the submitted chapters: René Carré, Barbara Davis, Christian DiCanio, Christelle Dodane, Emmanuel Ferragne, Cécile Fougeron, Adamantios Gafos, Ian Maddieson, Noël Nguyen, Pierre-Yves Oudeyer, Gérard Philippson, Yvan Rose, Willy Serniclaes, Caroline Smith and Kenny Smith (in alphabetical order), and to the participants of the workshop "Phonological Systems & Complex Adaptive Systems", held in Lyon in July 2005, and more specifically Didier Demolin, Björn Lindblom and Sharon Peperkamp, for their comments. We also thank Aditi Lahiri for her thorough and fruitful suggestions, and Mouton de Gruyter's anonymous reviewer. The editors fully acknowledge the financial support of the French ACI "Systèmes complexes en SHS" and of the French Agence Nationale de la Recherche (project NT053_43182 CL², P.I. F. Pellegrino).
Notes

1. It may be more correct to state that these approaches imply a limited teleology, in the sense that they are often based on the optimization of a given criterion and can thus be seen as 'targeted' to this optimization. See Blevins (2004:71-78) for a thorough discussion of the nature of teleological and functional explanations in sound change.

2. This statement can obviously be put in perspective with considerations on language universals and the distribution of patterns among languages; see, e.g., Greenberg: "In general one may expect that certain phenomena are widespread in language because the ways they can arise are frequent and their stability, once they occur, is high. A rare or non-existent phenomenon arises only by infrequently occurring changes and is unstable once it comes into existence. The two factors of probability of origin from other states and stability can be considered separately" (Greenberg, 1978:75-76).
References

Barton, G. Edward, Robert Berwick & Eric Sven Ristad
1987 Computational Complexity and Natural Language. The MIT Press: Cambridge, MA.
Blevins, Juliette
2004 Evolutionary Phonology: The Emergence of Sound Patterns. Cambridge University Press: New York.
Dahl, Östen
2004 The Growth and Maintenance of Linguistic Complexity. Studies in Language Companion Series 71. John Benjamins.
Demuth, Katherine
1995 Markedness and the development of prosodic structure. In Proceedings of the North East Linguistic Society, Jill N. Beckman (ed.). Amherst: Graduate Linguistic Student Association. pp. 13-25.
Fenk-Oczlon, Gertraud & August Fenk
1999 Cognition, quantitative linguistics, and systemic typology. Linguistic Typology 3.2: 151-177.
2005 Crosslinguistic correlations between size of syllables, number of cases, and adposition order. In Sprache und Natürlichkeit, Gedenkband für Willi Mayerthaler, G. Fenk-Oczlon & Ch. Winkler (eds). Tübingen.
Galantucci, Bruno
2005 An experimental study of the emergence of human communication systems. Cognitive Science 29: 737-767.
Gazi, Veysel & Kevin M. Passino
2004 Stability analysis of social foraging swarms. IEEE Transactions on Systems, Man, and Cybernetics Part B 34.1: 539-557.
Greenberg, Joseph H.
1969 Language universals: a research frontier. Science 166: 473-478.
1978 Diachrony, synchrony, and language universals. In Universals of Human Language, J. H. Greenberg, C. A. Ferguson & E. A. Moravcsik (eds.), Vol. 1. Stanford University Press: Stanford. pp. 61-93.
Hawkins, John A.
2004 Efficiency and Complexity in Grammars. Oxford University Press: Oxford.
Hockett, Charles F.
1958 A Course in Modern Linguistics. The MacMillan Company: New York.
Jakobson, Roman
1973 Main Trends in the Science of Language. Main Trends in the Social Sciences Series. Harper & Row: New York.
Joos, Martin
1936 Review of The Psycho-Biology of Language by George K. Zipf. Language 12.3: 196-210.
Ke, Jinyun, Tao Gong & William S-Y. Wang
2008 Language change and social networks. Communications in Computational Physics 3.4: 935-949.
Keller, Rudi
1994 On Language Change: The Invisible Hand in Language. Routledge: London & New York.
Kelso, J. A. S., E. L. Saltzman & B. Tuller
1986 The dynamical perspective on speech production: data and theory. Journal of Phonetics 14.1: 29-59.
Lass, Roger
1997 Historical Linguistics and Language Change. Cambridge University Press: Cambridge.
Lindblom, Björn
1998 Systemic constraints and adaptive change in the formation of sound structure. In Approaches to the Evolution of Language, J. R. Hurford, M. Studdert-Kennedy & C. Knight (eds). Cambridge University Press: Cambridge. pp. 242-264.
1999 Emergent phonology. Perilus XXII: 1-15.
Lindblom, Björn & Ian Maddieson
1988 Phonetic universals in consonant systems. In Language, Speech, and Mind, L. M. Hyman & C. N. Li (eds). Routledge: New York. pp. 62-78.
MacNeilage, Peter F.
1998 The frame/content theory of evolution of speech production. Behavioral and Brain Sciences 21: 499-511.
Maddieson, Ian
1984 Patterns of Sounds. Cambridge University Press: Cambridge.
Markose, Sheri M.
2005 Computability and evolutionary complexity: markets as complex adaptive systems (CAS). The Economic Journal 115.504: F159-F192.
Morgan, James L. & Katherine Demuth
1996 Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition. Lawrence Erlbaum Associates: Mahwah.
Ohala, John J.
1980 Moderator's summary of symposium on 'Phonetic universals in phonological systems and their explanation'. In Proceedings of the 9th International Congress of Phonetic Sciences, Vol. 3. Institute of Phonetics: Copenhagen. pp. 181-194.
1990 The phonetics and phonology of aspects of assimilation. In Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech, J. Kingston & M. Beckman (eds). Cambridge University Press: Cambridge. pp. 258-265.
Oudeyer, Pierre-Yves
2006 Self-Organization in the Evolution of Speech. Studies in the Evolution of Language. Oxford University Press. (Translation by James R. Hurford.)
Pierrehumbert, Janet B.
2003 Phonetic diversity, statistical learning, and acquisition of phonology. Language and Speech 46.2-3: 115-154.
Ristad, Eric S.
1993 The Language Complexity Game. The MIT Press: Cambridge, MA.
Shosted, Ryan K.
2006 Correlating complexity: a typological approach. Linguistic Typology 10.1: 1-40.
Steels, Luc
2005 The emergence and evolution of linguistic structure: from lexical to grammatical communication systems. Connection Science 17.3: 213-230.
2006 Experiments on the emergence of human communication. Trends in Cognitive Sciences 10.8: 347-349.
Theraulaz, Guy, Eric Bonabeau, Stamatios C. Nicolis, Ricard V. Solé, Vincent Fourcassié, Stéphane Blanco, Richard Fournier, Jean-Louis Joly, Pau Fernández, Anne Grimal, Patrice Dalle & Jean-Louis Deneubourg
2002 Spatial patterns in ant colonies. PNAS 99: 9645-9649.
Trubetzkoy, Nikolai S.
1938 Grundzüge der Phonologie. (French edition, 1970. Klincksieck: Paris.)
Zipf, George K.
1935 The Psycho-Biology of Language: An Introduction to Dynamic Philology. MIT Press: Cambridge. (First MIT Press paperback edition, 1965.)
1949 Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley: Cambridge.
Part 1: Complexity and phonological primitives
Complexity in phonetics and phonology: gradience, categoriality, and naturalness

Ioana Chitoran and Abigail C. Cohn

1. Introduction

In this paper, we explore the relationship between phonetics and phonology, in an attempt to determine possible sources of complexity that arise in sound patterns and sound systems. We propose that in order to understand complexity, one must consider phonetics and phonology together, in their interaction. We argue that the relationship between phonology and phonetics is a multi-faceted one, which in turn leads us to a multi-faceted view of complexity itself. Our goal here is to present an overview of the relevant issues in order to help define a notion (or notions) of complexity in the domain of sound systems, and to provide a backdrop to a constructive discussion of the nature of complexity in sound systems.

We begin in §2 by considering possible definitions of phonological complexity based on the different interpretations that have been given to this notion. The issue of complexity has previously been addressed, implicitly or explicitly, through notions such as markedness, effort, naturalness, and information content. Concerns with a measure of phonological or phonetic complexity are therefore not new, even though the use of the term "complexity" per se to refer to these questions is more recent. In this section we survey earlier endeavors in these directions.

We then turn to the multi-faceted nature of the relationship between phonology and phonetics. In this regard, we address two main questions that have traditionally played a central part in the understanding of the phonetics-phonology relationship. The first of these, addressed in §3, is the issue of gradience vs. categoriality in the domain of linguistic sound systems, and its implications for the question of an adequate representation of linguistic units. In §4 we discuss the second issue: the role of phonetic naturalness in phonology. These are major questions, and we do not attempt to provide a comprehensive treatment here. Rather, our goal is simply to consider them in framing a broader discussion of phonological complexity.
In §5 we return to the question of complexity and consider the ways in which measures of complexity depend on the type of unit and representation considered. The conclusion of our discussion highlights the multi-faceted nature of complexity: more than one type of unit and more than one type of measure are relevant to any characterization of complexity.

2. Definitions of complexity

In our survey of earlier implicit and explicit definitions of complexity, we review past attempts to characterize the nature of phonological systems. We discuss earlier concerns with complexity in §2.1; then we turn in §2.2 to the issue of theoretical framing in typological surveys, where we compare two types of approaches: theory-driven and data-driven ones. The first type is illustrated by Chomsky and Halle's (1968) The Sound Pattern of English (SPE), and the second by Maddieson's (1984) Patterns of Sounds.

2.1. Early approaches to complexity

A concern with complexity in phonetics and phonology can be traced back to discussions of several related notions in the literature: markedness, effort, naturalness, and more recently, information content. While none of these notions taken individually can be equated with complexity, there is an intuitive sense in which each one of them can be considered a relevant element to be included in the calculation of complexity.

Studies of phonological complexity started from typological surveys, which led to the development of the notion of markedness in phonological theory. The interpretation of markedness as complexity is implicit in the original understanding of the term, the sense in which it is used by Trubetzkoy (1939, 1969): the presence of a phonological specification (a mark) corresponds to higher complexity in a linguistic element. Thus, to take a classic example, voiced /d/ is the more complex (marked) member of an opposition relative to voiceless (unmarked) /t/.

Later, the interpretation of markedness as complexity referred to coding complexity (see Haspelmath, 2006 for a detailed review). Overt marking or coding is seen to correspond to higher complexity than no coding or zero expression. This view of complexity was adopted and further developed into the notion of iconicity of complexity, recently critiqued by Haspelmath (to appear).
What is relevant for the purposes of our paper is the actual use of the terms complex and complexity in this literature. Several of the authors cited in Haspelmath (2006; to appear) use these terms explicitly. Thus, Lehmann (1974) maintains that there is a direct correlation between complex semantic representation and complex phonological representation. Givón (1991) treats complexity as tightly related to markedness. He considers complex categories to be those that are "cognitively marked", and that tend to be "structurally marked" at the same time. Similarly, in Newmeyer's formulation: "Marked forms and structures are typically both structurally more complex (or at least longer) and semantically more complex than unmarked ones" (Newmeyer, 1992:763). None of these discussions includes an objective definition of complexity. Only Lehmann (1974) proposes that complexity can be determined by counting the number of features needed to describe the meaning of an expression, where the term feature is understood in very broad, more or less intuitive terms. The study of complexity through the notions of markedness or iconicity has not been pursued further, and as highlighted by both Hume (2004) and Haspelmath (2006), neither notion constitutes an explanatory theoretical tool.

Discussions of complexity in the earlier literature have also focused on the notion of effort, which has at times been invoked as a diagnostic of markedness. It is often assumed, for example, that phonetic difficulty corresponds to higher complexity, and that things which are harder to produce are therefore marked. While many such efforts are informal, see Kirchner (1998/2001) for one attempt to formalize and quantify the notion of effort. Ironically, however, Jakobson himself criticized the direct interpretation of this idea as the principle of least effort, adopted in linguistics from the 18th-century naturalist Georges-Louis Buffon:

"Depuis Buffon on invoque souvent le principe du moindre effort: les articulations faciles à émettre seraient acquises les premières. Mais un fait essentiel du développement linguistique du bébé contredit nettement cette hypothèse. Pendant la période du babil l'enfant produit aisément les sons les plus variés…" (Jakobson, 1971:317) ["Since Buffon, the principle of least effort has often been invoked: articulations that are easy to produce are supposedly the first to be acquired. But an essential fact about the child's linguistic development strictly contradicts this hypothesis. During the babbling stage the child produces with ease the most varied sounds…"]1
Jakobson's critique is now substantiated by experimental work showing that articulatory effort is not necessarily avoided in speech production. Convincing evidence comes from articulatory speech error experiments carried out by Pouplier (2003). Pouplier's studies show that speech errors do not involve restricted articulator movement. On the contrary, in errors speakers often add an extra gesture, resulting in an even more complex articulation, but in a more stable mode of intergestural coordination. Similarly, Trubetzkoy cautions against theories that simply explain the high frequency of a phoneme by the less difficult production of that phoneme (Trubetzkoy, 1969, chapter 7). He advocates instead a more sophisticated approach to frequency counts, which takes into account both the real frequency of a phoneme and its expected frequency:

"The absolute figures of actual phoneme frequency are only of secondary importance. Only the relationship of these figures to the theoretically expected figures of phoneme frequency is of real value. An actual phoneme count in the text must therefore be preceded by a careful calculation of the theoretical possibilities (with all rules for neutralization and combination in mind)" (Trubetzkoy, 1969:264).
We return to this view below, in relation to Hume's (2006) proposal of information content as a basis for markedness. In general, however, the usefulness of insights gained by considering speculative notions such as effort, described in either physical or processing terms, has been limited. Nevertheless, these attempts have at least served to show, as Maddieson (this volume) points out, that "difficulty can itself be difficult to demonstrate".

Another markedness diagnostic that has been related to complexity is naturalness. Even though the term naturalness is explicitly used, it overlaps on the one hand with the diagnostic of effort and phonetic difficulty, and on the other hand with frequency. The discussion of naturalness can be traced back to Natural Phonology (Donegan and Stampe, 1979, among others). A natural, unmarked phenomenon is one that is easier in terms of the articulatory or acoustic processes it involves, but also one that is more frequent. In the end it becomes very difficult to tease the two concepts apart, revealing a risk of circularity: processes are natural because they are frequent, and they are frequent because they are natural.

Information content is proposed by Hume (2006) as an alternative to markedness. In her proposal she accepts Trubetzkoy's challenge, trying to determine a measure of the probability of a phoneme, rather than just its frequency of occurrence. She argues that what lies at the basis of markedness is information content, a measure of the probability of a particular element in a given communication system. The higher the probability of an element, the lower its information content; conversely, the lower its probability, the higher its information content. Markedness diagnostics can thus be replaced by observations about probability, which can be determined on the basis of a number of factors.2 While the exact nature of these factors, their interaction, and the specific definition of probability require further empirical investigation, it is plausible to hypothesize a relationship between complexity and probability. For example, if low probability correlates with higher information content, then it may in turn correlate with higher complexity. At the same time, a related hypothesis, signalled by Pellegrino et al. (2007), needs to be tested: it is possible that information rate (the quantity of information per unit per second) may turn out to be more relevant than, or closely related to, information content (the quantity of information per unit).
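The information-content idea has a standard formal core: the surprisal of an element is the negative log of its probability. The snippet below is our illustration of that textbook relation (the phoneme probabilities are invented), not Hume's own computation, which weighs several factors beyond raw frequency.

```python
import math

# Shannon surprisal: I(x) = -log2 p(x). Rare elements carry more
# information; frequent ones carry less. Probabilities are invented
# for illustration and sum to 1 over this toy mini-inventory.

p = {"t": 0.40, "k": 0.30, "s": 0.20, "ts": 0.10}

for phoneme, prob in sorted(p.items(), key=lambda kv: -kv[1]):
    surprisal = -math.log2(prob)
    print("%-3s p=%.2f  information content = %.2f bits" %
          (phoneme, prob, surprisal))
# The least probable element (/ts/) has the highest information content,
# the hypothesized correlate of higher complexity.
```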
In general, in theory-driven approaches, complexity is defined through a particular formal framework, and thus the insights gained are inevitably limited by the set of operational assumptions.

A data-driven study of phonological complexity is Maddieson’s (1984) Patterns of Sounds and the UPSID database on which it is based. The database focuses on the segment, so the implicit measure of complexity involves counting segments. This raises the crucial issue of representation, to which we return later. In Maddieson’s survey, “each segment considered phonemic is represented by its most characteristic allophone” (Maddieson, 1984:6). The representative allophone is determined by weighing several criteria: (i) the allophone with the widest distribution, when this information is available; (ii) the allophone most representative of the phonetic range of variation of all allophones; (iii) the allophone from which the others can be most easily derived. Maddieson thus codifies an atheoretical, descriptive definition of the segment, adopting a somewhat arbitrary, intermediary level of representation between phonology and phonetics, that is, in between the underlying contrastive elements and the phonetic output characterizable as a string of phones. The database captures the output of the phonology, a discrete allophonic representation which is neither purely phonemic nor purely phonetic, described as “phonologically contrastive segments (…) characterized by certain phonetic attributes” (Maddieson, 1984:160). Following Maddieson’s example, linguists have continued to make sophisticated use of typological surveys for many purposes, including that of evaluating complexity (e.g., Lindblom and Maddieson, 1988; Vallée, 1994; Vallée et al., 2002; Marsico et al., 2004).
2.3. Summary

Both theory-driven and data-driven approaches can offer useful insight into the nature and organization of phonological systems. It is also important to bear in mind the implicit assumptions even in what are taken to be “data-driven” approaches. (See also Hayes and Steriade, 2004, pp. 3-5, for discussion of inductive vs. deductive approaches to the study of markedness.) One critical aspect of these efforts is the question of the relevant linguistic units in measuring complexity. This question is addressed explicitly by Marsico et al. (2004) and Coupé et al. (this volume). Feature-hood and segment-hood can both tell us something about complexity, but neither concept is as clear-cut as often assumed. Under many views (such as SPE), features are taken as primitives, and segments are built out of bundles of features. Other views take the segment to be primary, or even suggest that segments are epiphenomenal, as is argued by some exemplar theorists. We take the view that in adult grammar, both segments and features have a role to play in characterizing the inventories and patterns of sound systems. As seen above, the question of the nature of segments is also a complex one: do we mean underlying contrastive units, or something more concrete, such as Maddieson’s surface allophones? The question about the nature of segments leads to broader questions about the nature of phonology and phonetics and their relationship, to which we turn in the next section.
3. The relationship between phonology and phonetics

Chomsky and Halle provided an explicit answer about the nature of representations, drawing a distinction between underlying representations, captured in terms of bundles of binary feature matrices, and surface forms, which were the output of the phonology. At this point in the derivation, a translation of binary values to scalar values yielded the phonetic transcription. They assumed a modular relationship between phonology and phonetics, where phonology was categorical, whereas phonetics was gradient and continuous. It was also assumed that phonology was the domain of the language-specific and phonetics the domain of universal (automatic) aspects of sound patterns. Research since that time has investigated this relationship from many angles, enriching the view of phonetics in the grammar and showing that the dichotomy between phonology and phonetics is not as
sharp as had been assumed (see Cohn, 1998, 2006a & b for discussion). We briefly review the nature of this relationship.

First, as discussed by Cohn (2006b), there are actually two distinct ways in which phonology and phonetics interact. A distinction needs to be drawn between the way phonology affects or drives phonetics (what Cohn terms phonology in phonetics) and the way that phonetics affects phonology (what Cohn terms phonetics in phonology). In the first, the nature of the correlation assumed by SPE (that phonology is discrete and categorical, while phonetics is continuous and gradient) is important. In the second, the place of naturalness, as internal or external to the grammar, is central. From both of these perspectives, we conclude that phonology and phonetics are distinct, albeit not as sharply delineated as implied by strictly modular models.

3.1. Phonology in Phonetics

Phonology is the cognitive organization of sounds as they constitute the building blocks of meaningful units in language. The physical realization of phonological contrast is a fundamental property of phonological systems, and thus phonological elements are physically realized in time. Phonology emerges in the phonetics, in the sense that phonological contrast is physically realized. This, then, is the first facet of the relationship between phonology and phonetics: the relationship between these cognitive elements and their physical realization. Implicit in the realization of phonology is the division between categorical vs. gradient effects: phonology captures contrast, which at the same time must be realized in time and space. This leads to the widely assumed correlations in (1).
(1) The relationship between phonology and phonetics:
    phonology = discrete, categorical ≠ phonetics = continuous, gradient
The correlations in (1) suggest the following relationships:
(2) a. Categorical phonology    b. Gradient phonology
    c. Categorical phonetics    d. Gradient phonetics
If the correlation between phonology and categoriality on the one hand and between phonetics and gradience on the other were perfect, we would expect there to be only categorical phonology (a) and gradient phonetics (d). There are reasons why the correlation might not be perfect, yet might nevertheless be strong enough to reinforce the view that phonology and phonetics are distinct. On the other hand, perhaps there is in fact nothing privileged about this correlation. In §3.2, we review the evidence for categorical phonology and gradient phonetics. We consider categorical phonetics and gradient phonology in §3.3.

3.2. Categorical phonology and gradient phonetics

A widely assumed modular view of grammar frames our modeling of more categorical and more gradient aspects of such phenomena as belonging to distinct modules (e.g. phonology vs. phonetics). We refer to this as a mapping approach. Following a mapping approach, categorical (steady state) patterns observed in the phonetics are understood to result from either lexical or phonological specification, while gradient patterns are understood to arise through the implementation of those specifications. In work growing out of Pierrehumbert’s (1980) study of English intonation, gradient phonetic patterns are understood as resulting from phonetic implementation. Under the particular view developed there, termed generative phonetics, these gradient patterns are the result of interpolation through phonologically unspecified domains. Keating (1988) and Cohn (1990) extend this approach to the segmental domain, arguing that phenomena such as long-distance pharyngealization and nasalization can be understood in these terms as well. Within generative phonetics, the account of gradience follows from a particular set of assumptions about specification and underspecification. It is generally assumed that categoriality in the phonology also follows directly from the nature of perception and the important role of categorical perception. The specific ways in which perception constrains or defines phonology are not well understood, although see Hume and Johnson (2001) for recent discussions of this relationship.

A modular mapping approach has been the dominant paradigm for the phonology-phonetics interface since the 1980s, and such approaches have greatly advanced our understanding of phonological patterns and their realization. The intuitive difference between more categorical and more
gradient patterns in the realization of sounds corresponds to the division of labor between phonology and phonetics within such approaches, and this division of labor has done quite a lot of work for us. Such results are seen most concretely in the success of many speech-synthesis-by-rule systems, in their modeling of both segmental and suprasegmental properties of sound systems (see Klatt, 1987 for a review).

A modular approach also accounts for the sense in which the phonetics, in effect, acts on the phonology. In many cases, phonological and phonetic effects are similar, but not identical. This is the fundamental character of what Cohn (1998) terms phonetic and phonological doublets: cases where there are parallel categorical and gradient effects in the same language, with independent evidence suggesting that the former are due to the phonology and the latter result from the implementation of the former. For example, this is seen in patterns of nasalization in several languages (Cohn, 1990); palatalization in English (Zsiga, 1995); vowel devoicing in Japanese (Tsuchida, 1997, 1998); as well as vowel-to-vowel coarticulation vs. vowel harmony, investigated by Beddor and Yavuz (1995) in Turkish and by Przezdziecki (2005) in Yoruba. (See Cohn, 2006b for fuller discussion of this point.) What these cases and many others have in common is that the patterns of coarticulation are similar to, but not the same as, assimilation, and that both patterns cooccur in the same language. The manifestations are different, with the more categorical effects observed in what we independently understand to be the domain of the phonology, and the more gradient ones in the phonetic implementation of the phonology. To document such differences, instrumental phonetic data are required, as impressionistic data alone do not offer the level of detail needed to make such determinations. Following a mapping approach, assimilation is accounted for in the phonological component and coarticulation in the phonetic implementation. Such approaches predict categorical phonology and gradient phonetics, but do they fully capture observed patterns? What about categorical phonetics and gradient phonology?

3.3. Categorical phonetics and gradient phonology

We understand categorical phonetics to be periods of stability in space through time. These result directly from certain discontinuities in the phonetics. This is precisely the fundamental insight of Stevens’s (1989)
Quantal Theory, in which he argues that humans, in their use of language, exploit articulatory regions that offer stability in terms of acoustic output.3 There are numerous examples of this in the phonetic literature. To mention just a few, consider Huffman’s (1990) articulatory landmarks in patterns of nasalization, Kingston’s (1990) coordination of laryngeal and supralaryngeal articulations (binding theory), and Keating’s (1990) analysis of the high jaw position in English /s/. There are many ways to model steady-state patterns within the phonetics without calling into question the basic assumptions of the dichotomous model of phonology and phonetics. To mention just one approach: within a target-interpolation model, phonetic targets can be assigned based on phonological specification as well as due to phonetic constraints or requirements. Such cases then do not really inform the debate about the gray area between phonology and phonetics.

The more interesting question is whether there is evidence for gradient phonology, that is, phonological patterns best characterized in terms of continuous variables. It is particularly evidence claiming that there is gradient phonology that has led some to question whether phonetics and phonology are distinct. The status of gradient phonology is a complex issue (for a fuller discussion see Cohn, 2006a). Cohn considers evidence for gradient phonology in the different aspects of what is understood to be phonology (contrast, phonotactics, morphophonemics, and allophony) and concludes that the answer depends in large part on what is meant by gradience and which aspects of the phonology are considered. The conclusions do suggest that strictly modular models involve an oversimplification. While modular models of sound systems have achieved tremendous results in the description and understanding of human language, strict modularity imposes divisions, since each and every pattern is defined as either X or Y (e.g., phonological or phonetic). Yet along any dimension that might have quite distinct endpoints, there is a gray area. For example, what is the status of vowel length before voiced sounds in English, bead [bi:d] vs. beat [bit]? The difference is greater than that observed in many other languages (Keating, 1985), but does it count as phonological?

An alternative to approaches that assume that phonology and phonetics are distinct, with a mapping between the two modules or domains, is offered by approaches that assume that phonology and phonetics are understood and modeled with the same formal mechanisms, what we term unidimensional approaches. A seminal approach in this regard is the theory of Articulatory Phonology, developed by Browman and
Goldstein (1992 and work cited therein), in which it is argued that both phonology and phonetics can be modeled with a unified formalism. This view does not exclude the possibility that there are aspects of what has been understood to be phonology and what has been understood to be phonetics that show distinct sets of properties or behavior. This approach has served as fertile ground for advancing our understanding of phonology as resulting at least in part from the coordination of articulatory gestures. More recently, a significant group of researchers working within constraint-based frameworks has pursued the view that there is no distinction between constraints that manipulate phonological categories and those that determine fine details of the representation. This is another type of approach that assumes no formally distinct representations or mechanisms for phonology and phonetics, often interpreted as arguing for the position that phonology and phonetics are one and the same thing.

The controversy here turns on the question of how much phonetics there is in phonology, that is, to what extent phonetic detail is present in phonological alternations and representations. Three main views have been developed in this respect: (i) phonetic detail is directly encoded in the phonology (e.g., Steriade, 2001; Flemming, 1995/2002, 2001; Kirchner, 1998/2001); (ii) phonetic detail (phonetic naturalness) is only relevant in the context of diachronic change (e.g., Ohala, 1981 and subsequent work; Hyman, 1976, 2001; Blevins, 2004); (iii) phonetic detail is indirectly reflected in phonological constraints, by virtue of phonetic grounding (e.g., Hayes, 1999; Hayes and Steriade, 2004). While there is general agreement on the fact that most phonological processes are natural, that is, “make sense” from the point of view of speech physiology, acoustics, and perception, the three views above are quite different in the way they conceptualize the relationship between phonetics and phonology and the source of the explanation. The first view proposes a unidimensional model, in which sound patterns can be accounted for directly by principles of production and perception. One argument in favor of unidimensional approaches is that they offer a direct account of naturalness in phonology, the second facet of the relationship (phonetics in phonology), a topic we will turn to in §4. Under the second view, the effect of naturalness on the phonological system is indirect. Under the third view, some phonological constraints are considered to be phonetically grounded, but formal symmetry plays a role in constraint
creation. The speaker/learner generalizes from experience in constructing phonetically grounded constraints. The link between the phonological system and phonetic grounding is phonetic knowledge (Kingston and Diehl, 1994). An adequate theory of phonology and phonetics, whether modular, unidimensional, or otherwise, needs to account for the relationship between phonological units and physical realities and for the ways in which phonetics acts on the phonology, as well as to offer an account of phonetics in phonology. We turn now to the nature of phonetics in phonology and the sources of naturalness.
4. Naturalness
In this section we consider different views of the source of naturalness in phonology (§4.1). We then present evidence bearing on this question (§4.2). The case we examine concerns patterns of consonant timing in Georgian stop clusters (Chitoran et al., 2002; Chitoran and Goldstein, 2006).

4.1. Sources of naturalness

Many understand naturalness to be part of phonology. The status of naturalness in phonology relates to early debates in generative phonology about natural phonology (Stampe, 1979; Donegan and Stampe, 1979). This view is also foundational to Optimality Theory (e.g. Prince and Smolensky, 2004), where functional explanations characterized in scalar and gradient terms are central in the definition of the family of markedness constraints. This position stands in explicit contrast to the view that “the principles that the rules subserve (the “laws”) are placed entirely outside the grammar”: “When the scalar and the gradient are recognized and brought within the purview of theory, Universal Grammar can supply the very substance from which grammars are built” (Prince and Smolensky, 2004:233-234). Under such approaches the explanations of naturalness are connected to the notion of markedness.

It is sometimes argued that explicit phonological accounts of naturalness pose a duplication problem: formal accounts in phonological terms (often attributed to Universal Grammar) parallel or mirror the phonetic roots of such developments, thus duplicating the phonetic source or historical
development driven by the phonetic source (see Przezdziecki, 2005 for recent discussion). We return to this point below.

Others understand naturalness to be expressed through diachronic change. This is essentially approach (ii), the view of Hyman (1976, 2001). Hyman (1976) offers an insightful historical understanding of this relationship through the process of phonologization, whereby phonetic effects can be enhanced and over time come to play a systematic role in the phonology of a particular language. Under this view, phonological naturalness results from the grammaticalization of low-level phonetic effects. While a particular pattern might be motivated historically as a natural change, it might be un-natural in its synchronic realization (see Hyman, 2001 for discussion). Phonetic motivation is also part of Blevins’s (2004) characterization of types of sound change. According to this view, only sound change is motivated by phonetic naturalness; synchronic phonology is not. A sound change which is phonetically motivated has consequences which may be exploited (phonologized) by synchronic phonology. Once phonologized, a sound change is subject to different principles, and naturalness becomes irrelevant (see also Anderson, 1981).

Hayes and Steriade (2004) propose an approach offering middle ground between these opposing views, worthy of close consideration. They argue that the link between the phonetic motivation and phonological patterns is due to individual speakers’ phonetic knowledge: “This shared knowledge leads learners to postulate independently similar constraints” (p. 1). They argue for a deductive approach to the investigation of markedness: “Deductive research on phonological markedness starts from the assumption that markedness laws obtain across languages not because they reflect structural properties of the language faculty, irreducible to non-linguistic factors, but rather because they stem from speakers’ shared knowledge of the factors that affect speech communication by impeding articulation, perception, or lexical access” (Hayes and Steriade, 2004:5).
This view relies on the Optimality Theoretic (OT) framework. Unlike rules, the formal characterization of an OT constraint may include its motivation, and thus offers a simple way of formalizing phonetic information in the grammar. Depending on the specific proposal, the constraints are evaluated either by strict domination or by weighting. Phonetically grounded constraints are phonetically “sensible”: they ban structures that are phonetically difficult and allow structures that are phonetically easy, thus relying heavily on the notion of “effort”. Such constraints are induced by speakers based on their knowledge of the physical conditions under which speech is
produced and perceived. Consequently, while constraints may be universal, they are not necessarily innate. To assess these different views, we consider some evidence.

4.2. Illustrating the source of naturalness and the nature of sound change

We present here some evidence supporting a view consistent with phonologization and with the role of phonetic knowledge as mediated by the grammar, rather than being directly encoded in it. We summarize a recent study regarding patterns of consonant timing in Georgian stop clusters. Consonant timing in Georgian stop clusters is affected by position in the word and by the order of place of articulation of the stops involved (Chitoran et al., 2002; Chitoran and Goldstein, 2006). Clusters in word-initial position are significantly less overlapped than those in word-internal position. Also, clusters with a back-to-front order of place of articulation (like gd, tp) are less overlapped than clusters with a front-to-back order (dg, pt).
(3) Georgian – word-initial clusters

    Front-to-back                  Back-to-front
    bgera     ‘sound’              g-ber-av-s  ‘fills you up’
    pʰtʰila   ‘hair lock’          tʰb-eb-a    ‘warms you’
    dg-eb-a   ‘stands up’          gd-eb-a     ‘to be thrown’
The authors initially attributed these differences to considerations of perceptual recoverability, but a subsequent study (Chitoran and Goldstein, 2006) showed that this explanation is not sufficient. Similar measures of overlap in clusters combining stops and liquids also show that back-to-front clusters (kl, rb) are less overlapped than front-to-back ones (pl, rk), even though in these combinations the stop release is no longer in danger of being obscured by a high degree of overlap, and liquids do not rely on their releases in order to be correctly perceived. The timing pattern observed in stop-liquid and liquid-stop clusters is therefore not motivated by perceptual recoverability, and the same explanation also seems less likely for the timing of stop-stop clusters. This suggests, in fact, that perceptual recoverability is not directly encoded in the phonology after all, but rather that the systematic differences observed in timing may be due to language-specific
coordination patterns, which can be phonologized, that is, learned as grammatical generalizations. Moreover, in addition to the front-to-back / back-to-front timing patterns, stop-stop clusters show overall an unexpectedly high degree of separation between gestures, more than needed to avoid obscuring the release burst. Some speakers even tend to insert an epenthetic vowel in back-to-front stop clusters, the ones with the most separated gestures. While this process of epenthesis is highly variable at the current stage of the language, it occurs only in the “naturally” less overlapped back-to-front clusters, suggesting a further step towards the phonologization of “natural” timing patterns in Georgian.

The insertion of epenthetic vowels could ultimately affect the phonotactics and syllable structure of Georgian. This would be a significant change, especially in the case of word-initial clusters. Word-initial clusters are systematically syllabified as tautosyllabic onset clusters by native speakers. The phonologization of the epenthetic vowels may lead to the loss of word-initial clusters from the surface phonology of the language, at least those with a back-to-front order of place of articulation. Although the presence of an epenthetic vowel is not currently affecting speakers’ syllabification intuitions, articulatory evidence shows that the syllable structure of Georgian is being affected in terms of articulatory organization. In a C1vC2V sequence with an epenthetic vowel, the two consonants are no longer relatively timed as an onset cluster; rather, C1 is timed as a single onset relative to the epenthetic vowel (Goldstein et al., 2007).

In the model recently developed by Browman & Goldstein (2000) and Goldstein et al. (2006), syllable structure emerges from the planning and control of stable patterns of relative timing among articulatory gestures. A hypothesis proposed in this model states that an onset consonant (CV) is coupled in-phase with the following vowel. If an onset consists of more than one consonant (CCV), each consonant should bear the same coupling relation to the vowel. This would result in two synchronous consonants, which would make one or the other unrecoverable. Since the order of consonants in an onset is linguistically relevant, it is further proposed that the consonants are also coupled to each other in anti-phase mode, that is, sequentially. The result is therefore a competitive coupling graph, combining the synchronous coupling of each consonant to the vowel with the sequential coupling of the consonants to each other. Goldstein et al. (2007) examined articulatory measures (using EMMA) which showed that in Georgian, as consonants are added to an onset (CV – CCV – CCCV), the time from the target of the rightmost C gesture to the target (i.e., the center) of the following vowel gesture gets shorter. In other words, the rightmost C shifts progressively to the right, closer to the vowel. This is the predicted consequence of the competitive coupling.
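A static caricature of this competitive coupling may help (our toy arithmetic, not the coupled-oscillator model itself, which is dynamical; the spacing d and the vowel-center time are invented values): if the mean of the onset consonants’ timings (the c-center) stays phase-locked to the vowel while adjacent consonants keep a fixed sequential separation, the rightmost consonant necessarily moves closer to the vowel as consonants are added:

```python
def rightmost_c_to_vowel(n_consonants, d=1.0, vowel_center=2.0):
    """The c-center (mean of the consonant timings) is held phase-locked
    to the vowel, here at time 0; the vowel's own center lies later, at
    vowel_center. Spacing the n consonants d apart around the c-center is
    the compromise between C-V in-phase and C-C anti-phase coupling."""
    rightmost = (n_consonants - 1) * d / 2.0   # timing of the last consonant
    return vowel_center - rightmost            # interval: rightmost C -> V center

for n, onset in ((1, "CV"), (2, "CCV"), (3, "CCCV")):
    print(f"{onset}: rightmost C to vowel center = {rightmost_c_to_vowel(n):.1f}")
# CV: 2.0, CCV: 1.5, CCCV: 1.0 -- the rightmost C shifts progressively
# rightward, toward the vowel, while the c-center stays fixed.
```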
In this study, two Georgian speakers produced the triplet rial-i ‘commotion’ – k’rial-i ‘glitter’ – ts’k’rial-a ‘shiny clean’. One of the speakers shows the rightward shift of the [r], as expected. This effect has previously been observed in English as well, where it is known as the ‘c-center’ effect (Browman and Goldstein, 1988; Byrd, 1996). The second speaker, however, did not show the shift in this set of data. This speaker produced an audible epenthetic vowel in the back-to-front sequence [k’r] in all forms. This suggests that [k’] and [r] do not form an onset cluster for this speaker, and in this case no rightward shift is predicted by the model. The rightward shift is absent from this speaker’s data because the competitive coupling is absent. Instead, [r] is coupled in-phase with the following [i], and [k’] is coupled in-phase with the epenthetic vowel.

The longer separation observed in Georgian back-to-front clusters may have been initially motivated by phonetic naturalness (perceptual recoverability in stop-stop clusters). But the generalization of this timing pattern to all back-to-front clusters, regardless of segmental composition, and the further development of epenthetic vowels in this context can no longer be attributed directly to the same phonetic cause. An appropriate conclusion to such facts is the phrase coined by Larry Hyman: “Diachrony proposes, synchrony disposes” (Hyman, 2005). Once phonologized, synchronic processes become subject to different factors; the study of phonetic naturalness is therefore relevant primarily within the context of diachronic change. Phonology is the intersection of phonetics and grammar (Hyman, 1976). The naturalness of phonetics (in our example, the reduced gestural overlap in back-to-front clusters) thus interacts with grammatical factors in such a way that the phonetic naturalness observable in phonology (the insertion of epenthetic vowels) is not the direct encoding of phonetic knowledge, but rather phonetic knowledge mediated by the principles of the grammar. This suggests that, as with the case of phonology in phonetics, here too phonetics and phonology are not reducible to one and the same thing. Processes may be natural in terms of their motivation; in terms of their effect they can be more categorical or more gradient. Studies such as the one outlined above suggest that examining phonetic variability, both within and across languages, may reveal additional facets of complexity, worthy
of investigation. This brings us back to the two facets of the relationship between phonology and phonetics. As discussed above, it is not the case that coarticulation and assimilation are the same thing, since these patterns are not identical and the coarticulatory effects are built on the phonological patterns of assimilation. It is thus an illusion to say that treating such patterns in parallel in the phonology and phonetics poses a duplication problem, as has been suggested by a number of researchers focusing on the source of naturalness in phonology. Rather, the parallel effects are due indirectly to the ways in which phonology is natural; they do not call for a direct account through a single vocabulary or mechanism. Thus we need to draw a distinction between the source of the explanation, where indeed at its root some factors may be the same (see Przezdziecki, 2005 for discussion), and the characterization of the patterns themselves, which are similar, but not the same. Since assimilation and coarticulation are distinct, an adequate model needs to account for both of them. The view taken here is that while assimilation might arise historically through the process of phonologization, there is ample evidence that the patterns of assimilation and coarticulation are not reducible to the same thing; thus we need to understand how the more categorical patterns and the more gradient patterns relate. In the following section we consider how the issues discussed so far relate to the question of the relevant units of representation.
5. The multi-faceted nature of complexity
Based on our discussion of the relationship between phonetics and phonology, it becomes increasingly clear that the notion of complexity in phonology must be a multi-faceted one. As the discussion in this chapter highlights, and as also proposed by Maddieson (this volume), Marsico et al. (2004) and subsequent work, different measures of complexity of phonological systems can be calculated at different levels of representation, notably features, segments, and syllables. The question of the relevant primary units is therefore not a trivial one, as it bears directly on the question of the relevant measure of complexity. Moreover, it brings to the forefront the triad formed by perception units – production units – units of representation. The following important questions then arise:
- in measuring complexity, do we need to consider all three members of the triad in their interrelationship, or is only one of the three relevant?
- does the understanding of the triad change depending on the primary categories chosen?

In this section we briefly formulate the questions that we consider relevant in this respect, and we provide background to start a discussion. We distinguish here between units at two levels: units at the level of cognitive representation, and units of perception. The fact that these two types of representations may or may not be isomorphic suggests that a relevant measure of complexity should not be restricted to only one or the other. We propose that the choice of an appropriate unit may depend on whether we are considering: (i) representations, (ii) sound systems, or (iii) sound patterns. For example, when considering exclusively sound systems, the segment or the feature has been shown to be appropriate (Lindblom and Maddieson, 1988; Maddieson, 2006; Marsico et al., 2004), but when considering the patterning of sounds within a system, a unit such as the gesture could be considered equally relevant.

The number of representation units proposed in the literature is quite large.4 So far, concrete measures of complexity have been proposed or at least considered for features, segments, and syllables. The most compelling evidence for units of perception has also been found for features, phonemes, and syllables (see Nguyen, 2005 for an overview). A clear consensus on a preferred unit of perception from among the three has not been reached so far. This suggests that all three may have a role to play. In fact, recent work by Grossberg (2003) and Goldinger and Azuma (2003) suggests that different types of units, of smaller and larger sizes, can be activated in parallel. Future experiments will reveal the way in which multiple units are needed in achieving an efficient communication process. If this is the case, then multiple units are likely to be relevant to computations of phonological complexity. Obviously, this question cannot be answered until a fuller understanding of the perception of the different proposed units has been reached. Although representation units other than features, segments, and syllables have been proposed in the literature, we will limit our discussion to this subset, which overlaps with that of plausible perception units.

The relevance of features for complexity has already been investigated. Marsico et al. (2004) compare measures of complexity based on different sets of not-so-abstract phonetic dimensions, for example features of the type “high”, “front”, “voiced”, etc. Distinctive features as such have not been considered in calculations of complexity, but their role has been investigated in a related measure, that of feature economy (Clements, 2003). The hypothesis based on feature economy predicts that languages tend to maximize the number of sounds in their inventories that use the same feature set, thus maximizing the combinatory possibilities of features. Clements’ thorough survey of the languages in the UPSID database confirms this hypothesis: speech sounds tend to be composed of features that are already used elsewhere in a given system. The finding that is most interesting relative to complexity is that feature economy is not a matter of the total number of features used per system, but rather of the number of segments sharing a given feature. This is interesting because feature economy can be seen as a measure of complexity at the feature level. Nevertheless, this measure makes direct reference to the segment, another unit of representation. This again brings up the possibility that more than one unit, at the same time, may be relevant for computations of phonological complexity.
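The intuition behind feature economy can be illustrated with a toy economy index along the lines of Clements’ segments-per-feature ratio (our simplified sketch, not Clements’ own procedure; the mini-inventories and feature assignments below are invented):

```python
# Toy inventories: each segment mapped to the feature values that
# distinguish it (invented assignments, for illustration only).
economical = {            # 6 segments re-using the same few features
    "p": {"labial", "stop"},      "b": {"labial", "stop", "voiced"},
    "t": {"coronal", "stop"},     "d": {"coronal", "stop", "voiced"},
    "k": {"dorsal", "stop"},      "g": {"dorsal", "stop", "voiced"},
}
uneconomical = {          # 3 segments, features barely shared
    "p": {"labial", "stop"},
    "s": {"coronal", "fricative"},
    "m": {"labial", "nasal"},
}

def economy_index(inventory):
    """Segments per feature: higher = each feature shared by more segments."""
    features = set().union(*inventory.values())
    return len(inventory) / len(features)

print(f"economical:   {economy_index(economical):.2f}")    # 6 segments / 5 features
print(f"uneconomical: {economy_index(uneconomical):.2f}")  # 3 segments / 5 features
```

Note that the index is defined over segments as well as features, which mirrors the point made above: the measure cannot be stated at the feature level alone.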
As pointed out by Pellegrino et al. (2007), the relevance of segments is hard to ignore. While the authors agree that the cognitive relevance of segments is still unclear, they ask: “if we give up the notion of segments, then what is the meaning of phonological inventories?” Thus, at least intuitively, segments cannot be excluded from these considerations. As discussed earlier, this is the level of unit used by Maddieson (1984), and it is the level at which many typological characterizations have been successfully made. More recent approaches to complexity have considered the third unit, the syllable: Maddieson (2007; this volume) has studied the possible correlations among syllable types, segment inventories, and tone contrasts. Other units have not yet been considered in the measure of complexity. Their relevance will depend in part on evidence found for their role in perception and cognitive representation. In addition to this aspect, we believe that relevant measures will also depend on the general context in which the interaction of these units is considered: sound inventories, or phonological systems including processes. Moreover, within processes, we expect that the measures will also differ depending on whether we are considering synchronic alternations or diachronic change. Finally, to return to the interaction between phonetics and phonology, the topic with which we started this paper, we believe that understanding phonological complexity may also require an understanding of the relevance of phonetic variation – for example the phoneme-allophone relation – for a measure of phonological complexity.
Notes

1. Authors’ translation.
2. See Jurafsky et al. (2001) for discussion of the role of predictability in language processing and production.
3. Pierrehumbert et al. (2000) make similar observations.
4. Here we only consider abstractionist models, acknowledging the importance of exemplar models (Johnson, 1997; Pierrehumbert, 2001, 2002, among others). At this point in the development of exemplar models the question of complexity has not been addressed, and it is not easy to tell what, in an exemplar model, could be included in a measure of complexity.
References

Anderson, S. 1981. Why phonology isn’t “natural”. Linguistic Inquiry 12: 493-539.
Beddor, P. and H. Yavuz. 1995. The relationship between vowel-to-vowel coarticulation and vowel harmony in Turkish. Proceedings of the 13th International Congress of Phonetic Sciences, volume 2, pp. 44-51.
Blevins, J. 2004. Evolutionary Phonology: The Emergence of Sound Patterns. Cambridge: Cambridge University Press.
Browman, C. and L. Goldstein. 1988. Some notes on syllable structure in articulatory phonology. Phonetica 45: 140-155.
Browman, C. and L. Goldstein. 1992. Articulatory Phonology: an overview. Phonetica 49: 155-180.
Browman, C. and L. Goldstein. 2000. Competing constraints on intergestural coordination and self-organization of phonological structures. Bulletin de la Communication Parlée 5: 25-34.
Byrd, D. 1996. Influences on articulatory timing in consonant sequences. Journal of Phonetics 24: 209-244.
Chitoran, I., L. Goldstein, and D. Byrd. 2002. Gestural overlap and recoverability: Articulatory evidence from Georgian. In C. Gussenhoven and N. Warner (eds.) Laboratory Phonology 7. Berlin: Mouton de Gruyter, pp. 419-448.
Chitoran, I. and L. Goldstein. 2006. Testing the phonological status of perceptual recoverability: Articulatory evidence from Georgian. Poster presented at Laboratory Phonology 10, Paris, France, June-July 2006.
Chomsky, N. and M. Halle. 1968. The Sound Pattern of English. New York, NY: Harper and Row.
Clements, G. N. 2003. Feature economy in sound systems. Phonology 20.3: 287-333.
Cohn, A. C. 1990. Phonetic and Phonological Rules of Nasalization. UCLA PhD dissertation. Distributed as UCLA Working Papers in Phonetics 76.
Cohn, A. C. 1998. The phonetics-phonology interface revisited: Where’s phonetics? Texas Linguistic Forum 41: 25-40.
Cohn, A. C. 2006a. Is there gradient phonology? In G. Fanselow, C. Fery, R. Vogel and M. Schlesewsky (eds.) Gradience in Grammar: Generative Perspectives. Oxford: OUP, pp. 25-44.
Cohn, A. C. 2006b. Phonetics in phonology and phonology in phonetics. To appear in Working Papers of the Cornell Phonetics Laboratory 16.
Coupé, C., E. Marsico, and F. Pellegrino. this volume. Structural complexity of phonological systems.
Donegan, P. and D. Stampe. 1979. The study of natural phonology. In D. A. Dinnsen (ed.) Current Approaches to Phonological Theory. Bloomington: Indiana University Press.
Flemming, E. 2001. Scalar and categorical phenomena in a unified model of phonetics and phonology. Phonology 18: 7-44.
Flemming, E. 2002. Auditory Representations in Phonology. New York, NY: Routledge. [Revised version of 1995 UCLA Ph.D. dissertation.]
Givón, T. 1991. Markedness in grammar: Distributional, communicative and cognitive correlates of syntactic structure. Studies in Language 15.2: 335-370.
Goldinger, S. and T. Azuma. 2003. Puzzle-solving science: the quixotic quest for units in speech perception. Journal of Phonetics 31: 305-320.
Goldstein, L., D. Byrd, and E. Saltzman. 2006. The role of vocal tract gestural action units in understanding the evolution of phonology. In M. Arbib (ed.) From Action to Language: The Mirror Neuron System. Cambridge: Cambridge University Press, pp. 215-249.
Goldstein, L., I. Chitoran, and E. Selkirk. 2007. Syllable structure as coupled oscillator modes: Evidence from Georgian vs. Tashlhiyt Berber. Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, Germany, August 6-10, 2007, pp. 241-244.
Grossberg, S. 2003. Resonant neural dynamics of speech perception. Journal of Phonetics 31: 423-445.
Haspelmath, M. 2006. Against markedness (and what to replace it with). Journal of Linguistics 42: 25-70.
Haspelmath, M. to appear. Frequency vs. iconicity in explaining grammatical asymmetries. Cognitive Linguistics 18.4.
Hayes, B. 1999. Phonetically-driven phonology: The role of optimality theory and inductive grounding. In M. Darnell, E. Moravcsik, M. Noonan, F. Newmeyer, and K. Wheatley (eds.) Functionalism and Formalism in Linguistics, Volume I: General Papers. Amsterdam: John Benjamins, pp. 243-285.
Hayes, B. and D. Steriade. 2004. Introduction: the phonetic bases of phonological markedness. In B. Hayes, R. Kirchner, and D. Steriade (eds.) Phonetically Based Phonology. Cambridge: CUP, pp. 1-33.
Huffman, M. 1990. Implementation of Nasal: Timing and Articulatory Landmarks. UCLA PhD dissertation. Distributed as UCLA Working Papers in Phonetics 75.
Hume, E. 2004. Deconstructing markedness: A predictability-based approach. To appear in Proceedings of the Berkeley Linguistics Society 30.
Hume, E. 2006. Language Specific and Universal Markedness: An Information-Theoretic Approach. Paper presented at the Linguistic Society of America Annual Meeting, Colloquium on Information Theory and Phonology, Jan. 7, 2006.
Hume, E. and K. Johnson (eds.). 2001. The Role of Speech Perception in Phonology. San Diego: Academic Press.
Hyman, L. 1976. Phonologization. In A. Juilland (ed.) Linguistic Studies Offered to Joseph Greenberg, Vol. 2. Saratoga: Anma Libri, pp. 407-418.
Hyman, L. 2001. The limits of phonetic determinism in phonology: *NC revisited. In E. Hume and K. Johnson (eds.) The Role of Speech Perception in Phonology. San Diego: Academic Press, pp. 141-185.
Hyman, L. 2005. Diachrony proposes, synchrony disposes – Evidence from prosody (and chess). Talk given at the Laboratoire Dynamique du Langage, Université de Lyon 2, Lyon, March 25, 2005.
Jakobson, R. 1971. Les lois phoniques du langage enfantin et leur place dans la phonologie générale. In Roman Jakobson: Selected Writings, vol. 1, Phonological Studies, 2nd expanded edition. The Hague: Mouton, pp. 317-327.
Johnson, K. 1997. Speech perception without speaker normalization. In K. Johnson and J. Mullennix (eds.) Talker Variability in Speech Processing. San Diego: Academic Press, pp. 146-165.
Jurafsky, D., A. Bell, M. Gregory and W. D. Raymond. 2001. Probabilistic relations between words: Evidence from reduction in lexical production. In J. Bybee and P. Hopper (eds.) Frequency and the Emergence of Linguistic Structure. Amsterdam: John Benjamins, pp. 229-254.
Keating, P. 1985. Universal phonetics and the organization of grammars. In V. Fromkin (ed.) Phonetic Linguistics: Essays in Honor of Peter Ladefoged. Orlando: Academic Press, pp. 115-132.
Keating, P. 1988. The window model of coarticulation: articulatory evidence. UCLA Working Papers in Phonetics 69: 3-29.
Keating, P. 1990. The window model of coarticulation: articulatory evidence. In J. Kingston and M. Beckman (eds.) Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge: CUP, pp. 451-470.
Kingston, J. 1990. Articulatory binding. In J. Kingston and M. Beckman (eds.) Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge: CUP, pp. 406-434.
Kingston, J. and R. Diehl. 1994. Phonetic knowledge. Language 70: 419-453.
Kirchner, R. 1998/2001. An Effort-Based Approach to Consonant Lenition. New York, NY: Routledge. [1998 UCLA Ph.D. dissertation.]
Klatt, D. 1987. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82.3: 737-793.
Lehmann, C. 1974. Isomorphismus im sprachlichen Zeichen. In H. Seiler (ed.) Linguistic Workshop II: Arbeiten des Kölner Universalienprojekts 1973/4. München: Fink (Structura, 8), pp. 98-123.
Lindblom, B. and I. Maddieson. 1988. Phonetic universals in consonant systems. In C. N. Li and L. M. Hyman (eds.) Language, Speech and Mind: Studies in Honor of Victoria A. Fromkin. Beckenham: Croom Helm, pp. 62-78.
Maddieson, I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press.
Maddieson, I. 2006. Correlating phonological complexity: Data and validation. Linguistic Typology 10: 108-125.
Maddieson, I. 2007. Issues of phonological complexity: Statistical analysis of the relationship between syllable structures, segment inventories and tone contrasts. In M.-J. Solé, P. S. Beddor, and M. Ohala (eds.) Experimental Approaches to Phonology. Oxford and New York: Oxford University Press.
Maddieson, I. this volume. Calculating phonological complexity.
Marsico, E., I. Maddieson, C. Coupé, and F. Pellegrino. 2004. Investigating the “hidden” structure of phonological systems. Proceedings of the Berkeley Linguistics Society 30: 256-267.
Newmeyer, F. 1992. Iconicity and generative grammar. Language 68: 756-796.
Nguyen, N. 2005. La perception de la parole. In N. Nguyen, S. Wauquier-Gravelines, and J. Durand (eds.) Phonologie et phonétique: Forme et substance. Paris: Hermès, pp. 425-447.
Ohala, J. J. 1981. The listener as a source of sound change. In C. Masek, R. Hendrick, and M. Miller (eds.) Papers from the Parasession on Language and Behavior. Chicago: CLS, pp. 178-203.
Pellegrino, F., C. Coupé, and E. Marsico. 2007. An information theory-based approach to the balance of complexity between phonetics, phonology and morphosyntax. Paper presented at the Linguistic Society of America, LSA Complexity Panel, January 2007.
Pierrehumbert, J. 1980. The Phonology and Phonetics of English Intonation. MIT Ph.D. dissertation.
Pierrehumbert, J. 2001. Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee and P. Hopper (eds.) Frequency and the Emergence of Linguistic Structure. Amsterdam: John Benjamins, pp. 137-157.
Pierrehumbert, J. 2002. Word-specific phonetics. In C. Gussenhoven and N. Warner (eds.) Laboratory Phonology 7. Berlin: Mouton de Gruyter, pp. 101-139.
Pierrehumbert, J., M. Beckman, and D. R. Ladd. 2000. Conceptual foundations in phonology as a laboratory science. In N. Burton-Roberts, P. Carr and G. Docherty (eds.) Phonological Knowledge: Conceptual and Empirical Issues. New York: Oxford University Press, pp. 273-304.
Pouplier, M. 2003. Units of Phonological Encoding: Empirical Evidence. PhD dissertation, Yale University.
Prince, A. and P. Smolensky. 2004. Optimality Theory: Constraint Interaction in Generative Grammar. Malden, MA: Blackwell.
Przezdziecki, M. 2005. Vowel Harmony and Coarticulation in Three Dialects of Yorùbá: Phonetics Determining Phonology. Cornell University PhD dissertation.
Stampe, D. 1979. A Dissertation in Natural Phonology. New York: Garland Press. [1973 University of Chicago Ph.D. dissertation.]
Steriade, D. 2001. Directional asymmetries in assimilation: A perceptual account. In E. Hume and K. Johnson (eds.) The Role of Speech Perception in Phonology. San Diego: Academic Press, pp. 219-250.
Stevens, K. 1989. On the quantal nature of speech. Journal of Phonetics 17: 3-45.
Trubetzkoy, N. S. 1939. Grundzüge der Phonologie. Publié avec l’appui du Cercle Linguistique de Copenhague et du Ministère de l’instruction publique de la République Tchéco-slovaque. Prague.
Trubetzkoy, N. S. 1969. Principles of Phonology [Grundzüge der Phonologie], translated by C. Baltaxe. Berkeley: University of California Press.
Tsuchida, A. 1997. Phonetics and Phonology of Japanese Vowel Devoicing. Cornell University PhD dissertation.
Tsuchida, A. 1998. Phonetic and phonological vowel devoicing in Japanese. Texas Linguistic Forum 41: 173-188.
Vallée, N. 1994. Systèmes vocaliques: de la typologie aux prédictions. PhD dissertation, Université Stendhal, Grenoble.
Vallée, N., L.-J. Boë, J.-L. Schwartz, P. Badin, and C. Abry. 2002. The weight of phonetic substance in the structure of sound inventories. ZAS Papers in Linguistics 28: 145-168.
Zsiga, E. C. 1995. An acoustic and electropalatographic study of lexical and postlexical palatalization in American English. In B. Connell and A. Arvaniti (eds.) Phonology and Phonetic Evidence: Papers in Laboratory Phonology IV. Cambridge: Cambridge University Press, pp. 282-302.
Languages’ sound inventories: the devil in the details

John J. Ohala

1. Introduction

In this paper I am going to modify somewhat a statement made in Ohala (1980) regarding languages’ speech sound inventories exhibiting the ‘maximum use of a set of distinctive features’. In that paper, after noting that vowel systems seem to conform to the principle of maximal acoustic-perceptual differentiation (as proposed earlier by Björn Lindblom), I observe:

... it would be most satisfying if we could apply the same principles to predict the arrangement of consonants, i.e., posit an acoustic-auditory space and show how the consonants position themselves so as to maximize the inter-consonantal distance. Were we to attempt this, we should undoubtedly reach the patently false prediction that a 7 consonant system should include something like the following: , k’, ts, , m, r, . Languages which do have few consonants, such as the Polynesian languages, do not have such an exotic inventory. In fact, the languages which do possess the above set (or close to it), such as Zulu, also have a great many other consonants of each type, i.e., ejectives, clicks, affricates, etc. Rather than maximum differentiation of the entities in the consonant space, we seem to find something approximating the principle which would be characterized as “maximum utilization of the available distinctive features”. This has the result that many of the consonants are, in fact, perceptually quite close – differing by a minimum, not a maximum, number of distinctive features.1
Looking at moderately large to quite large segment inventories like those of English, French, Hindi, Zulu, and Thai, this is exactly the case. Many segments are phonetically similar and as a consequence are confusable. Some data showing relatively high rates of confusion of certain CV syllables (presented in isolation, under high-fidelity listening conditions), from Winitz et al. (1972), are given in Table 1.
Table 1. Confusion matrix from Winitz et al. (1972). Spoken syllables consisted of stop burst plus 100 msec of following transition and vowel; high-fidelity listening conditions. Numbers given are the incidence of the specified response to the specified stimulus.
               Stimulus:
               /pi/  /pa/  /pu/  /ti/  /ta/  /tu/  /ki/  /ka/  /ku/
Response: /p/  .46   .83   .68   .03   .15   .10   .15   .11   .24
          /t/  .38   .07   .10   .88   .63   .80   .47   .20   .18
          /k/  .17   .11   .23   .09   .22   .11   .38   .70   .58
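Transcribing the Table 1 values into a small script (ours; only the numbers come from Winitz et al.) makes it easy to see where the confusions concentrate:

```python
# P(response | stimulus), transcribed by hand from Table 1
# (in the table, columns are stimuli and rows are responses).
confusions = {
    "pi": {"p": .46, "t": .38, "k": .17}, "pa": {"p": .83, "t": .07, "k": .11},
    "pu": {"p": .68, "t": .10, "k": .23}, "ti": {"p": .03, "t": .88, "k": .09},
    "ta": {"p": .15, "t": .63, "k": .22}, "tu": {"p": .10, "t": .80, "k": .11},
    "ki": {"p": .15, "t": .47, "k": .38}, "ka": {"p": .11, "t": .20, "k": .70},
    "ku": {"p": .24, "t": .18, "k": .58},
}

for syll, resp in confusions.items():
    correct = resp[syll[0]]                                   # matching stop
    worst = max((r for r in resp if r != syll[0]), key=resp.get)
    print(f"/{syll}/: correct {correct:.2f}, top confusion -> /{worst}/ {resp[worst]:.2f}")
# /pi/ and /ki/ stand out: /pi/ is heard as /t/ 38% of the time, and /ki/
# as /t/ 47% of the time -- more often than it is heard correctly (.38).
```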
I do actually believe that the degree of auditory distinctiveness plays some role in shaping languages’ segment inventories, especially when auditory distinctiveness is low. Sound change, acting blindly (i.e., non-teleologically), weeds out similar-sounding elements through confusion, which results in mergers and loss. The loss in some dialects of English of /θ/ and /ð/ and their merger with either /f/ and /v/ (respectively) or with /t/ and /d/ (respectively) is a probable example. I also believe it is sound change, again acting blindly, which is largely responsible for the introduction of new series of segments which involve re-use of some pre-existing features. In some cases there is historical evidence of this. Proto-Indo-European had only three series of stops: voiced, voiceless, and breathy-voiced (i.e., among labials: /b/, /p/, /bʱ/). The voiceless aspirated series, /pʰ/, exemplified in Sanskrit and retained in many of the modern Indo-Aryan languages (like Hindi), developed by sound change from the (simple) voiceless series. And a fifth (!) series of stops, the voiced implosives /ɓ, ɗ, etc./ in Sindhi, developed from geminated versions of the (simple) voiced stops. Similarly, we know that the nasal vowels in French and Hindi developed out of the pre-existing oral vowels plus a following nasal consonant (with the nasal consonant lost). (E.g., French saint [sɛ̃] < Latin sanctus “holy”; Hindi dant “tooth” [dãt] < IE dont-, dent- “tooth”.) It is also relevant to my case that historically French once had as many nasal as oral vowels and then over the centuries reduced the nasal vowel inventory due to, I have argued, auditory similarity (Ohala and Ohala, 1993).
But the point that I want to revise, or distance myself from somewhat, is the idea that re-use of distinctive features always results in a cost-minimal augmentation, vis-à-vis the introduction of segments that are distinguished by virtually all-new distinctive features. I suppose the basic message I am emphasizing here is that the apparent symmetry found in many languages’ segment inventories (or possibly the symmetry imposed by the analyst who put segments in matrices where all rows and columns are uniformly filled) obscures a more complicated situation. There is a great deal of what is referred to as allophonic variation, usually lawful contextual variation. What this means is that the neat symmetrical matrices of speech sound inventories are really abstractions. The complications – the devilish details referred to in my title – have been ‘swept under the rug’! Can we ignore this variation when speculating about common cross-language tendencies in the form of languages’ segment inventories? I say ‘no’, since in many cases the same principles are at work whether they lead to apparent symmetry or asymmetry, and I’ll give some examples in the following sections.
2. Some examples of devilish details

2.1. [p] is weaker than other voiceless stops

Among languages that have both voiced and voiceless stops, the voiceless bilabial [p] is occasionally missing, e.g., in Berber, and this gap is much more common than a gap at any other place of articulation among voiceless stop series.2 In Japanese the /p/ has a distribution unlike other voiceless stops: it doesn’t occur in word-initial position except in onomatopoeic vocabulary (e.g., /patapatʃa/ ‘splash’), or medially as a geminate (e.g., /kapːa/ ‘cucumber sushi’), or in a few other medial environments. Phonetically, in English and many other languages the burst of the /p/ has the lowest intensity of any of the voiceless stops. The reason, of course, is that there is no downstream resonator to amplify the burst. We should see that the latter phonetic fact is the unifying principle underlying all these patterns. (And this is, in part, the reason why the sequence [pi] is confused with [ti], as documented in Table 1.)
2.2. Voicing in stops and place of articulation

Among voiced stops, the velar [g] is often missing from a language’s stop inventory even though the language may have a voicing contrast in stops articulated at more forward places of articulation, e.g., in Thai, Dutch, and Czech (in native vocabulary). In some languages, morphophonemic variations involving the gemination of voiced stops show an asymmetry in their behavior depending on how far front or back the stop is articulated. E.g., in Nubian (see Table 2), the geminate bilabial stop retains its voicing; those made further back become voiceless.

Table 2. Morphophonemic variation in Nubian (from Bell, 1971)

Noun stem    Stem + “and”    English gloss
/fab/        /fabːɔn/        father
/seged/      /segetːɔn/      scorpion
/kaɟ/        /kacːɔn/        donkey
/mug/        /mukːɔn/        dog
The usual description of the allophonic variation of "voiced" stops in English (/b d ɡ/) is that they are voiceless unaspirated in word-initial position but voiced between sonorants. In my speech, however, and that of another male native speaker of American English, I have found that /ɡ/ is voiceless even intervocalically. See Figure 1 (subject DM), which gives the waveform and accompanying pharyngeal pressure (sampled with a thin catheter inserted via the nasal cavity). The utterance, targeted as /əˈɡɑ/, is manifested as [əˈkɑ]. However, as shown in Figure 2, when the pharyngeal pressure was artificially lowered (by suction applied via a second catheter inserted in the other side of the nasal cavity), the /ɡ/ was voiced!
Figure 1. Acoustic waveform (top) and pharyngeal pressure (bottom) of the utterance /əˈɡɑ/ spoken by subject DM, a male native speaker of American English. Condition: no venting of pharyngeal pressure. Phonetically the realization of the intervocalic stop was voiceless.
Figure 2. Acoustic waveform (top) and pharyngeal pressure (bottom) of the utterance /əˈɡɑ/ spoken by subject DM, a male native speaker of American English. Condition: artificial venting of pharyngeal pressure. Phonetically the realization of the intervocalic stop was voiced (evident in the pressure signal).
All of these patterns, from the absence of [ɡ] in Thai to the voiceless realization of /ɡ/ intervocalically in American English speakers, are manifestations of the same universal aerodynamic principle: the possibility of voicing during stops requires a substantial pressure drop across the glottis, and this depends partly on the volume of the cavity between the point of articulation and the larynx and, more importantly, on the possibility of passive expansion of that cavity in order to 'make room' for the incoming airflow. Velars and other back-articulated stops have less capacity to accommodate the incoming airflow, and so voicing is threatened.
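This aerodynamic principle lends itself to a back-of-the-envelope simulation. The sketch below is not from Ohala: it is a minimal one-compartment model assuming an orifice-flow glottis and compliant cavity walls, and every constant (subglottal pressure, voicing threshold, glottal area, cavity volumes, wall compliances) is an illustrative order-of-magnitude assumption. It nevertheless reproduces the qualitative asymmetry: the small, stiff cavity behind a velar closure fills quickly and extinguishes voicing much sooner than the large, compliant cavity behind a bilabial closure.

```python
# Toy aerodynamic model of voicing during a stop closure (illustrative only).
import math

RHO = 1.2         # air density, kg/m^3
P_ATM = 101325.0  # atmospheric pressure, Pa
PS = 800.0        # subglottal pressure, Pa (~8 cm H2O, assumed)
DP_MIN = 200.0    # assumed minimum transglottal pressure drop for voicing, Pa
A_G = 5e-6        # assumed time-averaged glottal area during voicing, m^2

def voicing_duration(v0, compliance, dt=1e-4, t_max=0.2):
    """Time until oral pressure Po rises to within DP_MIN of Ps.
    v0: cavity volume behind the closure (m^3); compliance: passive
    wall expansion (m^3/Pa). Simple Euler integration of mass flow."""
    po, t = 0.0, 0.0
    while t < t_max:
        dp = PS - po
        if dp < DP_MIN:              # transglottal drop too small: voicing dies
            return t
        u = A_G * math.sqrt(2 * dp / RHO)   # orifice flow through the glottis
        # air compressibility and wall yielding both absorb the incoming air
        po += P_ATM * u / (v0 + P_ATM * compliance) * dt
        t += dt
    return t_max                      # voicing survives the whole closure

# illustrative settings: large, compliant tract behind [b]; small, stiff behind [g]
print("bilabial:", voicing_duration(70e-6, 2e-8))   # tens of milliseconds
print("velar:   ", voicing_duration(25e-6, 3e-9))   # an order of magnitude less
```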
3. On the various cues for obstruent "voicing" in English

Lisker (1986) listed several features, in addition to presence/absence of voicing or differences in VOT, by which the so-called 'voicing distinction' in English obstruents is differentiated perceptually.

3.1. F0 perturbations on vowels following stops

The vowels immediately after voiced and voiceless obstruents show a systematic F0 variation. Figure 3 shows data from Hombert et al. (1979). These curves represent unnormalized averages of 100 msec of the F0 contours following /b d g/ (lower curve) and /p t k/ (upper curve) from 5 speakers of American English. Each curve is the average of 150 tokens. (Given that /p t k/ have a positive VOT whereas /b d g/ have VOT close to zero, the onsets of the curves are phase-shifted with respect to the moment of stop release.)3 Such F0 differences can be explained as mechanically caused, due to differences in vocal cord state; i.e., they seem not to be purposeful on the part of speakers (see Ohala et al., 2004). Nevertheless, Fujimura (1971) has presented evidence that such F0 contours are used by native speakers of English to differentiate this contrast when all other cues have been neutralized. Does this mean that English is a tone language? We would probably answer 'no', since the speaker doesn't have to separately produce and control the tension of the laryngeal muscles to implement these F0 differences. So it is English listeners, if not the speakers, that have the added complexity in their perceptual task of recognizing F0 differences, just as native speakers of tone languages do. It is not much of a simplification of the sound system of a language if the language users (in their role as
listeners) have to have skill in categorical recognition of short-term F0 contours in addition to recognizing voicing itself or VOT differences.
Figure 3. Average fundamental frequency values of vowels following English stops (data from five speakers). The curves labelled [p] and [b] represent the values associated with all voiceless and voiced stops, respectively – regardless of place of articulation. The zero point on the abscissa represents the moment of voice onset; with respect to stop release, this occurs later in real time in voiceless aspirated stops (from Hombert et al., 1979).
3.2. Secondary cues to voicing in coda obstruents

As is well known, the classes of supposedly voiced and voiceless obstruents in coda position are reliably differentiated by vowel duration: the vowel is longer before 'voiced' obstruents than before voiceless ones (by ratios of up to 3:2). Since this ratio is so large and there is no apparent “mechanical”
cause of this difference, as Lisker (1974) concluded, this means that in this case both speaker and listener have to have distinctive vowel length in their grammars.

3.3. Vowel-influenced variations in VOT

Several studies have shown that the positive VOT of the voiceless aspirated stops in English shows vowel-specific variations (Lisker and Abramson 1964, 1967; Ohala 1981a): VOT is longer before (actually, when the stop is coarticulated with) high close vowels than before open vowels. These variations are probably an automatic consequence of differences in the degree of aerodynamic resistance to the exiting airflow. The higher resistance offered by the close vowels delays the decay of Po and thus the onset of voicing. Figure 4 (from Ohala, 1981a) gives data from English and Japanese (from unpublished studies by Robert Gaskins and Mary Beckman, respectively). The English data provide further evidence of the tendency of back-articulated "voiced" stops to be voiceless, since here even the so-called "voiced" velar has a positive VOT. Lisker and Abramson (1967) have shown that listeners are sensitive to these vowel-specific variations: crossover points in the identification of the two categories of stops when presented in a VOT continuum also vary with the quality of the following vowel. This phonetic detail therefore must be part of the English-speaking listener's knowledge about the sound pattern of the language. I could add many other examples where there are numerous acoustic features characteristic of specific consonant-vowel sequences, or at least of specific classes of sounds in the context of other specific classes. The net result of this is to add complexity to the signaling system of language that goes beyond what is implied by simply adding another row or column to the language's phoneme inventory.
Figure 4. VOT variation for stops as a function of following vowel in English (a) and Japanese (b) (from Ohala, 1981a).
4. Conclusion

If we conceive our task as phonologists as one of characterizing and understanding the function of speech as a medium of communication, then we want to know the implications for this function of the differences between the phonological system of, say, Rotokas with its 11 phonemes and !Xu with its 141 phonemes. Merely listing the segmental inventories in the traditional articulatory matrix does not tell the whole story. Adding new columns or rows can complicate the task of the language's speakers. The evidence that experimental phonetics has uncovered about so-called 'secondary
distinctive features' in virtually every language whose sound system has been studied in some detail, especially the findings that these features may be different in different contexts, makes it clear that a language's phonological complexity is itself a complex issue. In the model of sound change that I have proposed (Ohala 1981b, 1993) the so-called 'secondary' features are very important for understanding why change takes place and why it takes a particular direction. A feature that was secondary can become one of the primary distinctive features of a phonological contrast if the primary feature(s) are not detected or are misinterpreted. The important point in this model is that some of the elements of the "after" state were already present in the "before" state, without being explicitly listed in the inventory. What type of (re)presentation would correct the deficiencies found in the familiar matrices of languages' segment inventories that, I have argued, reduce their utility? To begin with, to the extent that the detailed phonetic character of sounds has been uncovered through empirical means, segment inventories should show it. E.g., when presenting the English stops, rather than labeling them as [voiced] and [voiceless], they should be given more accurate labels: [voiceless unaspirated] and [voiceless aspirated], respectively. Another example: if certain dialects of Swedish have an alveolar or apical-retracted voiced stop vs. a dental voiceless stop (Livijn and Engstrand, 2001), then the two should not appear in the same "place of articulation" column. Many other languages show the same pattern, i.e., the voiced coronal stop has a place of articulation that is more posterior than that of its voiceless "counterpart" (Sprouse, Solé and Ohala, 2008). Of course, systematic contextual variation should also be noted, e.g., that the voiced apical stop in English is often realized as a tap [ɾ] in some contexts. Also, an elaborated but much more useful account of the contrasts in a language should reflect in some way whether there is any evidence of differences in the cognitive tasks/abilities required of speakers vs. listeners in processing allophonic differences due to phonetic context. For example, as mentioned above, the slight differences in VOT before different vowels (close vowels showing slightly longer VOT than more open vowels) are probably not something the speaker has to implement purposefully; they occur due to physical constraints (the greater impedance to the exiting airflow delays the achievement of the transglottal pressure difference necessary to initiate voicing). Nevertheless, listeners are no doubt aware of these differences and use them when identifying pre-vocalic stops. Thus the "knowledge" of the speaker and hearer
about contextual variation may be different and the “compleat” psychologically valid phonological grammar of a language should reflect this. Thus we need to pay more attention to the devilish details in the implementation of phonological contrasts. It may help us to understand better both sound change and the communicative function of speech.
Notes

1. The IPA symbols in this quote conform to current conventions, not those in 1980.
2. For a survey of gaps in consonant inventories, see Sherman (1975).
3. See also Ohala (1974) for similar data on F0 following the release of /s/ and /z/.
References

Bell, H. 1971 The phonology of Nobiin Nubian. African Language Review 9: 115-139.
Fujimura, O. 1971 Remarks on stop consonants: synthesis experiments and acoustic cues. In: L. L. Hammerich, R. Jakobson and E. Zwirner (eds.), Form and Substance: Phonetic and Linguistic Papers Presented to Eli Fischer-Jørgensen. Copenhagen: Akademisk Forlag. pp. 221-232.
Hombert, J.-M., Ohala, J. J. & Ewan, W. G. 1979 Phonetic explanations for the development of tones. Language 55: 37-58.
Lisker, L.
1974 On 'explaining' vowel duration variation. Glossa 8: 233-246.
1986 'Voicing' in English: A catalogue of acoustic features signaling /b/ vs. /p/ in trochees. Language and Speech 29: 3-11.
Lisker, L. & Abramson, A.
1964 A cross-language study of voicing in initial stops: Acoustical measurements. Word 20: 384-422.
1967 Some effects of context on voice onset time in English stops. Language and Speech 10: 1-28.
Livijn, P. & Engstrand, O. 2001 Place of articulation for coronals in some Swedish dialects. In: Proceedings of Fonetik 2001, the XIVth Swedish Phonetics Conference, Örenäs, May 30 - June 1, 2001. Working Papers, Department of Linguistics, Lund University 49: 112-115.
Ohala, J. J.
1974 Experimental historical phonology. In: J. M. Anderson & C. Jones (eds.), Historical Linguistics II, Theory and Description in Phonology: Proceedings of the 1st International Conference on Historical Linguistics, Edinburgh. Amsterdam: North-Holland. pp. 353-389.
1980 Moderator's summary of symposium on 'Phonetic universals in phonological systems and their explanation.' Proceedings of the 9th International Congress of Phonetic Sciences, Vol. 3. Copenhagen: Institute of Phonetics. pp. 181-194.
1981a Articulatory constraints on the cognitive representation of speech. In: T. Myers, J. Laver and J. Anderson (eds.), The Cognitive Representation of Speech. Amsterdam: North-Holland. pp. 111-122.
1981b The listener as a source of sound change. In: C. S. Masek, R. A. Hendrick & M. F. Miller (eds.), Papers from the Parasession on Language and Behavior. Chicago: Chicago Linguistic Society. pp. 178-203.
1993 The phonetics of sound change. In: Charles Jones (ed.), Historical Linguistics: Problems and Perspectives. London: Longman. pp. 237-278.
Ohala, J. J., Dunn, A. & Sprouse, R. 2004 Prosody and phonology. Speech Prosody 2004, Nara, Japan.
Ohala, J. J. & Ohala, M. 1993 The phonetics of nasal phonology: theorems and data. In: M. K. Huffman & R. A. Krakow (eds.), Nasals, Nasalization, and the Velum. San Diego, CA: Academic Press. pp. 225-249.
Sherman, D. 1975 Stop and fricative systems: a discussion of paradigmatic gaps and the question of language sampling. Stanford Working Papers in Language Universals 17: 1-31.
Sprouse, R., Solé, M.-J. & Ohala, J. 2008 Oral cavity enlargement in retroflex sounds. Paper delivered at the 8th International Seminar on Speech Production, Strasbourg.
Winitz, H., Scheib, M. E. & Reeds, J. A. 1972 Identification of stops and vowels. Journal of the Acoustical Society of America 51: 1309-1317.
Signal dynamics in the production and perception of vowels

René Carré

Vowels can be produced with static articulatory configurations represented by dots in acoustic space (generally by formant frequencies in the F1-F2 plane). But because vowel characteristics vary with speaker, consonantal environment (coarticulation) and production rate (reduction phenomena), vowel formant frequencies can also be represented by their mean values and standard deviations, according to different categories (language, age and gender of speaker). The use of 'targets' means that vowels are generally studied from a static point of view. But several questions can be raised: How are vowel representations set up if vowel realizations rarely reach their targets in running speech production (vowel reduction; Lindblom, 1963)? Is representation the same from one person to another? How is a given vowel, produced several times with different acoustic characteristics and in different environments, identified? By using contextual information? By normalization? These questions lead to studying vowels from a dynamic point of view. Here, we first propose a theoretical deductive approach to vowel-to-vowel dynamics which leads to a specification in terms of vocalic trajectories in the acoustic space, characterized by their directions. Then, results on V1V2 transitions produced and perceived by subjects are presented. In production, measurements of the F1 and F2 transition rates are represented in the F1 rate/F2 rate plane. In perception, the direction and rate of synthesized transitions are studied for transitions situated outside the traditional F1/F2 vowel triangle. This situation enables the study of transitions characterized only by their directions and rates, without relation to any vowel targets of the vowel triangle. Such experiments show that these transitions can be perceived as V1V2. Several issues can then be revisited in the light of this dynamic representation: vocalic reduction, hyper- and hypo-speech, normalization, perceptual overshoot, etc. A fully dynamic representation of both vowels and consonants is proposed.
1. Introduction

Vowels are generally characterized by the first two or three formant frequencies. Each of them can be represented in the acoustic space (F1-F2 plane) by a dot (Peterson and Barney, 1952). This specification is static. Vowels can be produced in isolation without articulatory variations, but in natural speech such cases are atypical, since their acoustic characteristics are not stable. They vary with the speaker and with the age and gender of the speaker, with the consonantal context (coarticulation), with the speaking rate (reduction phenomena), and with the language (Lindblom, 1963). So vowels are classified into crude classes, first according to the language, and then according to speaker categories. Within each category, vowels can be specified in terms of underlying 'targets' corresponding to the context- and duration-independent values of the formants as obtained by fitting "decaying exponentials" to the data points (Moon and Lindblom, 1994). The point in focus here is that this specification is static and, significantly, may be taken to imply that the perceptual representation corresponds to the target values (Strange, 1989). At this point, several questions can be raised: How is the perceptual representation obtained if the vowel targets depend on the speaker, and are rarely reached in spontaneous speech production? Are the representations the same from one person to another (Johnson, et al., 1993; Carré and Hombert, 2002; Whalen, et al., 2004)? How is this perceptual representation built: by learning, or is it innate? How is the vowel perceived with its different acoustic characteristics according to the context and the speaker: by normalization (Nordström and Lindblom, 1975; Johnson, 1990; Johnson, 1997)? Why is vowel perception less categorical than consonant perception (Repp, et al., 1979; Schouten and van Hessen, 1992)? Many studies have been undertaken to answer these questions. The results are generally incomplete and contradictory. They cannot be used to set up a simple theory explaining all the results, but they help highlight the importance of dynamics in vowel perception (Shankweiler, et al., 1978; Verbrugge and Rakerd, 1980; Strange, 1989). In view of the fact that sensory systems have been shown experimentally to be more sensitive to changing stimulus patterns than to purely steady-state ones, it appears justified to look for an alternative to static targets, a specification that recognizes the true significance of signal time variations. One possibility is that dynamics can be characterized by the direction and the rate of the vocalic transitions:
• Vowel-vowel trajectories in the F1/F2 plane are generally rectilinear (Carré and Mrayati, 1991), so they can be characterized by their direction. Privileged directions are observed in the production of vowels (in single-vowel or CV syllables), called 'vowel inherent spectral changes' (VISC) (Nearey and Assmann, 1986). Moreover, perception experiments show the importance of VISC in improving vowel identification (Hillenbrand, et al., 1995; Hillenbrand and Nearey, 1999).
• On the topic of transition rate, we recall the results of Kent and Moll (1969): "the duration of a transition – and not its velocity – tends to be an invariant characteristic of VC and CV combinations". Gay (1978) confirmed these observations with different speaking rates and with vowel reduction: "the reduction in duration during fast speech is reflected primarily in the duration of the vowel,… the transition durations within each rate were relatively stable across different vowels…". If the transition duration is invariant across a set of CVs with the same C and varying Vs, it follows that the transition rate depends both on the consonant and on the vowel to be produced. So the time domain could play an important role in the identification of vowels (Fowler, 1980). For example, to discriminate the sequences [ae], [aɛ], and [ai], what acoustic information does the listener need? The answer is that the second vowel V2 can be detected by using the transition rate as a cue. This parameter can be specific to the speaker and/or related to the syllabic rate. At the very beginning of the transition, and throughout the transition, there is sufficient information to detect V2. There are no privileged points in time (for example, the middle of V2, where the formant frequencies would be measured) for V2 detection. The rate measure is therefore very appropriate in a noisy environment. It can also explain the perceptual results obtained by Strange et al. (1983) in 'silent center' experiments that replaced the center of the vowel by silence of equivalent duration. This manipulation preserves the direction and the rate of the transition as well as the temporal organization (syllabic rate). Also relevant are experiments by Divenyi et al. (1995) showing that, in V1V2 stimuli, V2 was perceived even when V2 and the last half of the transition were removed by gating. Finally, it can be observed that both Arabic (Al-Tamimi, et al., 2004) and Vietnamese subjects (Castelli and Carré, 2005) have difficulties in producing and perceiving isolated vowels.
In this paper, a deductive theoretical approach to the study of vowel-to-vowel dynamics is proposed. It leads to a specification of vocalic trajectories in the acoustic space characterized by their directions. Then, results on the production and perception of V1V2 transitions by French subjects are presented. Production measurements of the F1 and F2 transition rates are represented in an F1 rate/F2 rate plane. In a perceptual study, we focus on the direction and rate of synthesized transitions situated outside the traditional F1/F2 vowel triangle. This situation enables the study of transitions characterized only by their directions and rates, without reference to any vowel targets in the vowel triangle.
2. Deductive approach and vowel-to-vowel trajectories

Here, we try to infer vowel properties, not from data on vowel production and perception, but by a deductive approach starting from general physical properties of an acoustic tube. If the goal is to build an efficient device for acoustic communication, i.e. a device that gives maximum acoustic contrast for minimum area-shape deformation, then the tube must be structured into specific regions, leading to a corresponding organization of the acoustic space (Carré, 2004; Carré, 2009). A recursive algorithm using the calculation of the sensitivity function (Fant and Pauli, 1974) deforms "efficiently" the shape of the tube in order to increase (or decrease), step by step, the formant frequencies. By "efficiently" we mean that a small or minimal area-shape deformation gives maximum formant variation in the F1/F2 plane. The acoustic space automatically obtained corresponds to the vowel triangle (which is, consequently, maximal in size; it cannot be larger) (Carré, 2009). [a] is obtained with a back constriction and a front cavity; [i] with a front constriction and a back cavity; [u] with a central constriction and two lateral cavities. In the first two cases, the front end of the tube is open; in the last case the front end is almost closed. The specific regions automatically obtained correspond to the main places of articulation (front, back and central) used to produce vowels (Carré, 2009). The shape deformations can be represented by deformation gestures corresponding to three different "tongue" gestures and one "lip" gesture. The three different tongue gestures are: a transversal deformation gesture from front to back places of articulation (and vice-versa) producing [ia] (or [ai]), a longitudinal deformation gesture from front to central constriction producing [iu], and a longitudinal displacement gesture from back to central place of articulation producing
[au] (Carré, 2009). The lip gesture is used to reach [u] (low F1 and F2). The deformations can easily be modelled by the Distinctive Region Model (DRM) (Mrayati, et al., 1988; Carré, 2009).
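The deformation procedure is described here only in prose, so a toy re-implementation may help fix ideas. The sketch below is not the DRM and not the published algorithm: it estimates the resonances of a lossless concatenated-tube model by the transfer-matrix method and then takes greedy steps, perturbing one section at a time by a fixed small fraction and keeping whichever perturbation yields the largest F1-F2 displacement for that fixed deformation. The section count, tube length, starting areas and step size are all illustrative assumptions.

```python
import numpy as np

C_SOUND = 350.0  # speed of sound, m/s (illustrative)

def chain_matrix(f, areas, seg_len):
    """Lossless-tube transfer matrix: one 2x2 factor per section."""
    k = 2 * np.pi * f / C_SOUND
    c, s = np.cos(k * seg_len), np.sin(k * seg_len)
    m = np.eye(2, dtype=complex)
    for a in areas:
        z = 1.0 / a  # characteristic impedance up to a constant (ratios suffice)
        m = m @ np.array([[c, 1j * z * s], [1j * s / z, c]])
    return m

def formants(areas, seg_len, fmax=3500.0, df=5.0):
    """Resonances of a closed-glottis/open-lips tube: zero crossings of
    the (purely real) M22 element of the chain matrix."""
    freqs = np.arange(df, fmax, df)
    g = np.array([chain_matrix(f, areas, seg_len)[1, 1].real for f in freqs])
    idx = np.nonzero(np.sign(g[:-1]) != np.sign(g[1:]))[0]
    return freqs[idx]

def deformation_step(areas, seg_len, frac=0.05):
    """Greedy 'efficient deformation': try a small area change in each
    section and keep the one giving the largest F1-F2 displacement."""
    base = formants(areas, seg_len)[:2]
    best, best_gain = areas, 0.0
    for i in range(len(areas)):
        for sgn in (+1.0, -1.0):
            trial = areas.copy()
            trial[i] = max(0.2, trial[i] * (1 + sgn * frac))
            fs = formants(trial, seg_len)[:2]
            if len(fs) == 2:
                gain = float(np.hypot(*(fs - base)))
                if gain > best_gain:
                    best_gain, best = gain, trial
    return best

# start from a uniform 17 cm tube split into 18 sections (areas in cm^2;
# only area ratios matter for the resonance condition)
areas, seg_len = np.full(18, 4.0), 0.17 / 18
for _ in range(20):
    areas = deformation_step(areas, seg_len)
print(formants(areas, seg_len)[:2])  # drifts away from the neutral ~(500, 1500)
```

Starting from the uniform tube (the neutral configuration near 500/1500 Hz), repeated steps push the formants toward the edges of the acoustic space, which is the qualitative behavior the deductive approach predicts.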
Figure 1. Vocalic trajectories obtained by deduction from acoustic characteristics of a tube and corresponding vowels. Dotted lines are labialized trajectories (Carré, 2009).
From the DRM model, eight more or less rectilinear trajectories structuring the acoustic space were obtained: [ai], [au], [iu], [ay], [], [uy], [i], [yu] (Figure 1). The maximum acoustic space obtained by this approach fits well with the vowel triangle. The use of the DRM does not lead to characterizing vowels first, but rather privileges vocalic trajectories. A maximum acoustic contrast criterion (Carré, 2009) would select the endpoints, and intermediate points on the trajectories, which correspond well with the vowels given, for example, by Catford (1988). Recall that the recursive algorithm calculates, from an initial shape of the area function of a tube, a new shape according to a minimum of energy criterion (minimum deformation leads to maximum acoustic variation) (Carré, 2004). This operation is repeated until the maximum acoustic limits of the tube are reached. Thus, the algorithm simulates an evolutionary process (the goal is not pre-specified at the beginning of the process) by simply increasing acoustic contrast, step by step, according to a minimum
of energy criterion. The resulting trajectories in the acoustic plane can be characterized by their directions. On the basis of the above discussion stressing the importance of the formant trajectories we hypothesize that the perception of vowels might be understood, not in terms of static targets, but in terms of a dynamic measure of the direction and rate of spectral change characterizing such trajectories. We will test this hypothesis in a series of studies of vowel-to-vowel sequences.
3. Vowel-vowel production

[V1V2] sequences were produced 5 times by 5 male and 5 female speakers, all French, at 2 different rates (normal and fast). In the following experiments, V1 is always /a/ and V2 is one of the French vowels situated on the [ai] ([i, ɛ, e]), [ay] ([y, œ, ø]) or [au] ([u, o, ɔ]) trajectories. A French word containing V2 appears on the computer screen in alphabetic representation to help the subject, who may have no phonetic knowledge. To exemplify, the instructions were of the following form: at 'fast' rate, say 'a-i' as in the word 'lit'. The recording process was controlled by PC software that randomly presented the succession of items to be recorded. In the case of bad pronunciation or hesitation, the speaker could pronounce the item again. Formant frequencies were measured using the Praat software every 6.25 ms. The formant variations were smoothed with a 43.75 ms time window by calculating the mean values of the formants obtained for 7 successive frames (running mean value). Then the derivative was taken to obtain the formant transition rate. The formant rate was also smoothed with a 43.75 ms window (running mean value). Figure 2 shows, for [ai] as produced by speaker RC at normal rate, the formant transitions, the formant transition rate, and the formant transition acceleration, in the time domain and in the plane defined by the F2/F1 parameters. Maxima and minima of the F1 and F2 frequencies, rates and accelerations were measured to characterize the formant transitions.
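The smoothing-and-differentiation chain just described is easy to restate in code. The sketch below assumes nothing about Praat's formant tracker; it starts from an already-extracted formant track sampled every 6.25 ms and applies the two 7-frame (43.75 ms) running means and the derivative, as in the text. The toy track at the end is invented for illustration.

```python
import numpy as np

FRAME_DT = 6.25  # ms between formant estimates, as in the text
WIN = 7          # 7 frames = 43.75 ms running-mean window

def running_mean(x, win=WIN):
    """Centered moving average; the window shrinks symmetrically at edges."""
    x = np.asarray(x, dtype=float)
    half, out = win // 2, np.empty(len(x))
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out[i] = x[lo:hi].mean()
    return out

def transition_rate(formant_hz):
    """Formant track (Hz) -> smoothed transition rate (Hz/ms)."""
    smoothed = running_mean(formant_hz)
    return running_mean(np.gradient(smoothed, FRAME_DT))

# invented toy track: a 100 ms linear F2 rise of 1000 Hz, plus noise
t = np.arange(0, 300, FRAME_DT)
f2 = 1200 + np.clip((t - 100) * 10, 0, 1000) + np.random.randn(len(t)) * 20
print(round(transition_rate(f2).max(), 1))  # close to the true 10 Hz/ms
```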
Figure 2. [ai] production, at normal rate, for speaker RC, a) F1 and F2 formant transition in the time domain, b) corresponding formant trajectory in the F1-F2 plane, c) F1 and F2 rate in the time domain, d) formant rate trajectory in the F1 rate-F2 rate plane, e) F1 and F2 acceleration in the time domain, and f) formant acceleration trajectory in the F1 acceleration-F2 acceleration plane.
3.1. Vowel formants

The vowel formant frequencies for [aV] as produced by speaker EM are represented in the F1/F2 plane (Figure 3). The data points are the mean values of the 5 occurrences for normal and fast production (N and F). Standard deviations for F1 and F2 are also indicated. There are no significant differences between normal and fast productions, which is to be expected because, in V1-V2 production, the target V2 is always reached. The vowels can be easily separated.
Figure 3. Mean F1 and F2 frequencies and standard deviations plotted in the F1/F2 plane for [aV] tokens produced by speaker EM, at normal (N) and fast (F) rates. Each item is produced 5 times (45 times for [a]).
3.2. [aV] transitions

3.2.1. [aV] characteristics in the F1-F2 plane

Figure 4 shows the different formant trajectories for [ai], [ae], [aɛ], [ay], [aœ], [aø] and [au], [ao], [aɔ] in the F1-F2 plane for one speaker, EM. Each trajectory is a single production pronounced at normal rate. The trajectories are rather rectilinear and follow, as far as [ai] and [au] are concerned, the basic trajectories obtained by deduction (Figure 1): [e] and [ɛ] are situated
along the formant movement of [ai]; [o] and [ɔ] lie on the [au] trace. Figure 4 also shows that the end parts of the trajectories corresponding to V can be characterized by small changes along the rectilinear path of the trajectory. This result corresponds to the "vowel inherent spectral changes" observed by Nearey and Assmann (1986) for the final vowel in VCV sequences and by Carré et al. (2004) for isolated vowels. These characteristics are also observed for the other speakers.
Figure 4. [aV] formant trajectories for speaker EM (normal rate). The small changes at the ends of the trajectories corresponding to V do not deviate significantly from the rectilinearity of the trajectories.
3.2.2. [aV] transition rate

The representation of the F1-F2 transition rate as in Figure 2d is used to compare the [aV] transitions for all V. Figure 5a shows the results for speaker EM with one utterance of each [aV]. It can be observed that if, for example, [ai], [ae], and [aɛ] are compared, three distinct rate trajectories can be discriminated. The rate trajectory of [ai] is longer than that of [ae], and still longer than that of [aɛ]. In other words, the maximum rate of [ai] is greater than the maximum rates of [ae] and of [aɛ]. Figure 5b shows, for [ai], [ae], [aɛ] and for one production by speaker EM, the first formant rates in the time domain. The three vowels can be discriminated according to the maximum rates, corresponding more or less to the middle of the transition ([ai] maximum rate > [ae] maximum rate > [aɛ] maximum rate). Discrimination
can also be obtained throughout the transition, and especially from its very beginning (from the very beginning of the production task). Figure 5b shows that the three transitions (for [ai], [ae], [aɛ]), synchronized at the beginning (corresponding to about t = 50 ms), stop at about t = 150 ms. Because of the more or less constant duration of the transition, the three rates, throughout the transition, follow the inequality: [ai] rate > [ae] rate > [aɛ] rate. In principle, discrimination between the three final vowels is thus possible throughout the transition, and especially at its beginning.
Figure 5. a) [aV] trajectory rates in the F1 rate-F2 rate plane for speaker EM (normal rate) and b) F1 rates in the time domain for [ai], [ae], [aɛ].
Figure 6a shows the formant transition rates (mean data and standard deviations for the 5 productions) in the F1 rate/F2 rate plane for speaker EM, for normal and fast production. The rates indicated are the maximum rates of the transitions. We do not observe large differences between normal and fast production, and the vowels can be discriminated according to their rates. According to the vowel-target approach, identification would be based on formant frequency information at the end of the transition. It would not be necessary to know the characteristics of the preceding vowel (here [a]). In contrast, the dynamic approach assumes that the directions and slopes of the transitions are important parameters. The identification of the vowel V would then depend on the departure point in acoustic space. Standard deviations can be reduced by normalization based on the formant values of the initial [a] (Figure 6b). If we compare the two results, vowel targets (Figure 3, with F1 and F2) and formant transition rates (Figure 6, with F1 rate and F2 rate), the difference in
distinctiveness is not evident. However, in this experiment at normal and fast rates of V1V2 production, the vowel targets V2 are always reached. A further analysis of items such as V1V2V1, with a possible vowel-reduction effect at fast production rates, is thus necessary.
b) Figure 6. a) Vowel transition maximum rates of the transition [aV] for normal (N) and fast production (F) (speaker EM); b) Same data but the formant frequencies F1 and F2 of each [a] vowel at the beginning of the transition are taken into account to normalize the rates.
3.2.3. [aV] transition duration

The preceding results presuppose that the transition durations are more or less constant for all the [aV] produced by a given speaker. Figure 7 shows the transition durations for speaker EM at normal and fast production. The transition duration is defined as the interval between the maximum and the minimum of the acceleration curve for F1 (see Figure 2e). The duration of the transition is around 10% smaller for faster production. The standard deviation is small for both. Our results correspond to those of Kent and Moll (1969) and Gay (1978), but have to be confirmed with data from more speakers.
Figure 7. Transition durations for all the [aV] produced by speaker EM at normal and fast rate.
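The duration criterion just used (the interval between the extrema of the F1 acceleration curve) can be written down directly. This is a sketch of the definition, not the authors' script, and the logistic test curve is invented.

```python
import numpy as np

def transition_duration(f1_hz, dt_ms=6.25):
    """Interval (ms) between the maximum and the minimum of the F1
    acceleration curve, i.e. the definition used for Figure 7."""
    f1 = np.asarray(f1_hz, dtype=float)
    accel = np.gradient(np.gradient(f1, dt_ms), dt_ms)
    return abs(int(np.argmax(accel)) - int(np.argmin(accel))) * dt_ms

# invented smooth 700 -> 300 Hz F1 fall with a ~15 ms logistic time constant
t = np.arange(0, 300, 6.25)
f1 = 300 + 400 / (1 + np.exp((t - 150) / 15))
print(transition_duration(f1))  # about 40 ms for this curve
```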
3.2.4. [aV] transitions for 10 speakers

Our first results on [aV] production for a single speaker lead us to hypothesize that transition rates ought to be invariant across speakers (male and female). To test the hypothesis, the first two formants of the [aV] tokens as produced by 10 speakers (5 males and 5 females) were calculated with the Praat software. The formants of the V are represented in Figure 8 in the F1-F2 plane. The standard deviations indicate that these static target representations show significant variability. However, representations of maximum transition rates exhibit even more variability, which is the opposite of our hypothesis (Figure 9)! These findings raise two questions: a) How accurate
is formant estimation using classical techniques (e.g., linear prediction), especially for female voices? And b) would it be possible to reduce variability by taking syllable rate or transition rate into account?
Figure 8. F1-F2 plane representation of the vowels [V] from [aV] produced at normal (N) and fast (F) rate by 10 speakers (5 males and 5 females).
First, good formant estimation is very difficult to obtain, especially for female voices. Furthermore, formant measurement errors are amplified by the derivative process used to compute transition rates. For example, a formant frequency error of 10% can lead to an error in transition rate of 100%. Using a large time window to compute mean values can reduce these errors, but delays the rate measurement. Problematic aspects of formant detection will be discussed further below. Second, each speaker has his or her own transition rate, which might also change slightly with the syllable rate (normal/fast production); for instance, see the transition rates for normal and fast production for speaker EM in Figure 7.
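The amplification claim can be made concrete with invented numbers: a realistic inter-frame F2 step is compared below with a 10% misestimate of a 2000 Hz formant on a single frame.

```python
# Invented illustration of error amplification by differentiation.
dt = 6.25                       # ms between frames
f2_true = (2000.0, 2125.0)      # Hz: a genuine 20 Hz/ms transition
f2_meas = (2200.0, 2125.0)      # first frame misread by 10% (200 Hz)
true_rate = (f2_true[1] - f2_true[0]) / dt   # 20 Hz/ms
meas_rate = (f2_meas[1] - f2_meas[0]) / dt   # -12 Hz/ms
print(true_rate, meas_rate)     # the rate error is well over 100%
```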
Figure 9. Maximum transition rates in the F1 rate versus F2 rate plane for [aV] uttered by 10 speakers (5 males and 5 females) for normal (N) and fast (F) production.
Figure 10. Maximum F2 and F1 transition rates for [aV] uttered by 10 speakers (5 males and 5 females) for normal (N) and fast (F) production after normalization based on transition durations.
In view of these considerations we decided to normalize the rate measurements with respect to transition duration. Transition durations were obtained from the time interval between the maximum and the minimum of the acceleration curve for F1 (see Figure 2e). Figure 10 shows the new results (the reference transition duration was 100 ms). The standard deviations are clearly reduced, but are still greater than the ones obtained with the target formant measurements.
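The chapter states that rates were normalized with respect to transition duration against a 100 ms reference but does not spell out the formula. One natural reading, assumed in this sketch, is a linear rescaling: since the maximum rate scales roughly as (formant excursion)/(duration), multiplying by duration/100 ms maps every transition onto the reference duration.

```python
def normalize_rate(max_rate_hz_per_ms, duration_ms, ref_ms=100.0):
    """Assumed linear normalization: rate x duration is a duration-free
    excursion; dividing by the 100 ms reference restores rate units."""
    return max_rate_hz_per_ms * duration_ms / ref_ms

# a fast 60 ms transition and a slow 120 ms one with the same excursion
print(normalize_rate(10.0, 60.0), normalize_rate(5.0, 120.0))  # both 6.0
```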
4. Transition perception

Since the direction and rate of transitions provide discriminating acoustic information on the vowel identities in vowel-to-vowel sequences, it seems possible that these two attributes could be used in perception. In fact, the rates of F1 and F2 give the direction in the F1/F2 plane. Under such a hypothesis, the starting point must be known (Carré et al., 2007), but it can also be considered that, for example, a high negative F1 rate and a high positive F2 rate lead to /ai/ without prior information on the first vowel. To test this hypothesis, trajectory stimuli outside the vowel triangle were chosen. The use of normal target values for the vowels was thus abandoned, but typical rates and directions in the acoustic space were retained. Four different stimuli (A, B, C, D) were synthesized with 2 formants. The trajectories of these sequences are shown in the F1/F2 plane in Figure 11, and in the time domain in Figure 12. F0 is 300 Hz at the beginning of each sequence, held constant during the first quarter of the total duration, and decreases to 180 Hz at the end. Possible responses for the identification tests were chosen during a pre-test, i.e. [iai], [iui], [aua], [aoa]. A fifth option, labelled "????", was offered in case of impossible identification (no response). Twelve subjects took part in the perception tests.
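The stimulus recipe (two synthesized formants; 100 ms steady onset, 100 ms linear transition, 150 ms steady offset; F0 flat at 300 Hz for the first quarter and falling to 180 Hz) can be sketched with a rudimentary source-filter synthesizer. The resonator below is a standard two-pole (Klatt-style) filter, not the authors' synthesizer, and the specific formant endpoint values are invented, since the exact stimulus frequencies are only shown graphically in Figures 11-12.

```python
import numpy as np

SR = 16000  # sampling rate, Hz

def resonator(x, f_track, bw, sr=SR):
    """Time-varying two-pole resonance (Klatt-style difference equation)."""
    y = np.zeros_like(x)
    y1 = y2 = 0.0
    r = np.exp(-np.pi * bw / sr)
    for i, f in enumerate(f_track):
        b1, b2 = 2 * r * np.cos(2 * np.pi * f / sr), -r * r
        y[i] = (1 - b1 - b2) * x[i] + b1 * y1 + b2 * y2
        y1, y2 = y[i], y1
    return y

def contour(v_start, v_end, n):
    """100/100/150 ms steady-transition-steady contour over n samples."""
    n1, n2 = int(n * 2 / 7), int(n * 2 / 7)
    return np.concatenate([np.full(n1, v_start),
                           np.linspace(v_start, v_end, n2),
                           np.full(n - n1 - n2, v_end)])

def synthesize(f1a, f1b, f2a, f2b, dur=0.35):
    n = int(SR * dur)
    # F0: flat 300 Hz for the first quarter, then falling to 180 Hz
    f0 = np.concatenate([np.full(n // 4, 300.0),
                         np.linspace(300.0, 180.0, n - n // 4)])
    pulses = np.diff(np.floor(np.cumsum(f0) / SR), prepend=0.0)  # impulse train
    out = resonator(resonator(pulses, contour(f1a, f1b, n), 80.0),
                    contour(f2a, f2b, n), 120.0)
    return out / (np.abs(out).max() + 1e-9)

# invented endpoints for an A-like trajectory outside the vowel triangle
wave = synthesize(900.0, 1400.0, 2600.0, 1700.0)
```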
Figure 11. The four trajectories (A, B, C, D) in the plane F1-F2 and the vowel triangle. The trajectories are outside the vowel triangle. Their directions and sizes in the acoustic plane vary.
Figure 12. F1 and F2 in the time domain for the four sequences (A, B, C, D). The duration of the first part of each sequence was 100 ms, the duration of the transition was constant and equal to 100 ms. The duration of the last part was 150 ms. The rates of the transitions in Hz/ms vary. The first and last parts of each sequence were stable and equal in formant frequency. The transitions of the four sequences reached more or less the same point in the acoustic plane.
The responses (in %) are given in Figure 13. Sequence A is identified 71% of the time as [iai]. B is identified 87% of the time as [iui]. C is identified 95% and D 96% of the time as [aua] or [aoa]; the long trajectory, corresponding to a faster transition rate, is more often [aua], and the short one more often [aoa]. The "no response" option is generally avoided. Sequence A, which has the same direction in the acoustic plane and the same transition rate as [ai], is perceived as /iai/; B, which has more or less the same direction and rate as [iu], is perceived as /iui/; C, which has the same direction and rate as [au], is perceived as /aua/; and D, which has the same direction as [au] but a lower rate, is more often perceived as /aoa/. These results can be summarized by saying that the region where the 4 trajectories converge (acoustically close to [a]) is perceived as /a/ or /u/ or /o/ depending on the direction and length (i.e. rate of the transition) of the trajectories.
Figure 13. Results of the perception tests. The sequence A is mainly perceived as [iai], B as [iui], C as [aua] (long trajectory), D as [aoa] (short trajectory).
5. Discussion

At different levels, our preliminary results raise several problems and questions about the dynamic approach and its consequences for the theory of speech production and perception. The findings also motivate a closer examination of current speech analysis techniques and of the methodology of perception tests. The dynamic approach is very attractive because it may permit consonants and vowels to be integrated within a single theory. Conceivably, using
the parameter of transition rate, one might propose that fast transitions tend to produce consonants, whereas slow transitions produce vowels. In the case of perceiving V1V2 sequences, we have reported acoustic measurements indicating that signal information on V2 is available throughout the transition, and especially at its very beginning. This strategy presupposes that the identity of the previous V1 has been determined. The question is: How is this information to be obtained? According to a target theory of speech perception, V2 can only be identified on the basis of its target, the goal being to reach the target irrespective of the starting point. One of the aims of the present study has been to suggest that dynamic parameters such as direction of spectral change in acoustic space and transition rate could be more invariant across males, females and children than vowel targets. This hypothesis would make normalization in terms of static targets unnecessary. However, normalization of transition rate with respect to the different transition durations observed in production would seem necessary. Such normalization could be readily available perceptually, thanks to temporal coding and the sensitivity of the auditory system to rate (derivatives) and acceleration (Pollack, 1968; Divenyi, 2005). Hyper- and hypo-speech and reduction phenomena (Lindblom, 1963; Lindblom, 1990) in fast and normal speech (Kent and Moll, 1969; Gay, 1978; van Son and Pols, 1990; van Son and Pols, 1992; Pols and van Son, 1993) should be further studied with respect to the parameters of transition direction and rate. The results obtained from 'silent center' experiments (Strange, et al., 1983) can be explained in terms of 'dynamic specification', in the sense that it is not necessary to compensate for 'undershoot' at the production level (target not reached because of coarticulation, fast speech or hypo-speech) by perceptual 'overshoot' (Lindblom and Studdert-Kennedy, 1967), calculated 'not solely by the formant-frequency pattern at the point of closest approach to target, but also by the direction and rate of adjacent formant transitions'. This finding is compatible with the assumption that, given a specification of their point of origin in phonetic (acoustic) space, the direction and rate of formant transitions could be sufficient to specify the following vowel. Our preliminary results on vowel production represent a first few steps in support of a fully dynamic approach. More studies on the normalization process must also be undertaken. It is well known that predictive coding based on a model of speech production is not well adapted for analyzing speech signals with high fundamental frequencies or with noise. Furthermore, such a technique is ill-suited
to measuring spectral variations. A dynamic approach necessitates a reconsideration of analysis techniques in light of our knowledge of the auditory system. The spikes observed in auditory nerve fibers are statistically synchronized with the time-domain shape of the basilar membrane excitation around the characteristic frequencies (Sachs et al., 1982), so they can give information not only on the amplitudes of spectral components but also on the time-domain shape of the components, and thus on the phases. The phase variation (-180° around formant frequencies for the second-order filters describing the transfer function of the vocal tract (Fant, 1960)) could be used to measure the rate of the transitions. To attain some of these goals, new tools would be needed. For example, Chistovich et al. (1982) described a model of the auditory system which detects spectral transitions without specific formant detection. These considerations make it evident that, in order to test the hypothesis of 'greater invariance in transition rates than in formant targets', it would be necessary both to improve current analysis techniques and to study more deeply the normalization of transition durations. Perception tests of formant transitions outside the vowel triangle encourage us to study general dynamic properties of the auditory system that may be used in speech. Formant transitions can be converted into sinewaves: preliminary tests have shown results close to those obtained with formants. The differences remain to be explained. Many experiments can be undertaken with such a tool, creating speech illusions.
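The -180° remark can be checked against the textbook second-order resonance. The sketch below evaluates the phase of an analog two-pole transfer function around an invented 1500 Hz formant; the phase swings from near 0° below the resonance through -90° at it towards -180° above, as the text notes.

```python
import numpy as np

def resonance_phase_deg(f_hz, fc=1500.0, bw=100.0):
    """Phase of H(jw) = wc^2 / (wc^2 - w^2 + j*wc*w/Q) for a two-pole
    resonance with center fc and bandwidth bw (Q = fc/bw)."""
    w = 2 * np.pi * np.asarray(f_hz, dtype=float)
    wc, q = 2 * np.pi * fc, fc / bw
    h = wc**2 / (wc**2 - w**2 + 1j * wc * w / q)
    return np.degrees(np.angle(h))

print(resonance_phase_deg([500, 1450, 1500, 1550, 3000]))
# near 0 deg well below fc, -90 at fc, approaching -180 well above
```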
6. Conclusions

This paper follows up on previously published results on the deductive approach (Carré, 2004) proposing a dynamic view of speech production, on acoustic modelling (Mrayati et al., 1988), on the structuring of an acoustic tube into regions corresponding to the main places of articulation, and on the prediction of vocalic systems (Carré, 2009). The preliminary results presented here on vowels must be extended to consonants, many of which are intrinsically dynamic. At the same time, the evident importance of dynamic characteristics does not mean that static targets are not used in perception. The limits of the dynamic approach, and the balance between the use of static and dynamic parameters in perception, must be determined. But the dynamic approach needs new ways of thinking and new tools. Formant transitions cannot be obtained from a succession of static values
but from directions and slopes. This means that a new tool able to measure these characteristics directly has to be developed. The dynamic approach is not a static approach with dynamic parameters added on; it must be intrinsically dynamic. It calls for an epistemological study of the dynamic nature of speech (Carré et al., 2007).
Acknowledgements

The author thanks Eric Castelli, Pierre Divenyi, Björn Lindblom, Egidio Marsico, François Pellegrino, Michael Studdert-Kennedy and Willy Serniclaes for their very helpful comments and stimulating discussions. He also thanks Claire Grataloup for the efficient management of the perception tests.
References

Al-Tamimi, J., Carré, R. and Marsico, E. 2004 The status of vowels in Jordanian and Moroccan Arabic: insights from production and perception. Journal of the Acoustical Society of America 116: S2629.
Carré, R.
2004 From acoustic tube to speech production. Speech Communication 42: 227-240.
2009 Dynamic properties of an acoustic tube: Prediction of vowel systems. Speech Communication 51: 26-41.
Carré, R. and Hombert, J. M. 2002 Variabilité phonétique en production et perception de parole : stratégies individuelles. In: J. Lautrey, B. Mazoyer and P. van Geert (eds.), Invariants et Variabilité dans les Sciences Cognitives. Paris: Presses de la Maison des Sciences de l'Homme.
Carré, R. and Mrayati, M. 1991 Vowel-vowel trajectories and region modeling. Journal of Phonetics 19: 433-443.
Carré, R., Pellegrino, F. and Divenyi, P. 2007 Speech dynamics: epistemological aspects. In: Proceedings of the ICPhS, Saarbrücken. pp. 569-572.
Castelli, E. and Carré, R. 2005 Production and perception of Vietnamese vowels. In: ICSLP, Lisbon. pp. 2881-2884.
Catford, J. C. 1988 A Practical Introduction to Phonetics. Oxford: Clarendon Press.
Chistovich, L. A., Lublinskaja, V. V., Malinnikova, T. G., Ogorodnikova, E. A., Stoljarova, E. I. and Zhukov, S. J. 1982 Temporal processing of peripheral auditory patterns of speech. In: R. Carlson and B. Granström (eds.), The Representation of Speech in the Peripheral Auditory System. Amsterdam: Elsevier Biomedical Press. pp. 165-180.
Divenyi, P., Lindblom, B. and Carré, R. 1995 The role of transition velocity in the perception of V1V2 complexes. In: Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm. pp. 258-261.
Divenyi, P. L. 2005 Frequency change velocity detector: A bird or a red herring? In: D. Pressnitzer, A. de Cheveigné and S. McAdams (eds.), Auditory Signal Processing: Physiology, Psychology and Models. New York: Springer-Verlag. pp. 176-184.
Fant, G. 1960 Acoustic Theory of Speech Production. The Hague: Mouton.
Fant, G. and Pauli, S. 1974 Spatial characteristics of vocal tract resonance modes. In: Proceedings of the Speech Communication Seminar, Stockholm. pp. 121-132.
Fowler, C. 1980 Coarticulation and theories of extrinsic timing. Journal of Phonetics 8: 113-133.
Gay, T. 1978 Effect of speaking rate on vowel formant movements. Journal of the Acoustical Society of America 63: 223-230.
Hillenbrand, J. M., Getty, L. A., Clark, M. J. and Wheeler, K. 1995 Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America 97: 3099-3111.
Hillenbrand, J. M. and Nearey, T. M. 1999 Identification of resynthesized /hVd/ utterances: Effects of formant contour. Journal of the Acoustical Society of America 105: 3509-3523.
Johnson, K.
1990 Contrast and normalization in vowel perception. Journal of Phonetics 18: 229-254.
1997 Speaker perception without speaker normalization. An exemplar model. In: K. Johnson and J. W. Mullennix (eds.), Talker Variability in Speech Processing. New York: Academic Press. pp. 145-165.
Johnson, K., Flemming, E. and Wright, R. 1993 The hyperspace effect: Phonetic targets are hyperarticulated. Language 69: 505-528.
Kent, R. D. and Moll, K. L. 1969 Vocal-tract characteristics of the stop cognates. Journal of the Acoustical Society of America 46: 1549-1555.
Lindblom, B.
1963 Spectrographic study of vowel reduction. Journal of the Acoustical Society of America 35: 1773-1781.
1990 Explaining phonetic variation: a sketch of the H and H theory. In: A. Marchal and W. J. Hardcastle (eds.), Speech Production and Speech Modelling. NATO ASI Series. Dordrecht: Kluwer Academic Publishers. pp. 403-439.
Lindblom, B. and Studdert-Kennedy, M. 1967 On the role of formant transitions in vowel perception. Journal of the Acoustical Society of America 42: 830-843.
Moon, J. S. and Lindblom, B. 1994 Interaction between duration, context and speaking style in English stressed vowels. Journal of the Acoustical Society of America 96: 40-55.
Mrayati, M., Carré, R. and Guérin, B. 1988 Distinctive regions and modes: A new theory of speech production. Speech Communication 7: 257-286.
Nearey, T. and Assmann, P. 1986 Modeling the role of inherent spectral change in vowel identification. Journal of the Acoustical Society of America 80: 1297-1308.
Nordström, P. E. and Lindblom, B. 1975 A normalization procedure for vowel formant data. In: 8th International Congress of Phonetic Sciences, Leeds.
Peterson, G. E. and Barney, H. L. 1952 Control methods used in the study of the vowels. Journal of the Acoustical Society of America 24: 175-184.
Pollack, I. 1968 Detection of rate of change of auditory frequency. Journal of Experimental Psychology 77: 535-541.
Pols, L. C. W. and van Son, R. J. 1993 Acoustics and perception of dynamic vowel segments. Speech Communication 13: 135-147.
Repp, B., Healy, A. F. and Crowder, R. G. 1979 Categories and context in the perception of isolated steady-state vowels. Journal of Experimental Psychology: Human Perception and Performance 5: 129-145.
Sachs, M., Young, E. and Miller, M. 1982 Encoding of speech features in the auditory nerve. In: R. Carlson and B. Granström (eds.), The Representation of Speech in the Peripheral Auditory System. Amsterdam: Elsevier Biomedical. pp. 115-130.
Schouten, M. and van Hessen, A. 1992 Modeling phoneme perception. I: Categorical perception. Journal of the Acoustical Society of America 92: 1841-1855.
Shankweiler, D., Verbrugge, R. R. and Studdert-Kennedy, M. 1978 Insufficiency of the target for vowel perception. Journal of the Acoustical Society of America 63: S4.
Strange, W. 1989 Evolving theories of vowel perception. Journal of the Acoustical Society of America 85: 2081-2087.
Strange, W., Jenkins, J. J. and Johnson, T. L. 1983 Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America 74: 695-705.
van Son, R. J. and Pols, L. C. W.
1990 Formant frequencies of Dutch vowels in a text, read at normal and fast rate. Journal of the Acoustical Society of America 88: 1683-1693.
1992 Formant movements of Dutch vowels in a text, read at normal and fast rate. Journal of the Acoustical Society of America 92: 121-127.
Verbrugge, R. R. and Rakerd, B. 1980 Talker-independent information for vowel identity. Haskins Laboratories Status Report on Speech Research SR-62: 205-215.
Whalen, D. H., Magen, H. S., Pouplier, M. and Kang, A. M. 2004 Vowel production and perception: hyperarticulation without a hyperspace effect. Language and Speech 47: 155-174.
Part 2: Typological approaches to measuring complexity
Calculating phonological complexity

Ian Maddieson

1. Introduction

Several simple factors that can be considered to contribute to the complexity of a language's phonological system have been investigated in recent papers (including Maddieson, 2006, 2007). The object in these papers was to see whether, in a large sample of languages, these factors positively correlated with each other or displayed a 'compensatory' relationship in which the elaboration of one factor was counterbalanced by greater simplicity in others. In general, the factors examined tended to show a pattern of positive correlation. That is, languages tended to be distributed along a continuum from lesser to greater phonological complexity, with several factors simultaneously contributing to the increase in complexity. The factors considered in these studies only involved the inventories of consonant and vowel contrasts, the tonal system, if any, and the elaboration of the syllable canon. It is relatively easy to find answers for a good many languages to such questions as 'how many consonants does this language distinguish?' or 'how many types of syllable structures does this language allow?' — although care must be taken to ensure that similar strategies of analysis have been applied so that the data are comparable across languages. It is reasonable to suppose that a language which requires its speakers to encode and to discriminate between a larger number of distinctions is in this respect more complex than one with fewer distinctions. However, these properties of a language's phonological system are far from the only ones that might be considered. Some might even argue that they are not among the most important. In this paper, a number of other aspects that plausibly contribute to phonological complexity will be discussed. The factors that will be considered are: the inherent phonetic complexity of elements in the phonological inventory, the role played by freedom vs. limitation of combinatorial possibilities, the contribution of the frequency of occurrence of different properties, and the relative 'transparency' of the relationships between phonological variants. The discussion of each factor will consider the current
current feasibility of establishing a basis for multi-language comparisons. A final section will consider possible ways of demonstrating that these intuitively plausible factors are actual contributors to complexity. Before proceeding with this discussion, however, it might be valuable to reiterate why measures of phonological complexity are of interest. Many linguists assert in one way or another that all natural human languages are equally complex. Such a view seems primarily to be based on the humanistic principle that all languages are to be held in equal esteem and are equally capable of serving the communicative demands placed on them. In rejecting the notion of ‘primitive’ languages, linguists seem to infer that a principle of equal complexity must apply. For example, in a widely-used basic linguistics textbook Akmajian et al. (1979:4) state that “all known languages are at a similar level of complexity and detail — there is no such thing as a primitive human language.” A common corollary derived from this view is that languages vary in the complexity of parts of their grammar but trade off complexity in one sub-part of the grammar against simplicity elsewhere. The introduction to a recent book on language complexity (Miestamo et al., 2008) asks “is the old hypothesis true that, overall, all languages are equally complex, and that complexity in one grammatical domain tends to be compensated by simplicity in another?” Several of the contributors (e.g. Fenk-Oczlon and Fenk, Sinnemäki) answer in the affirmative. This view is not shared by all. In particular, McWhorter has argued in a number of places (e.g. 2001a, b) that languages vary considerably in their complexity, and it is especially a mark of languages that have passed through a stage of creolization to exhibit reduced complexity. Everett (2004) argued that compared to other languages Pirahã is a language with a substantial number of elements missing from its grammar and lexicon. Several commentators on this paper construed this as a suggestion that Pirahã was in fact a ‘primitive’ language. If indeed it is true that languages differ substantially in complexity, then it follows that such tasks as on-line processing of language input, as well as the initial acquisition of language abilities as a child, would place quite variable demands on individuals, depending on the language involved. Performance might be expected to vary commensurately with task difficulty. If languages today vary in complexity, then hypotheses positing an original evolution of language through the elaboration over time of a less complex pre-language gain plausibility (see Bickerton, 2007 and contributions to Givón & Maile, 2002).
These are among a number of significant scientific issues that arise in connection with linguistic complexity. However, it is not possible to address these questions without some prior consideration of how to define and how to measure complexity in linguistic systems. This paper is offered as one contribution to this discussion.
2. Inherent phonetic complexity

The studies mentioned above looked at the number of consonants, the number of basic vowel qualities, the total number of vowels and the number of tonal contrasts, as well as the syllable canon. The number of basic vowel qualities is the number of vowel distinctions involving the primary properties of height, backness, and rounding as well as tongue root, but not including any distinctions which depend only on such features as length, nasalization or phonation type. The total number of vowels includes any additional vowels distinguished by these features. Only in the case of the syllable canon was any account taken of what might be called ‘inherent’ complexity. Languages were classified as belonging to one of three groups based on the maximally elaborated syllable type permitted. Languages allowing nothing more elaborate than a CV structure were classified as Simple with respect to their syllable structure. Languages permitting no more than a single consonant in the coda and/or onsets of the most common Obstruent + Sonorant types (such as stop + liquid or glide) were classed as Moderately Complex, and those allowing a sequence of two or more consonants in the coda and/or other types of onset sequences, such as Obstruent + Obstruent, Sonorant + Sonorant or Sonorant + Obstruent, or any clusters with three or more members, were classed as Complex. This particular division into three classes reflects a judgment that in CC onsets it is not just the number of consonants in the sequence that contributes to complexity, but also the nature and order of those consonants. This judgment is based on the likelihood that the constituents of onsets such as /tw, bl, fr/ etc. can be more readily recognized than the constituents of onsets such as /sf, tk, ln/ because they display a greater acoustic modulation than the latter (cf. Ohala & Kawasaki, 1984, Kawasaki-Fukumori, 1992). This judgment is indirectly supported by the relative frequency of the two types across the languages of the world. A similar principle of taking into account some evaluation of inherent phonetic complexity might be extended to other
domains. This is easy to imagine with consonant and vowel inventories, as well as with tones. A more refined evaluation of the complexity of a consonant inventory would thus take into account both the nature and the number of the consonants, rather than simply the number. Each consonant type can be assigned a complexity score, and the complexity of the consonant inventory calculated by summing the complexity scores of the consonants it contains. What would be the basis of such scores? A possible scheme was proposed by Lindblom and Maddieson (1988). Consonants were assigned to one of three categories — Basic, Elaborated and Complex. Elaborated consonants have one of a set of ‘complicating’ properties; Complex consonants have more than one of these properties; the residue are Basic. These assignments were made in a largely intuitive way which primarily took into account estimations of the articulatory complexity of a prototypical production of the consonant in question, but which was also influenced by factors made less explicit in the article. According to the scheme proposed, any action of the larynx other than simple voicing or voicelessness co-occurring with the oral articulation, such as breathy voicing, aspiration, or ejective articulation, is a complicating factor. In addition, because of the aerodynamic difficulty of producing local friction in the oral cavity with the reduced air-flow that occurs with voicing, voiced fricatives and affricates are also classed as Elaborated. Voiceless sonorants are also considered Elaborated as they depart ‘from a default mode of phonation’. Any superposition or sequencing of different oro-nasal articulatory configurations is also a complicating factor. For example, prenasalization, lateral release, and secondary articulations are all complicating factors, although simple homorganic affrication is not. Clicks and doubly-articulated consonants such as /k͡p, g͡b/, which require two sets of oral articulatory gestures for their production, are also (at least) in the Elaborated class. Perhaps the least satisfactory part of this classification is the treatment of place of articulation. In principle, ‘configurations representing departures from the near-rest positions of the lips, tongue-tip and tongue-body components of an articulatory model’ are Elaborated. The list of places so defined was given as ‘labio-dental, palato-alveolar, retroflex, uvular and pharyngeal’ (Lindblom and Maddieson, 1988:67). While labiodental and retroflex articulations involve displacements of an articulator toward a surface not opposite that articulator’s rest position, it is unclear that this is equally the case for the remaining places in this list. In standard phonetic textbooks (e.g. Ladefoged, 2006) they are not treated as
‘displaced’ articulations whose articulatory dynamics are for this reason inherently more complex. The classification may well have been influenced by the relative frequency of these places, since retroflex, uvular and pharyngeal consonants are all globally quite rare. Consideration of the interactions between place and manner also seems to have implicitly played a role. Labio-dental fricatives and palato-alveolar fricatives and affricates are common, but plosives at either of these places are rare. Thus, /f/ and /tʃ/ were counted as Basic segments, as was /w/, which has a double articulation. While this scheme could undoubtedly be improved — for example, by finding a more uniform basis for evaluating articulatory complexity or by taking into account considerations of perceptual salience — it provides a basis for a demonstration of cross-language differences in inherent phonetic complexity of consonant systems. In the sample of languages used in Maddieson (2005), the modal number of consonants in the inventory is 22. Three of the languages with this consonant inventory size are Indonesian (Alwi et al., 1998), Birom (Wolff, 1959, Bouquiaux, 1970) and Kiowa (Watkins, 1984, Sivertsen, 1956). The consonant inventories of these three languages, based on the references cited above, are given in Table 1. The lateral in Kiowa is interpreted as a voiced affricate, although it only has this realization in coda position.

Table 1. Consonant inventories of three languages with 22 consonants

Indonesian (Austronesian): plosives p t k b d g; affricates tʃ dʒ; fricatives f s ʃ x h z; nasals m n ɲ ŋ; liquids l r; glides w j

Birom (Niger-Congo; Nigeria): plosives p t k kp b d g gb; affricates tʃ dʒ; fricatives f s v z h; nasals m n ŋ; liquids l r; glides w j

Kiowa (Kiowa-Tanoan; USA): plosives p t̪ k ʔ, pʰ t̪ʰ kʰ, b d̪ g, p' t̪' k'; affricates ts, ts', d̪ɮ; fricatives s z h; nasals m n̪; glide j
Numeric values for each segment corresponding to a consonant class in the Lindblom & Maddieson scheme can be substituted such that Basic = 1, Elaborated = 2, and Complex = 3, as in Table 2. For the purposes of this exercise it is stipulated that /f, ʃ, tʃ, w/ are among the Basic consonants. Otherwise the definitions given above are applied. A summed score for each inventory can then be calculated. Indonesian has a score of 24, Birom a score of 27, and Kiowa a score of 32. The three languages are thus quite well differentiated when the phonetic content of the inventories enters, however imperfectly, into the picture. These relative rankings correspond quite well with linguistic intuitions. For example, it seems likely that the Kiowa consonant system would be harder for learners to master if prior experience of particular languages could somehow be eliminated from influencing the results (perhaps by selecting a pool of learners with a wide variety of language backgrounds).

Table 2. Complexity scores for the consonants of the three languages

Indonesian = 24: Basic (1) p t k b d g tʃ f s ʃ x h m n ɲ ŋ l r w j; Elaborated (2) dʒ z

Birom = 27: Basic (1) p t k b d g tʃ f s m n ŋ l r w j h; Elaborated (2) kp gb dʒ v z

Kiowa = 32: Basic (1) p t̪ k ʔ b d̪ g ts s m n̪ h j; Elaborated (2) pʰ t̪ʰ kʰ p' t̪' k' ts' z; Complex (3) d̪ɮ
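Because the procedure is just a weighted count, the Table 2 totals can be checked mechanically. The following minimal sketch (not code from the paper; segments are given in a rough ASCII transliteration of the IPA symbols above) restates the class assignments and sums the scores:

SCORES = {"Basic": 1, "Elaborated": 2, "Complex": 3}

INVENTORIES = {
    "Indonesian": {
        "Basic": "p t k b d g tS f s S x h m n J N l r w j".split(),
        "Elaborated": "dZ z".split(),
        "Complex": [],
    },
    "Birom": {
        "Basic": "p t k b d g tS f s m n N l r w j h".split(),
        "Elaborated": "kp gb dZ v z".split(),
        "Complex": [],
    },
    "Kiowa": {
        "Basic": "p t k ? b d g ts s m n h j".split(),
        "Elaborated": "ph th kh p' t' k' ts' z".split(),
        "Complex": ["dl"],
    },
}

def inventory_score(classes):
    # Sum of per-consonant complexity values over the whole inventory.
    return sum(SCORES[cls] * len(segments) for cls, segments in classes.items())

for language, classes in INVENTORIES.items():
    size = sum(len(segments) for segments in classes.values())
    print(language, size, inventory_score(classes))
# -> Indonesian 22 24, Birom 22 27, Kiowa 22 32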
A different approach to defining a scale of elaboration based on phonetic content has been proposed by Marsico et al. (2004). Their approach proceeds through an analysis of the structure found in the classification of segments by phonetic categories or phonological features. One principle explored in this work is that a segment is Basic if, when any term is removed from its phonetic description, the remaining properties no longer define an existing segment. Thus, /p/ is a Basic segment because, if any one of the terms ‘voiceless, bilabial, plosive’ is removed from the description, the remaining terms define a class of segments, rather than an individual one. Applied in this way, the procedure divides segments into just two sets, Basic and Complex, and therefore produces a considerably larger class of Basic segments than the one used by Lindblom and Maddieson.
It is, interestingly, dependent on choices made concerning the set of phonetic properties or features used to define segments. For example, in the classification scheme of the IPA, ‘nasal’ is a primitive term applied to segments with an oral closure and a lowered velum so that all air-flow is directed out through the nose. But in some other feature schemes, any segment with nasal airflow is defined as ‘nasal’. Consequently, nasals, in the IPA sense, must be distinguished from all other segments with a nasal air-flow component by some additional feature such as ‘stop’. If /m/ is defined in the IPA as a ‘voiced bilabial nasal’, it is a Basic segment. However, if /m/ is defined as a ‘voiced bilabial nasal stop’ (and ‘nasal’ is taken as a privative feature), then this segment is no longer Basic, because the term ‘nasal’ can be removed, leaving ‘voiced bilabial stop’, which defines the valid segment /b/. An advantage of the Marsico et al. procedure is that, once a feature system has been settled on, the assignment of degree of complexity can proceed algorithmically. Decisions on segmental complexity are not made directly by a linguist, but are derived indirectly from feature assignment, making the procedure more ‘hands-off’. A list of ‘legal’ segments with featural descriptions is required, though. Marsico et al. (2004) used the set of 900-odd distinct consonant types listed in the expanded UPSID database (Maddieson and Precoda, 1990) and a slightly modified version of the feature set used in Maddieson (1984). The ‘default’ application of the procedure to the three languages compared above, assigning 1 to Basic and 2 to non-Basic segments, results in complexity scores for their consonant inventories of 22 for both Indonesian and Birom, and 26 for Kiowa. Only the aspirated stops and the lateral affricate of Kiowa are non-Basic, since ejective-stop, ejective-affricate and labial-velar are treated as single features in the default analysis. A reduced feature set in which each is split into two gives scores of 22 for Indonesian, 24 for Birom and 30 for Kiowa, which seems closer to an intuitively-appropriate ranking. This method could easily be extended to include additional levels of complexity. For example, one could take the maximum number of successive removal steps that can be applied as the index of the starting segment’s complexity, or one could jointly factor in some of the other indices discussed by Marsico et al. (2004), like the Derivationality index. The latter measures the number of different legal segment types that can be built by adding features to a given segment’s description. The more segments that can be derived, the more Basic the starting segment. There seems to be no decisive advantage overall to one of these two approaches over the other. A drawback of the Marsico et al. scheme is that it
is based on a finite list of segments — those which happen to occur in the languages in a particular sample. The addition or deletion of a language with unique segments (from the point of view of the sample) will change some rankings of remaining segments. In Lindblom and Maddieson’s scheme, any newly encountered segment is evaluated on the basis of its production. Its value is independent of other segments. On the other hand, the approach based on Lindblom and Maddieson gives a, perhaps, unwarranted priority to articulatory aspects of complexity over perceptual ones. The procedure used by Marsico et al. is more neutral in this regard, as different features may be based on the articulatory, acoustic or perceptual properties of segments. It is easier to visualize an intuitively-satisfying extension to vowel inventories in the latter approach, as articulatory complexity does not seem to clearly distinguish among the set of oral modally-voiced vowels in the way that acoustic/perceptual factors can, as employed in the Quantal Theory of Stevens (1972, 1989) or in the definition of the ‘focalization’ parameter of Schwartz et al. (1997). In any event, it is not hard to envisage ways of extending these approaches to include the computation of a measure of the inherent complexity of vowel inventories as well as a language’s inventory of tones. We note that ‘non-peripheral’ or ‘interior’ vowels and those with secondary characteristics such as nasalization are generally considered more complex than plain, peripheral vowels (Crothers, 1978). Unitary contour tones are generally considered more complex than level tones, those that rise more complex than those that fall, and those that include both a rising and a falling component the most complex (see, e.g. Gandour & Harshman, 1978). The calculation of the inherent inventory complexity would be expected to build on these insights. The question remains as to whether any scale of inherent segmental or tonal complexity which seems intuitively appropriate is measuring something real about the complexity of a language’s phonology. There are two main sides to this question. One relates to the underlying concept of what complexity is, the other to the problems of demonstrating in some empirical fashion that what is intuitive has some reality. Since the problems are parallel in regard to the various potential complexity-related factors being considered in this paper, the discussion will be deferred until the final section.
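To make the contrast with the hand-assigned scheme concrete, here is a toy sketch of the Marsico et al. removal criterion. The five-segment ‘legal’ inventory and its feature terms are invented for illustration (the real procedure ran over the 900-odd consonant types of the expanded UPSID), but it reproduces the point made above about /m/ and a privative ‘nasal’ feature:

# The only 'legal' segments here are five invented feature bundles.
LEGAL = {
    "p": frozenset({"voiceless", "bilabial", "plosive"}),
    "b": frozenset({"voiced", "bilabial", "plosive"}),
    "t": frozenset({"voiceless", "alveolar", "plosive"}),
    "d": frozenset({"voiced", "alveolar", "plosive"}),
    # /m/ described with a privative 'nasal' on top of a stop description:
    "m": frozenset({"voiced", "bilabial", "plosive", "nasal"}),
}

def is_basic(segment, legal=LEGAL):
    # Basic if removing any single term never leaves the description of
    # another existing segment.
    description = legal[segment]
    others = {d for s, d in legal.items() if s != segment}
    return not any(description - {term} in others for term in description)

for segment in LEGAL:
    print(segment, "Basic" if is_basic(segment) else "non-Basic")
# /m/ comes out non-Basic: dropping 'nasal' leaves /b/'s description.
# Under an IPA-style description without the 'plosive' (stop) term, /m/
# would be Basic — the outcome depends on the feature scheme, as noted.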
3. Combinatorial possibilities

Phonological systems consist, of course, not only of an inventory of segments (and, for some languages, tones), but of patterns of combinations of these elements within larger structures. Languages differ sharply in how freely such elements combine, and this again seems to provide a natural scale of complexity. For example, consonant contrasts are typically more limited in coda than onset position, but the degree of limitation varies considerably. In a quasi-random subsample of 25 languages which permit CVC structures, 21 allow fewer of their consonants as singleton codas than as singleton onsets (more complex structures including clusters were not counted), but the number of singleton coda possibilities ranges from 1 (/m/ in Kiliva) to 32 in Tlingit. To a small degree, differences in combinatorial possibility enter into the syllabic complexity scale mentioned above, but the number of different ways that a structure such as CVC can be constituted played no role. One proposal for a more elaborate scale attempts to calculate the number of possible distinct syllables allowed by the language. In a simple case this is given by the number of permitted onsets × number of permitted nuclei × number of permitted codas × number of tones and/or contrastive accent placements. However, almost all languages impose some broad-based constraints limiting the combinatorial freedom between onsets and nuclei, nuclei and codas, tones and rhyme structures, and so on. For example, in some languages with nasalized vowels in their inventory, the contrast between oral and nasalized vowels is absent or limited after an onset nasal (e.g. in Yoruba and Yélî Dnye); in numerous languages with labialized consonants there is no contrast with their plain counterparts before rounded vowels (e.g. in Hausa and Huastec). Frequently, there are also limitations constraining the pairing of particular nuclei and codas. It is essential that such limitations be known and factored into the calculation in order to avoid potentially large errors. Since this calculation is multiplicative, errors compound exponentially. To see how far astray calculations can go, consider Standard Thai. There are 21 consonants in the inventory, only 9 of which can occur as codas (to which zero coda must be added); /ʔ/ occurs only in coda. A limited number of onset clusters consisting of a stop plus a liquid or /w/ increase the number of onsets to 38 (including zero) (Li, 1977). There are also 21 vowel nuclei, made up of 9 vowel qualities which occur long or short, plus the three diphthongs /iə uə ɯə/. Finally, the language has 5 tones — 3 level, one rising and one falling. A stress contrast must also be
added. If we assume no further combinatorial limitations, this would produce the estimate of 79800 (38 × 21 × 10 × 5 × 2) as the number of distinct possible syllables. However, short vowels cannot occur without a coda except in unstressed syllables, where only three vowels are distinctive and no coda is permitted. In stressed syllables with a short vowel and an obstruent coda, only two, not five, tones contrast. Taking just these limitations into account, the number of unstressed syllables is not over 648, the number of stressed syllables with short vowels not over 1520, and the number of stressed syllables with long vowels or diphthongs not over 22800, for a total of 24968 — less than a third of the original estimate. This number is still an over-estimate because other regularities are not yet incorporated, such as the absence of rhymes containing a long vowel and a glottal stop coda or with a final glide following a cognate vowel (such as /-uuw, -uw/). An advantage of the syllabic computation is that since the numbers of consonants, vowels and tonal/accentual patterns enter into the calculation, it gives a single measure which might substitute for all of these, as well as the syllable complexity measure. It also sidesteps some of the often difficult decisions on segmentation. For example, whether the onset in the Hausa word ƙwai ‘egg’ is analyzed as a sequence of /k'/ and /w/ or as a single labialized segment /kʷ'/ does not affect the number of onsets computed for the language. On the other hand, decisions must be made regarding syllabification, which can be equally or even more problematical. In Hausa, one question would concern the status of geminates. All consonants in Hausa can occur as geminates in word-medial position between vowels, but few consonants are permitted as singleton codas. Given that /k/ never surfaces as a singleton coda (Newman, 2000:404), is it appropriate to syllabify a word such as bukkàa ‘grass hut’ as /buk.kàa/ rather than as /bu.kkàa/? In this case, tonal patterns help decide in favor of the former, since falling tones are only permitted on syllables containing a long vowel or a coda, and they do occur preceding geminates. But decisive arguments are not always available. A few languages were compared with respect to their syllable count in Maddieson (1984), with Hawaiian having the smallest number of possible syllables among the 9 languages in the set, namely 162, and Thai the largest at 23638. Shosted (2006) made a similar computation for a sample of 32 languages, selected to represent the areal-typological groupings established by Johanna Nichols in Nichols (1992) and later work. Shosted estimates Egyptian Arabic as the highest with 108400 possible syllables, and assigns the lowest total to Koiari with 70 syllables, though Tukang Besi with 115 is
not far behind. Since Dutton (1996:11) indicates that “word stress ... is phonemic in Koiari” (albeit with some strong regularities), the Koiari total should probably be doubled to 140, leaving Tukang Besi with the smallest syllable inventory in the set. There are probably other totals that need adjustment in Shosted’s data, as a number of uncertainties are clearly identified and some factors may have escaped attention altogether. However, if the maximum and minimum values of 115 and 108400 are accepted, this should not be taken as an indication that Egyptian Arabic is over 900 times as complicated as Tukang Besi in its phonological patterning. A better comparison is based on the log transform, given the multiplicative nature of the calculation, which produces exponentially rising values of the syllable total for any increase of any single component of the input. This transform also yields a normal distribution for the index, instead of heavy rightward skewing, making it amenable to analysis by statistical methods that assume normality. The log ratio between Tukang Besi and Arabic is 2.44. Syllable counts of this kind can only be reliably computed for a given language if there is either a careful enough statement of the combinatorial possibilities in an accessible linguistic description, or if a sufficiently large lexicon is available which can be searched for the patterns. The lexicon must be syllabified and transcribed, or at least in a form that enables syllable boundaries and phonemic structure to be determined in the entries. For the economically more important languages of the world, such material is readily available, but for many ‘minor’ languages, no reliable distributional statements have been published and a lexicon of only a few hundred words may be all that is available. At present, it would be impossible to calculate the combinatorial possibilities for a large and diversified sample of languages. However, in some traditions of language description, such as the French structuralist model, the required information is regularly provided and the number of languages for which suitable data is available continues to increase.
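The arithmetic behind these estimates is easy to make explicit. The sketch below is only an illustration of the estimation logic, with the Thai component counts and the constrained subtotals taken from the text rather than derived from a full statement of the phonotactics:

import math

# Standard Thai component counts from the text.
onsets, nuclei, codas, tones, stress = 38, 21, 10, 5, 2

# Naive multiplicative estimate, ignoring co-occurrence restrictions.
naive = onsets * nuclei * codas * tones * stress
print(naive)          # 79800

# Ceilings once the restrictions quoted in the text are applied.
unstressed = 648
stressed_short_vowel = 1520
stressed_long_or_diphthong = 22800
constrained = unstressed + stressed_short_vowel + stressed_long_or_diphthong
print(constrained)    # 24968, less than a third of the naive estimate

# Comparisons across languages are better made on a log scale, e.g. for
# Shosted's (2006) extremes:
print(round(math.log10(108400) / math.log10(115), 2))   # 2.44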
4. Frequency of types

If segments and sequential patterns differ in complexity, then it is reasonable to propose that the complexity of a phonological system varies according to the relative frequency of more versus less complex elements. That such frequency differences occur is relatively easy to demonstrate. For example, in many of the languages for which segment frequency counts are
available, the most frequent consonant is one of the most basic ones, such as /k/ as in Andoke (Landaburu, 1979), Kuku Yalanji (Patz, 2002), Hawaiian (Pukui and Elbert, 1979), and Northern Khmuʔ (Svantesson, 1983), or /t/ as in Russian (Kučera and Monroe, 1968), Japanese (Bloch, 1950), Tulu (Bhat, 1967) and Maori (Bauer, 1993). However, in the frequency counts for French in Delattre (1965) and Malécot (1974), as well as in more recent computational counts (Pellegrino, p.c.), the most frequent consonant is /ʁ/, which would be considered a Complex consonant in the scheme of Lindblom and Maddieson (1988) (as it is a voiced fricative with a ‘displaced’ articulation). Since a speaker of French is called on not merely to produce this Complex consonant but to produce it quite frequently, French might be considered more complex in this respect than German, say, which has a similar consonant in its inventory, but where the most common consonant is Basic /n/ (Kučera and Monroe, 1968). Similarly, the frequency of syllabic patterns also varies. Both Noon (Soukka, 2000) and Maybrat (Dol, 1999) seem to allow CVC as the maximally elaborate syllable type, but whereas in Noon this is the predominant pattern (at least in the case of monosyllabic stems, which form a very large part of the vocabulary), in Maybrat CV, not CVC, is the predominant syllable structure. In this respect, therefore, Noon might be evaluated as more complex than Maybrat. The problem is how to develop a generally applicable method of comparing a large number of languages, given that the data are incomplete for the majority of them. The frequency data cited above were not distinguished as to whether they were calculated on the basis of frequency in the lexicon or frequency in text. The two are broadly correlated but far from identical. However, a lexicon is available for more languages than a large body of reliably-transcribed texts, which is required to avoid the biases that text styles and content can introduce. Hence, lexical frequencies provide a better opportunity for large-scale comparison. The size of available lexicons is also very variable. When it is relatively small, an analysis of the total set of frequency patterns is impossible; the relative frequencies of types with lower frequency will be unreliably represented and rare types may be altogether absent. A feasible comparison might, therefore, be based only on the consideration of patterns which are among the most frequent. With respect to segments, it would be feasible to compile a count of, say, just the ten most frequent in lexical forms. This also provides a solution to the problem posed by the variation in the number of segments across
languages (the smallest phonemic inventory reported is 11 segments in Rotokas and Pirahã). A sample comparison is provided in Table 3. The columns on the left show the frequencies of the 10 most frequent consonant segments in English (British RP dialect), as reported by John Higgins of the University of Bristol, based on an analysis of the lexicon included in a popular learner’s dictionary (Hornsby et al., 1973). The columns on the right provide comparable data for Totonac extracted from the considerably smaller lexicon provided by Reid & Bishop (1974).

Table 3. Comparison of segment frequency by rank in English and Totonac

rank    English (RP)           Totonac
        segment  frequency     segment  frequency
1       t        34260         t        2138
2       s        33922         n        2123
3       n        31934         q        1629
4       l        27373         k        1529
5       ɹ        23069         l        1278
6       k        22453         ʂ        1032
7       d        21275         p         996
8       z        19972         m         994
9       p        15553         s         777
10      m        14823         tʃ        471
To compare these data, it is useful to calculate some kind of index. There are a number of ways this might be done. One possibility is to calculate a summed frequency × complexity score over the top ten segments, in which each segment contributes decreasingly according to its rank, and increasingly according to its complexity. A simple procedure would be to take the level of complexity of each segment suggested by Lindblom and Maddieson (1988), and multiply it by its rank, expressed as a decreasing decimal fraction of 1 (i.e. 1, 0.9, 0.8, 0.7, etc.). For instance, the segment /n/ of English would contribute 0.8 × 1 to the score and the segment /z/ would add 0.3 × 2, i.e. 0.6. If all of the top ten segments by frequency are Basic, then the score on this index would be 5.5. Assuming that /ɹ/ is Basic
(which is perhaps debatable), this variety of English scores just 5.8. On the other hand, Totonac has two segments in its ‘top ten’ which count as Elaborated, namely /q/ and /ʂ/. The score for this language is 7.1. This method is somewhat crude, but its appeal is that it can be easily applied to a large sample of languages despite differences in their inventory size and the quantity of lexical forms available for counting. A lexicon can also be used to calculate the relative frequency of syllable types. A simple procedure would be to calculate the proportion of each of the three categories of syllabic complexity used in Maddieson (2006, 2007) that occur in the lexicon. An index summing the proportions multiplied by the complexity level would then provide a more gradient estimate of a language’s complexity at the syllabic level. All languages whose (native) vocabulary contains only Simple syllables with the pattern (C)V (including syllabic nasals in the V position in some cases) would have an index of 1. This set would include, for example, Fijian, Yoruba and Guarani. But even at the next level of syllabic complexity, different languages would have a considerable range of values. Eight languages which allow only single coda consonants were compared with respect to the proportion of closed versus open syllables found in the lexicon available for counting. The languages are Yup’ik (Jacobson, 1984), Kadazan (Lasimbang & Miller, 1995), Gbaya (Blanchard & Noss, 1982), Darai (Kotapish & Kotapish, 1975), Mandinka (Tarawale et al., 1980), Lhasa Tibetan (Goldstein & Nornang, 1984), Comanche (Wistrand-Robinson & Armagost, 1990) and Wa (Yunnan Minzu Chubanshe, 1981). These languages either have no onset clusters or allow a limited set of common onset clusters (C + liquid or glide) which only occur infrequently. A syllable complexity × frequency index was calculated by multiplying the proportion of open syllables by 1 and the proportion of syllables with codas by 2 and summing the products. The closer the index is to 2, the greater the share of closed syllables in the lexicon. The mean number of syllables per word was also calculated. The results are shown in Table 4.
Table 4. Open syllable frequency in the lexicon of selected languages

Language          # syllables counted   % open syllables   index   mean syllables per word
Comanche               17485                 85.7            1.14          3.63
Lhasa Tibetan           4698                 84.5            1.16          1.84
Darai                   5597                 77.6            1.22          2.34
Mandinka                9795                 78.2            1.22          2.73
Gbaya                  10342                 75.8            1.24          2.35
Kadazan (Dusun)         5339                 63.0            1.37          2.64
Yup’ik                  7973                 52.7            1.47          3.40
Wa (Parauk)             3180                 22.7            1.77          1.00
Comanche, which allows no word-final codas, has an index not much above 1 (and it might be even lower if a different interpretation of the role of /h/ in the phonology were adopted), whereas Yup’ik approaches 1.5, and Wa is over 1.7. If the assumption that CVC syllables are more complex than CV syllables is correct, then it is reasonable to argue that Yup’ik or Wa is more complex than Comanche or Tibetan, because a larger proportion of their syllables have the more complex structure. It is interesting to note that these results cannot be predicted from word length. A lower proportion of (C)VC structures might be expected to require words to be longer in order to create sufficiently rich lexical resources. This expectation is not borne out by the comparison of the index and the mean number of syllables per word, as these are not significantly correlated with each other — regardless of whether or not the low word-length value for Wa (see below) is included. Comanche and Yup’ik have the greatest mean word length but are at opposite ends of the open syllable frequency index. Similarly, Tibetan and Wa have the two shortest mean word lengths but very different index scores. Rather, the relative frequencies of longer versus shorter words might be considered to be another independent, contributing factor in evaluating the complexity of phonological forms.
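Both indices used in this section reduce to a few lines of computation. The sketch below is an illustration rather than the procedure actually used for the tables: it reproduces the English rank-weighted score of 5.8 from Table 3 (with "r" standing for /ɹ/, treated as Basic, and /z/ the only Elaborated segment) and the Table 4 index from the percentage of open syllables:

def rank_index(ranked_segments, complexity):
    # Weight the ten most frequent segments 1.0, 0.9, ..., 0.1 by rank and
    # multiply each weight by the segment's complexity (default Basic = 1).
    weights = [round(1.0 - 0.1 * i, 1) for i in range(10)]
    return round(sum(w * complexity.get(s, 1)
                     for w, s in zip(weights, ranked_segments)), 2)

english_top10 = ["t", "s", "n", "l", "r", "k", "d", "z", "p", "m"]
print(rank_index(english_top10, {"z": 2}))   # 5.8, as in the text

def open_syllable_index(percent_open):
    # Open syllables weighted 1, closed syllables weighted 2 (Table 4).
    p = percent_open / 100.0
    return round(p + (1 - p) * 2, 2)

for language, pct in [("Comanche", 85.7), ("Yup'ik", 52.7), ("Wa", 22.7)]:
    print(language, open_syllable_index(pct))   # 1.14, 1.47, 1.77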
A major caution in using lexical lists must be noted. The choice of what form is entered as the lemma can have a major impact on the segment and syllable frequencies in the lexicon for the many languages in which wordforms vary according to case, tense, gender, phonological environment, etc. For example, in the dictionary by Pogonowski (1983), which was used to obtain frequency counts of Polish segments, the conventional choice of entering verbs in their infinitive form was made. The vast majority of infinitives end in the affricate /tɕ/, and hence this segment has a much higher frequency than if some other choice had been made. It accounts for over 12% of the codas in this count, whereas the next most frequent segment in this position, /t/, accounts for only about 3% of the codas. In some languages with elaborate polymorphemic structures, such as those in the Iroquoian family (see Chafe (1967) on Seneca or Doherty (1993) on Cayuga), more fundamental problems arise over what to consider a word and hence how to select any single form for a lexical entry. The choice of the form of lexical entries obviously also affects the word length calculation. For example, the compilers of the Wa dictionary consulted for this study made the decision to treat all lexical entries as consisting of one or more monosyllabic words. An orthography in use in Burma (Myanmar) instead writes a number of these items as disyllabic words (Watkins & Kunst, 2006), which would yield a mean word-length slightly greater than 1 syllable. A final comment on frequency is in order. It might be argued that as the frequency of any segment or pattern in a given language increases, speakers become more familiar with it, and this familiarity reduces the complexity of the item. While it is undoubtedly true that (over-)learned behaviors require less ‘effort’ than novel behaviors in various ways (e.g. less attention, less time), this does not constitute an argument that different learned behaviors are all equally complex. Any spoken language behavior is highly learned, but specific patterns can differ in the amount of muscular coordination required, the difficulty of identifying them, or other factors that can reasonably be considered as making one more complex than another.
5. Variability and transparency

A further plausible assumption about phonological complexity is that languages for which the patterns of variation in the phonology are more ‘transparent’ are simpler than those for which the variations are more arbitrary. For example, consider patterns of tonal variation in two Chinese languages. In Cantonese, the tone found in the isolation form of a word remains unchanged in context in most cases. The only regular phonological alternation is that the mid-high fall [53] becomes level high [55] when it
precedes a syllable with a high tone onset (Chao, 1947, Hashimoto, 1972) (although some more recent analyses describe this variation as free rather than conditioned). In Fuzhou, an Eastern Min variety, on the other hand, the nonfinal elements of a disyllabic or longer sequence often have a different tone from their isolation tones, and in some cases, the vowel will also differ (Maddieson, 1975, Chan, 1985). A couple of examples are given in Table 5, with tones represented by numbers where 5 is the highest. These changes in vowel quality and tone shape are not straightforwardly explicable in terms of adaptation to the phonological environment, for example via assimilation, as is the case for Cantonese. Surely in this respect, Fuzhou is more complex than Cantonese.

Table 5. Tone sandhi examples from Fuzhou Chinese

Isolation forms                       Combined form
sei 2 ‘ten’ + ŋuo 5 ‘month’           sei 2 ŋuo 5 ‘October’
sei 24 ‘lose’ + tsʰiu 22 ‘hand’       si 35 tsʰiu 22 ‘slip from hand’
The most serious attempt to establish a basis for comparison of a significant number of languages at the level of phonological alternation was the compilation of phonological rules undertaken for the Stanford Phonology Archive (see Vihman, 1977 for a general description), a major component of the Language Universals Project conducted at Stanford University under the direction of Joseph Greenberg and Charles Ferguson. Although a few studies were made of particular processes, such as reduplication (Moravcsik, 1978), or alternations between [d] and [ð] (Ferguson, 1978), no method of comparing the phonological systems in a more global way was developed and the data remained little exploited. The problem is a formidable one. Very likely the only practicable way to develop a measure would be to tightly circumscribe the scope of the comparison. A standardized basic vocabulary, say of 100 words, similar to the Swadesh list employed to get rough-and-ready genetic groupings, but perhaps more carefully selected, might be used. The number of variant forms of these words that arise strictly from different phonological conditions might then be countable, and the level of ‘transparency’ of the
processes which create these variants could also be rated. A combined score taking into account the number and nature of the variants could then be computed over this restricted wordlist. Unlike the situation with the other suggestions in this paper, no preliminary studies of this kind have been carried out, and a major problem would be to separate strictly phonological processes from those with a morphological or syntactic basis. This proposal, of course, begs the question of whether such a separation is actually possible.
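If such a wordlist measure were attempted, its computational core might look something like the following sketch. Everything in it is an invented placeholder — as just noted, no study of this kind exists — with each item of the standardized vocabulary paired with opacity ratings (0 = fully transparent process, 1 = arbitrary) for its phonologically conditioned variants:

# Each lemma maps to the opacity ratings of its phonologically conditioned
# variants; an invariant item has an empty list. All values are invented.
WORDLIST = {
    "water": [0.1],        # one variant, from a transparent process
    "hand": [0.2, 0.8],    # two variants, one from a fairly opaque process
    "stone": [],           # no variation
}

def transparency_score(wordlist):
    # Each variant costs 1 plus its opacity; the score is the mean per item,
    # so it rises with both the number and the arbitrariness of variants.
    penalties = [sum(1 + opacity for opacity in variants)
                 for variants in wordlist.values()]
    return round(sum(penalties) / len(wordlist), 2)

print(transparency_score(WORDLIST))   # 1.37 for these toy entries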
6. Defining and demonstrating complexity

In the discussion so far an essentially informal or intuitive sense of relative complexity has been appealed to. In this final section, I want to briefly consider the issues of seeking a more explicit definition, and of finding ways to demonstrate the appropriateness of a definition and directly measure the relevant properties. Methods that can be widely applied across different fieldwork settings will be especially considered. One basic understanding of the meaning of the word ‘complex’ is a straightforwardly quantitative one: any structure or system that contains more elements than another is the more complex of the two. This is easy to apply in comparing, say, the length of syllables or the number of consonants in an inventory. However, it does not answer many questions — for example, whether CCV or CVC is a more complex syllable pattern. A more comprehensive approach may be sought by making the common equation of complexity with difficulty. In this view a given linguistic element or pattern is more complex than another if it is more difficult to execute, more difficult to process, more difficult to learn, or more difficult to retain in memory. Difficulty can itself be hard to demonstrate, but there are a number of accepted ways of operationalizing a test for difficulty. One frequently used way to measure the difficulty of human tasks is to compare reaction times for different tasks or different variants of a task. A task that takes longer is assumed to be more difficult. Reaction time experiments can be designed for a range of linguistic execution and processing tasks, and can be adapted for use in remote locations and under a variety of cultural conditions. For example, whether CCV or CVC syllable structures are more difficult might be tested by picture identification or naming tasks using either appropriately structured real words or nonsense forms taught in pre-test training sessions. If the time to react to presentation of a target is slower in one case
than another, it may be taken as evidence of greater difficulty. For purposes of comparing complexity, it may not be necessary to demonstrate the source of the difficulty — problems of execution, recognition, or recall can all be taken as marks of higher complexity. Slowness of learning a task is also commonly equated with its degree of difficulty, and this perspective is often appealed to in explaining the order of acquisition of phonological contrasts and patterns by children. However, observing natural first-language acquisition is a demanding process. There are many idiosyncrasies in individual children and we are never likely to have broad enough data on the large number of languages which would be required in order to come up with, say, an overall scale of segmental difficulty covering all segment types and motivated by actual learning difficulty. However, ease of learning for adults can be explored with experimental paradigms in which the measure of difficulty is error rate or a similar variable. Demonstrations that arbitrary phonological alternations are harder to learn than regular ones have been made by Pycha et al. (2003, 2007) and Wilson (2003) using limited artificial languages. This approach draws on earlier experimental work with children, such as the famous ‘wug test’ of Jean Berko Gleason (1958). The protocols used in the artificial grammar learning paradigm seem adaptable enough to be employed outside the university settings in which they have been developed, and the example provided by research with children confirms that subjects do not need to be familiar with formal experimental settings to participate. A somewhat different notion of complexity might be based on measures reflecting amount of work done. Lindblom and colleagues (Lindblom & Moon, 2000, Davis et al., 1999) have considered whether levels of oxygen consumption can provide an index of effort levels during speech production. Whalen and his co-workers have proposed that brain activation levels revealed by fMRI can indicate which syllable structures are more complex than others (Whalen, 2007). These techniques demand technological tools that are not widely available or portable, but potentially provide direct measures of complexity independent of experimenter judgments. They are unlikely to be applied to speakers of a large number of languages, but could furnish a basis for generalization to untested conditions. One of the most problematic issues concerns how to integrate the various facets of phonetic and phonological complexity discussed here into more inclusive measures. At a simple level, the question might be taken as one of finding appropriate weights for each facet considered. However, assigning these weights is problematic. In an inventory, does increasing the number
of vowels contribute more complexity than increasing the number of consonants? One more vowel typically increases the number of potential syllables more than one more consonant does, but consonants typically require greater articulatory effort than vowels. With respect to phonotactics, are more constrained combinatorial patterns more complex than free combination? One contributes more constraints to learn, but the other increases the total number of possible syllables. These kinds of issues are unlikely to be resolved soon. In the meantime, the best procedure might be to treat each evaluated variable as equivalent to every other, assigning index values to languages at, above, and below an average complexity level, respectively, and then summing the results for an overall score. This procedure is crude, but readily achievable. As Nichols (2007) has shown, a similar procedure can be applied to incorporate measures of morphological or syntactic complexity into more global estimates of the complexity of individual languages.
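The crude aggregation suggested here is equally simple to state as code. In this sketch the variable names and figures are purely illustrative, not data from the paper; each variable is coded -1, 0 or +1 relative to the cross-language average, and the codes are summed:

def composite_score(values, means):
    # Code each variable -1 / 0 / +1 for below / at / above the
    # cross-language average, then sum across variables.
    def code(v, m):
        return 0 if v == m else (1 if v > m else -1)
    return sum(code(values[k], means[k]) for k in means)

# Purely illustrative averages and language profiles:
means = {"consonants": 22, "vowel_qualities": 6, "syllable_index": 1.3}
lang_a = {"consonants": 32, "vowel_qualities": 6, "syllable_index": 1.47}
lang_b = {"consonants": 13, "vowel_qualities": 5, "syllable_index": 1.14}
print(composite_score(lang_a, means))   #  2: above average overall
print(composite_score(lang_b, means))   # -3: below average on all three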
References

Akmajian, Andrew, Richard A. Demers, & Robert M. Harnish. 1979 Linguistics: An Introduction to Language and Communication. MIT Press, Cambridge, MA.
Alwi, Hasan, Soenjona Dardjowidjojo, Hans Lapoliwa and Anton M. Moeliono. 1998 Tata Bahasa Baku Bahasa Indonesia: Edisi Ketiga. Balai Pustaka, Jakarta.
Bauer, Winifred 1993 Maori. Routledge, London.
Berko Gleason, Jean 1958 The Child's Learning of English Morphology. Word 14: 150-177.
Bhat, D. N. S. 1967 Descriptive analysis of Tulu. Deccan College Postgraduate and Research Institute, Poona [Pune].
Bickerton, Derek. 2007 Language evolution: A brief guide for linguists. Lingua 117: 510-526.
Blanchard, Yves and Philip A. Noss 1982 Dictionnaire Gbaya-Français: Dialecte Yaayuwee. Centre de Traduction Gbaya, Mission Catholique de Meiganga et Eglise Evangélique Luthérienne du Cameroun, Meiganga.
Bloch, Bernard 1950 Studies in colloquial Japanese, IV: Phonemics. Language 26: 86-125.
Bouquiaux, Luc 1970 La Langue Birom (Nigeria septentrional): Phonologie, Morphologie, Syntaxe. Bibliothèque de la Faculté de Philosophie et Lettres de l'Université de Liège, Liège.
Chafe, Wallace L. 1967 Seneca Morphology and Dictionary. Smithsonian Institution, Washington, DC.
Chan, Marjorie K-M. 1985 Fuzhou Phonology: A Non-Linear Analysis of Tone and Stress. Ph.D. dissertation, University of Washington, Seattle.
Chao, Yuan-Ren. 1947 Cantonese Primer. Harvard University Press, Cambridge, MA.
Crothers, John. 1978 Typology and universals of vowel systems. In J. H. Greenberg (ed.), Universals of Human Language, Volume 2: Phonology. Stanford University Press, Stanford: 93-152.
Davis, J. H., B. Lindblom, R. Spina and Z. Simpson. 1999 Energetics in phonetics. Paper presented at Speech Communication and Language Development Symposium, Stockholm University.
Delattre, Pierre 1965 Comparing the Phonetic Features of English, German, Spanish and French. Julius Groos Verlag, Heidelberg.
Doherty, Brian 1993 The Acoustic-Phonetic Correlates of Cayuga Word-Stress. Ph.D. dissertation, Harvard University, Cambridge, MA.
Dol, Philomena 1999 A Grammar of Maybrat. University of Leiden, Leiden.
Dutton, Tom E. 1996 Koiari. Lincom Europa, München and Newcastle.
Fenk-Oczlon, Gertraud and August Fenk. 2008 Complexity trade-offs between subsystems of language. In Miestamo et al., 2008: 43-66.
Ferguson, Charles A. 1978 Phonological processes. In J. H. Greenberg, C. A. Ferguson, and E. A. Moravcsik (eds.), Universals of Human Language, vol. 3. Stanford University Press, Stanford: 403-442.
Gandour, Jackson T. and Richard A. Harshman. 1978 Crosslanguage differences in tone perception: A multi-dimensional scaling investigation. Language and Speech 21: 1-33.
Givón, T. and Bertram F. Maile, editors. 2002 The Evolution of Language out of Pre-language. John Benjamins, Amsterdam.
Goldstein, Melvyn C. and Nawang L. Nornang. 1984 Modern Spoken Tibetan: Lhasa Dialect, 3rd edition. Ratna Pustak Bhandar, Kathmandu.
Hashimoto, Oi-Kan Yue. 1972 Studies in Yue Dialects 1: The Phonology of Cantonese. Cambridge University Press, Cambridge.
Hornsby, A. S., E. V. Gatenby and H. Wakefield. 1973 Advanced Learner’s Dictionary of English (Oxford Text Archive edition). Longman, London.
Jacobson, Steven A. 1984 Yup’ik Eskimo Dictionary. Alaska Native Language Center, University of Alaska, Fairbanks.
Kawasaki-Fukumori, Haruko. 1992 An acoustic basis for universal phonological constraints. Language and Speech 35: 73-86.
Kotapish, Carl and Sharon Kotapish 1975 A Darai-English, English-Darai Glossary. Summer Institute of Linguistics, and Institute of Nepal and Asian Studies, Tribhuvan University, Kathmandu.
Kučera, Henry and George K. Monroe 1968 A Comparative Quantitative Phonology of Russian, Czech, and German. American Elsevier, New York.
Ladefoged, Peter 2006 A Course in Phonetics, Fifth Edition. Thompson/Wadsworth, Boston.
Landaburu, Jon 1979 La Langue des Andoke (Amazonie colombienne): Grammaire. Centre National de la Recherche Scientifique, Paris.
Lasimbang, Rita and John Miller 1995 Kadazan Dusun – Malay – English Dictionary. Kadazan Dusun Cultural Association, Kota Kinabalu. (Pre-publication version actually consulted.)
Li, Fang-Kuei 1977 A Handbook of Comparative Thai. University Press of Hawaii, Honolulu.
Lindblom, Björn and Ian Maddieson 1988 Phonetic universals in consonant systems. In C. Li and L. M. Hyman (eds.), Language, Speech and Mind. Routledge, London: 62-78.
Lindblom, Björn & Seung-Jae Moon 2000 Can the energy costs of speech movements be measured? A preliminary feasibility study. Journal of the Acoustical Society of Korea 19.3E: 25-32.
Maddieson, Ian and Kristin Precoda 1990 Updating UPSID. UCLA Working Papers in Phonetics 74: 104-114.
Maddieson, Ian 1975 The intrinsic pitch of vowels and tone in Foochow. San Jose State Occasional Papers in Linguistics 1: 150-161. (Proceedings of the Fourth California Linguistics Conference.)
1984 Patterns of Sounds. Cambridge University Press, Cambridge.
2005 Consonant inventories. In M. Haspelmath, M. S. Dryer, D. Gil, & B. Comrie (eds.), World Atlas of Language Structures. Oxford University Press, Oxford and New York: 10-13.
2006 Correlating phonological complexity: data and validation. Linguistic Typology 10: 108-125.
2007 Issues of phonological complexity: Statistical analysis of the relationship between syllable structures, segment inventories and tone contrasts. In M.-J. Solé, P. Beddor and M. Ohala (eds.), Experimental Approaches to Phonology. Oxford University Press, Oxford and New York: 93-103.
Malécot, André 1974 The frequency of occurrence of French phonemes and consonant clusters. Phonetica 29: 158-170.
Marsico, Egidio, Ian Maddieson, Christophe Coupé, and François Pellegrino. 2004 Investigating the “hidden” structure of phonological systems. Proceedings of the 30th Berkeley Linguistic Society Meeting: 256-267.
McWhorter, John H. 2001a The Power of Babel: A Natural History of Language. Times Books, New York.
2001b The World's Simplest Grammars Are Creole Grammars. Linguistic Typology 5: 125-166.
Miestamo, Matti, Kaius Sinnemäki and Fred Karlsson, editors 2008 Language Complexity: Typology, Contact, Change. John Benjamins, Amsterdam.
Moravcsik, Edith A. 1978 Reduplicative constructions. In J. H. Greenberg, C. A. Ferguson, and E. A. Moravcsik (eds.), Universals of Human Language, vol. 3. Stanford University Press, Stanford: 297-334.
Newman, Paul 2000 The Hausa Language: An Encyclopedic Reference Grammar. Yale University Press, New Haven and London.
Nichols, Johanna 1992 Linguistic Diversity in Space and Time. University of Chicago Press, Chicago.
2007 The distribution of complexity in the world’s languages. Paper presented at symposium on Approaches to Language Complexity, Annual Meeting of the Linguistic Society of America, Anaheim.
Ohala, John J. and Kawasaki-Fukumori, Haruko 1997 Alternatives to the sonority hierarchy for explaining segmental sequential constraints. In Stig Eliasson & Ernst Håkon Jahr (eds.), Language and Its Ecology: Essays in Memory of Einar Haugen. Mouton de Gruyter, Berlin: 343-365.
Patz, Elisabeth 2002 A Grammar of the Kuku Yalanji Language of North Queensland (Pacific Linguistics 526). Research School of Pacific and Asian Studies, Australian National University, Canberra.
Pogonowski, Iwo 1983 Dictionary Polish-English, English-Polish. Hippocrene Books, New York.
Pukui, M. K. and Elbert, S. H. 1979 Hawaiian Grammar. University Press of Hawaii, Honolulu.
Pycha, Anne, P. Nowak, E. Shin and R. Shosted 2003 Phonological rule-learning and its implications for a theory of vowel harmony. In G. Garding and M. Tsujimura (eds.), Proceedings of WCCFL 22. Cascadilla Press, Somerville, MA: 423-435.
Pycha, Anne, Eurie Shin, and Ryan Shosted 2007 Directionality of assimilation in consonant clusters: An experimental approach. Paper presented at Symposium Towards an Artificial Grammar Learning Paradigm in Phonology, Linguistic Society of America Annual Meeting, Anaheim, CA. http://www.linguistics.berkeley.edu/~shosted/dacc.pdf
Reid, Aileen A. and Ruth G. Bishop 1974 Diccionario Totonaco de Xicotepec de Juárez, Puebla. Instituto Lingüístico de Verano, Mexico.
Schwartz, J.-L., L.-J. Boë, N. Vallée, & C. Abry 1997 The dispersion-focalization theory of vowel systems. Journal of Phonetics 25: 255-286.
Shosted, Ryan K. 2006 Correlating complexity: a typological approach. Linguistic Typology 10: 1-40.
Sinnemäki, Kaius. 2008 Complexity trade-offs in core argument marking. In Miestamo et al., 2008: 67-88.
Sivertsen, Eva 1956 Pitch problems in Kiowa. International Journal of American Linguistics 22: 117-130.
Soukka, Maria 2000 A Descriptive Grammar of Noon: A Cangin Language of Senegal. Lincom Europa, München.
Stevens, Kenneth N. 1972 The quantal nature of speech: evidence from articulatory-acoustic data. In E. E. David & P. B. Denes (eds.), Human Communication: A Unified View. Academic Press, London: 51-66.
1989 On the quantal nature of speech. Journal of Phonetics 17: 3-45.
Svantesson, Jan-Olof 1983 Kammu Phonology and Morphology. (Travaux de l'Institut de Linguistique de Lund, 18.) Gleerup, Lund.
Tarawale, Ba, Fatumata Sidibe and Lasana Konteh 1980 Mandinka-English Dictionary. National Literacy Advisory Committee, Bathurst.
Vihman, Marilyn 1977 A Reference Manual and User's Guide for the Stanford Phonology Archive. Part I. Department of Linguistics, Stanford University, Stanford.
Watkins, Justin and Richard Kunst 2006 Writing the Wa language. On-line document at http://mercury.soas.ac.uk/wadict/wa_orthography.html
Watkins, Laurel 1984 A Grammar of Kiowa. University of Nebraska Press, Lincoln.
Whalen, Douglas H. 2007 Brain activations related to changes in speech complexity. Paper presented at symposium on Approaches to Language Complexity, Annual Meeting of the Linguistic Society of America, Anaheim.
Wilson, Colin 2003 Experimental investigation of phonological naturalness. In G. Garding and M. Tsujimura (eds.), Proceedings of WCCFL 22. Cascadilla Press, Somerville, MA: 533-546.
Wistrand-Robinson, Lila, and James Armagost 1990 Comanche Dictionary and Grammar. Summer Institute of Linguistics and The University of Texas at Arlington, Arlington.
Wolff, Hans 1959 Subsystem typologies and area linguistics. Anthropological Linguistics 1/7: 1-88.
Yunnan Minzu Chubanshe [Yunnan Minorities Publishing House] (Yan Qixiang, Zhou Zhizhi et al., compilers) 1981 Pug lai cix ding yiie si ndong lai Vax mai Hox (A Concise Dictionary of Wa and Chinese). Kunming.
Favoured syllabic patterns in the world’s languages and sensorimotor constraints

Nathalie Vallée, Solange Rossato and Isabelle Rousset

The general aim of this study is to investigate the linear organization of speech sound sequences in natural languages. Using a 17-language syllabified lexicon database (ULSID), we examine several preferred syllabic and lexical patterns and discuss them in light of data from speech production and perception experiments. First, we observe that some of the preferences found for tautosyllabic relations are consistent with the predictions of the Frame/Content Theory, while others extend them. We then investigate trends in the co-occurrence of prevocalic and postvocalic consonants that appear in the same syllable or in the onsets of two consecutive syllables. We find that the Labial-V-Coronal sequence is widespread in many lexica of ULSID. Our results not only confirm the presence of the so-called Labial-Coronal (LC) effect found by MacNeilage et al. (1999) in disyllabic words, but also show that it occurs in various syllabic patterns. Next, we report the results of two experiments which provide phonetic bases for the LC effect. Finally, we focus on consonant sequences involving a plosive and a nasal, showing that the preferred sequences are inconsistent with the sonority hierarchy. We claim that the disfavoured patterns can be predicted by a more complex gesture of the velum due to aerodynamic constraints. More broadly, to better understand the complexity of sound sequences in languages, we discuss the relationship between human sensorimotor capacities and phonology.
1. Introduction

Although there is no unambiguous definition of what a complex system is, some key aspects such as evolution, adaptation, interaction, dynamics and self-organization are common characteristics. These aspects are examined in studies of speech production, speech comprehension and speech acquisition, and they therefore make it possible to regard human speech processing as a complex system (for discussion of complexity in phonetics and phonology see the Introduction to this volume by Pellegrino, Marsico, Chitoran,
and Coupé). Speech is known to be structured at many different levels and, at each level, preferred types of organization are found, both within and across languages. There is no doubt that describing and analysing an organization common to many languages of different genetic groups allows us to gain a better understanding of such a complex system. This research attempts to answer the following questions: Why are some structures very common in the world’s languages, regardless of linguistic affiliation and geography? Where do these structures come from? How can they be explained?

Since Trubetzkoy’s classification of speech sound systems in the 1930s (Trubetzkoy, 1939), several typological studies have demonstrated that languages do not structure their sound systems arbitrarily. Following the approach introduced by Liljencrants and Lindblom (1972), typological analyses combined with advances in predictive models have shown that human languages do not make random use of the vocal apparatus and the auditory system to organize their phonological structures. Several studies have contributed to building a theory that predicts most of the vowel systems of the world’s languages from principles based on perceptual distinctiveness (Lindblom, 1986 and later; Schwartz, Boë, Vallée and Abry, 1997; Vallée, Schwartz, and Escudier, 1999; De Boer, 2000). To our knowledge, no general theory predicts consonant structures, though certain works have provided natural phonetic bases for preferences in consonant inventories. According to these works, consonant structures are shaped by acoustic and aerodynamic constraints that provide sufficient auditory contrast (Kingston and Diehl, 1994; Kingston, 2007; Mawass, 1997; McGowan, Koenig and Löfqvist, 1995; Ohala, 1983; Ohala and Jaeger, 1986; Stevens, 1989; 2003; Stevens and Keyser, 1989), and, for some of them, by visual contrast (Mawass, Badin and Bailly, 2000; Vallée et al., 2002; Boë et al., 2005). Moreover, physiological constraints regarding possible or impossible gestures have also been suggested (Lindblom and Maddieson, 1988).

Beyond the segmental level, languages structure the linear order of sounds at other levels, within syllables and grammatical units such as morphemes, words, clauses and sentences. It is widely accepted that the syllable is a structural unit of speech, in both production and perception. Derwing (1992) reported that the speakers of many languages are able to count the number of syllables of a word and obtain significant and consistent results in a syllabification task, even for words with geminated consonants. However, even if relevant studies have shown that speakers and listeners are generally
able to agree on the number of syllables in an utterance, sometimes their syllabifications differ (e.g. Content, Kearns and Frauenfelder, 2001). Be that as it may, many psycholinguistic studies have found that syllables arise spontaneously when speakers and listeners have to segment the speech stream (e.g. Morais, Cary, Alegria and Bertelson, 1979; Segui, Dupoux, and Mehler, 1990; Treiman, 1989; Sendlmeier, 1995; Treiman and Kessler, 1995). Moreover, the syllable is invoked in speech errors and word games (e.g. Treiman, 1983; Bagemihl, 1995; MacNeilage, 1998) and in secret languages, e.g. Verlan, derived from French (Plénat, 1995). It has also been proposed as a processing unit in visual word recognition (Carreiras, Álvarez and De Vega, 1993) and in handwriting production (Kandel, Álvarez and Vallée, 2006). In addition, neurophysiological works have suggested that the syllable is a central unit of language, specifically in its emergence, acquisition and function (MacNeilage and Davis, 1990; MacNeilage, 1998; MacNeilage and Davis, 2000). Despite its well-established role in all these studies, the syllable is an enigmatic linguistic unit. The question of its nature is still unresolved because its definition remains a problem at both the phonetic (see Krakow, 1999 for a review) and the phonological (Ohala and Kawasaki, 1984; Kenstowicz, 1994) level. Whatever the linguistic status of the syllable, the detailed examination of prevalent cross-linguistic sound combinations within syllables, and of combinations of syllables within morphemes and words, provides knowledge of the syllabic organization of segments on both the phonetic and the phonological level (Blevins, 1995), and can contribute to defining the syllable. As for sound systems that are found in a wide range of languages, attempts have been made to explain them in terms of human capacities of perception (distinctiveness), production (vocal gestures) or both (Kawasaki, 1982; Janson, 1986; Krakow, 1999; Redford and Diehl, 1999; Maddieson and Precoda, 1992; Lindblom, 2000).

In this paper, we present results on phonotactic regularities found in lexical and syllabic patterns from several languages. The study was conducted using ULSID (UCLA Lexical and Syllabic Inventory Database), partly provided by Ian Maddieson (Maddieson and Precoda, 1992). This database was created, like UPSID (UCLA Phonological Segment Inventory Database) (Maddieson, 1984; Maddieson and Precoda, 1989), for the understanding of speech sound structures in the world’s languages. ULSID is being developed in our laboratory and currently has syllabified lexicons from 17 languages which are representative of the major language families and fairly well distributed geographically. In line with Maddieson (1993:1):
“Recent loan words, especially those of wide international currency relating to modern technological, political or cultural concepts (‘telephone’, ‘democracy’, ‘football’), have been excluded wherever recognizable”. The database currently contains more than 90,000 words, from 2,000 in Ngizim to 12,200 in French. The mean number of words per language is 5,908. The other languages included in the study are Afar, Finnish, Kannada, Kanuri, Kwakw’ala, Navaho, Nyah kur, Quechua, Sora, Swedish, Thai, Vietnamese, Wa, Yup’ik and !Xòõ. Each lexical entry consists of an IPA transcription with marks indicating its syllabic structure, representing the following information: the division into syllables and, for each syllable, its conventional sub-syllabic components, such as onset and rhyme. In addition to Maddieson and Precoda’s (1992) database, other languages were included using similar sources of information. The syllabification was done either from published (printed or computer-readable) syllabified lexicons (French: BDLEX-Syll from BDLEX 50.000, Pérennou and de Calmès, 2002; Swedish: Berlitz, 1981) or manually by at least two native speakers of the language (as in Vietnamese). The lexical entries consisted of lemmas only.

We investigated the sound organization of lexical units by considering syllable boundaries. Adjacent sounds in the same syllable (tautosyllabic) and those of two consecutive syllables within lexical items were examined according to the nature of the segments involved. Clear patterns emerged from the statistical analyses and showed that syllables, like phonemes, are not arbitrary linguistic units. Three favored combinations of C and V were found by computing ratios between observed and expected values in CV syllables, but also in both ( )CV( ) and ( )VC( ) templates, with a stronger link in the latter. This provides evidence favoring sequences where the articulators do not make extensive movements between the consonant and vowel gestures. The results, presented in Section 2, are discussed with respect to the predictions of the Frame/Content Theory (MacNeilage, 1998). We also confirmed the presence of the “LC effect” (MacNeilage and Davis, 2000) in our data. This effect refers to a strong preference for a specific intersyllabic pattern: disyllabic words more often start with a labial stop consonant preceding the vowel nucleus, itself followed by a coronal stop consonant (MacNeilage et al., 1999). In Section 3 we show that this tendency, strongly attested in ULSID, occurs in several syllabic patterns. We also outline the results of two recent experimental studies, suggesting an articulatory explanation of the LC effect and a perceptual correlate of it in French adults.
The final point concerns the asymmetry we observed in sequences involving a nasal (Nasal+Plosive (NP) and Plosive+Nasal (PN) consonant sequences). We report results from a preliminary study using EMA® (Carstens, electromagnetic articulography) and EVA® (S.Q.LAB) measurements (Rossato, Badin and Bouaouni, 2003). They suggest a possible explanation for the preference for the VNP pattern (V = syllable nucleus) over PNV (even though the latter follows the Sonority Sequencing Principle); these results are discussed in Section 4. Thus, our work attempts to provide explanatory reasons for several preferred sound sequences, showing that the widespread patterns in many natural languages are shaped by aerodynamic, articulatory and/or perceptual constraints. More precisely, it reveals a preference for simple speech movements: languages favour patterns that require less effort, while providing sufficient perceptual distinctiveness, over those that require more. More broadly, our findings are relevant to the relationship between sensorimotor capacities and phonology.
2. Interaction between tautosyllabic segments

Our first objective was to estimate the role of the syllabic frame in the lexical organization of ULSID’s languages. According to MacNeilage’s Frame/Content Theory (1998), the universal alternation of consonants and vowels in speech sound sequences is a consequence of jaw movement. Consonants and vowels are articulated during one of the two phases of the mandibular oscillation (raising and lowering of the jaw): consonants are produced during the closing of the mouth, while vowels are articulated during the opening phase. “Pure frames” are CV syllables produced by only one basic movement of the jaw (elevation then lowering), without any action of other articulators throughout the syllable; they are the simplest syllables and correspond to the most economical sound sequences. According to MacNeilage and Davis (2000), “pure frames” are the most frequent CV-like articulations in babbling and the most frequent syllables in first words, languages, and proto-languages. Are the simplest syllables (in the sense of MacNeilage’s theory) the most widespread in ULSID’s lexicons? What are the favored tautosyllabic sequences in ULSID?

The database contains almost 250,000 syllables. We first looked at the most common type of syllable across all languages. A basic CV structure accounted for almost 54% of ULSID’s syllables. To facilitate the observation
of syllabic content, we created co-occurrence matrices. In order to be able to compare our results with previous studies (MacNeilage, 1998; MacNeilage and Davis, 2000), we grouped segments according to their phonetic features (Vallée et al., 2002): six manners (plosive, fricative, nasal, affricate, approximant and trill/tap/flap), ten places of articulation for consonants (bilabial, labio-dental, coronal, palatal, labial-palatal, labial-velar, velar, uvular, pharyngeal, glottal), and three places of articulation for vowels (front, central, and back). The coronal class grouped apical consonants together, i.e. dentals, alveodentals, alveolars, postalveolars and retroflexes (following Keating, 1990). Vowel height was not considered here because we did not find any clear tendencies between vowel aperture and consonant manner. We compared the observed frequency of each syllable with the expected frequency estimated from its segments: for example, the frequency of C1V1 was divided by the product of the individual frequencies of C1 and V1. These ratios, calculated for each language on an individual basis, showed which types of syllable were favored (ratio > 1) in our data (see Table 1 for Afar and Table 2 for Thai).

Table 1. Ratios between observed and expected frequencies of CV syllables in Afar.

Afar       Coronal   Bilabial   Velar
Front      1.16      0.68       0.84
Central    0.88      1.22       0.98
Back       0.99      1.08       1.31
Table 2. Ratios between observed and expected frequencies of CV syllables in Thai.

Thai       Coronal   Bilabial   Velar
Front      1.07      0.99       0.71
Central    0.94      0.94       1.08
Back       1.01      1.04       1.22
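The ratio computation described above is easy to make concrete. The following sketch (in Python; the data format and class labels are our illustrative assumptions, not ULSID’s actual coding) derives a matrix like Tables 1 and 2 from a list of (onset place, nucleus place) pairs, one per CV syllable token:

```python
from collections import Counter

def cv_ratios(cv_pairs):
    """Observed/expected ratios for (onset place, nucleus place) combinations."""
    pairs = list(cv_pairs)
    n = len(pairs)
    pair_counts = Counter(pairs)
    onset_counts = Counter(c for c, _ in pairs)
    nucleus_counts = Counter(v for _, v in pairs)
    ratios = {}
    for (c, v), observed in pair_counts.items():
        # Expected token count if onset and nucleus places combined independently.
        expected = onset_counts[c] * nucleus_counts[v] / n
        ratios[(c, v)] = observed / expected
    return ratios

# Toy lexicon: a ratio > 1 marks a favoured combination.
demo = ([("coronal", "front")] * 30 + [("coronal", "back")] * 12 +
        [("bilabial", "central")] * 25 + [("bilabial", "front")] * 10 +
        [("velar", "back")] * 20 + [("velar", "central")] * 6)
for cv, r in sorted(cv_ratios(demo).items()):
    print(cv, round(r, 2))
```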
Three favored onset-nucleus combinations emerged for Afar: coronal-front, bilabial-central and velar-back (Table 1), whereas in Thai, velar-back was clearly preferred and coronal-front was slightly favored (Table 2). The mean value of the ratios for the 17 languages was then worked out for each simple CV combination. Three favored onset-nucleus patterns emerged: coronal-front, bilabial-central and velar-back (Table 3). All three favored CV patterns occurred together in seven languages, while both the bilabial-central and the velar-back combinations were found in twelve languages; the coronal-front was found in eleven. Our results were consistent with previous observations in the babbling utterances of infants, and in a database of ten natural languages which shares French and Quechua data with ULSID (MacNeilage and Davis, 2000; MacNeilage, Davis, Kinney and Matyear, 1999).

Previous studies observed only simple CV sequences, so we decided to examine whether the presence or absence of a coda influences the relationship between the onset and the nucleus in a syllable. The observed/expected ratios for onset-nucleus combinations were calculated for CVC syllables in each language. Figure 1 shows the mean onset-nucleus ratios across languages obtained on CV together with CVC syllables, on simple CV syllables, and on CVC syllables. No significant influence of the coda on the preferred onset-nucleus pattern was observed: all three favored co-occurrence patterns between onsets and nuclei were also found in CVC structures.

Table 3. Mean value of the 17 ratios between observed and expected frequencies of CV syllables (χ² significant, p < 0.001 for each column).

CV syllables   Coronal   Bilabial   Velar
Front          1.15      0.82       0.90
Central        0.92      1.11       0.99
Back           1.02      1.01       1.19
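The table captions report χ² significance without spelling out the test; one plausible reading, sketched here with invented counts, is a goodness-of-fit test of each column’s observed counts against the counts expected under segmental independence:

```python
from scipy.stats import chisquare

# Invented counts for one table column (e.g. coronal onsets with front,
# central and back nuclei); both lists must sum to the same total.
observed = [300, 240, 265]
expected = [260, 272, 273]  # from the independence model sketched above
result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)
```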
We extended our analysis to the interaction between nuclei and codas in the VC syllabic structure. VC syllables accounted for only 2.5% of ULSID’s syllables. Although VC syllables were disfavored, Table 4 shows the same relationship as in CV structures: the three favored co-occurrence patterns were front-coronal, central-bilabial and back-velar.
All three patterns occurred together in five languages; the central-bilabial combination emerged in eleven languages, the front-coronal was found in eleven and the back-velar in nine. This result demonstrates that the preferred VC sequences were “pure frames”, like the preferred CV sequences.

Figure 1. Mean observed/expected ratios across individual languages according to the place of the onset: bilabial (upper left), coronal (upper right) and velar (bottom row). The x-axis indicates the place of the nucleus: front, central, back. Each group of three bars represents the mean ratio for, respectively, CV together with CVC syllables (left bar), CV syllables (middle bar) and CVC syllables (right bar).
We also noticed that the link between vowel and consonant tended to be stronger in VC than in CV syllables. This result is in line with the concept of the rhyme proposed by phonologists, who claim that
the structure of the syllable is hierarchical rather than linear (Selkirk, 1982). In this approach, the relationships between the syllabic constituents are represented by a tree-like diagram in which the nucleus and the coda are linked together at a node located at the same structural level as the onset. Blevins (1995:214-215) makes a strong case for the rhyme constituent based on i) syllable weight, ii) restrictions on the number of segments occurring in the syllable rhyme, and iii) the close relationship between nucleus and coda. Bagemihl (1995:707), reviewing the interaction of language games with linguistic theory, points out that the rhyme is one of the linguistic units manipulated in word games. All of these observations converge to substantiate the rhyme as a subsyllabic constituent.

Table 4. Mean ratios across the individual languages between observed and expected frequencies of VC syllables (χ² significant, p < 0.001 for each column).

VC syllables   Front   Central   Back
Coronal        1.24    0.91      0.95
Bilabial       0.54    1.28      1.13
Velar          0.76    0.84      1.37
Regarding the stronger effect of the frame on VC syllables, the properties of the jaw cycle could help explain this trend. Redford (1999) points out inherent asymmetries in the jaw cycle. From experiments carried out in the framework of the Frame/Content Theory, she shows that the closing phase of the mouth was articulated with greater velocity peaks and shorter durations than the opening phase. In complex syllables, the opening phase was articulated with greater displacement (the distance between minimum and maximum opening within a phase) than the closing phase. Finally, the degree of articulatory stiffness (the slope between distance and velocity) was smaller in the opening phase than in the closing phase. If these trends were confirmed in additional cross-linguistic studies, they could explain the following cross-linguistic preferences: vowels and consonants are more strongly connected in VC than in CV sequences; CV sequences are more frequent than VC or CVC sequences; single consonants are preferred over consonant clusters or complex consonants; and clusters or complex consonants are more frequent in onset position than in coda position. Redford claims that
“… the mechanical and temporal constraint of the jaw cycle is manifested in the phonological and phonetic sound patterns that are perceived as syllables. […] jaw cycle may provide an articulatory basis for the syllable in language” (Redford 1999:25). According to this view, the dominance of the jaw cycle in adult speech, as described by MacNeilage in the Frame/Content Theory, does not seem to be weakened by within- and across-language phonological asymmetries and co-occurrence constraints in VC patterns. On that account, our results are in line with MacNeilage and Davis’s (2000) study. The languages of ULSID favor pure-frame syllables (over 30% of all ULSID syllables), which correspond to economical sound patterns. According to MacNeilage (1998), such less complex syllabic patterns originate from a simple mandibular cycle (a frame), with the vowel and the adjacent consonant adopting quite similar tongue positions.

Our final analysis examined tautosyllabic combinations between nuclei and consonants in the ( )CV( ) and ( )VC( ) templates (the brackets indicate that one or more consonants can appear in this position). All the syllabic structures of the 17 languages were considered (except the minimal syllable V, which accounted for less than 5% of ULSID’s syllables). We conducted statistical analyses taking into account the prevocalic consonant and the postvocalic one, as long as they appeared in the same syllable. Our results (Tables 5-6) clearly showed that front vowels were strongly related to coronals, central vowels to bilabials and back vowels to velars. Thus, regardless of the complexity of the syllabic structures (the degree of complexity is estimated from the number of consonants in onset or coda position), languages seem to favor consonant-vowel or vowel-consonant combinations where the tongue does not make a great displacement in the front-back dimension between the vowel gesture and the gesture of the immediately preceding or following consonant. Our analyses confirmed that the syllabic frame is one aspect of the sound organization of lexical units. As a next step, we will extend our analyses to other consonants, grouping bilabials and labiodentals in a labial category, and dorsopalatals, uvulars and velars in a dorsal category (since we noticed that palatal consonants were very frequently combined with back vowels, as in Vietnamese; observed/expected ratio = 1.9).
Table 5. Mean value of the 17 ratios between observed and expected frequencies of ( )CV( ) combinations (χ² significant, p < 0.001 for each column).

CV patterns   Coronal   Bilabial   Velar
Front         1.09      0.94       1.05
Central       0.87      1.10       0.70
Back          1.01      0.93       1.32
Table 6. Mean value of the 17 ratios between observed and expected frequencies of ( )VC( ) combinations (χ² significant, p < 0.001 for each column).

VC patterns   Front   Central   Back
Coronal       1.11    1.00      0.89
Bilabial      0.53    1.32      1.18
Velar         0.92    0.86      1.22
3. The Labial-Coronal Effect

We explored other prevocalic and postvocalic consonants that appeared in the same syllable, or in two consecutive syllables, within lexical items. For structures with less complex onsets and codas, we focused on the relation between the initial and final consonants in CVC syllables, or between the consecutive onsets in CV.CV words (the dot indicates the syllable boundary). We observed that various places of articulation for the two consonants were favored. For example, [pap] and [tat] syllables were less frequent than [tap] or [pat], although coronals were the most frequent consonants in both onset and coda positions (Rousset, 2004). In the favored patterns, there was no place repetition between onset and coda in CVC structures. The mean observed/expected ratios for ULSID’s languages were below 0.9 for CVC syllables in which the two consonants
shared the same place of articulation. This result contrasts with the place assimilation between C and V found in both the favored CV and the favored VC structures. The most widespread CVC patterns in ULSID’s lexica were, in descending order: Bilabial-V-Coronal, Coronal-V-Velar, Coronal-V-Bilabial, Velar-V-Coronal. While no difference was observed between the ratios of the two inverse patterns Coronal-V-Velar and Velar-V-Coronal (observed/expected ratios = 1.18 and 1.2, respectively), this was not the case for Bilabial-V-Coronal and Coronal-V-Bilabial (ratios = 3 and 1.13, respectively). We observed the same findings for sequences of two consecutive open syllables, except that coronal repetition was the most common CV.CV sequence. The predominance of the Bilabial-V-Coronal-V combination over the Coronal-V-Bilabial-V pattern is strongly attested in CV.CV sequences.

We calculated the same ratios as for CVC patterns and found an LC effect in 15 languages (at statistically significant levels for 13 of them, p < 0.005), either in CVC structures (12 languages, significant for 9) or in CV.CV sequences (14 languages, significant for 13), regardless of their position within the word. In 3 languages (Kanuri, Navaho, Thai), the trend was present only in disyllabic patterns and was not found in CVC syllable structures. The Vietnamese lexicon has few compound words with the CV.CV structure (fewer than one hundred); its ratio of LC to CL was 0.92. In addition, Wa and !Xòõ had no LC effect in any of the observed structures (Wa has only CVC). Finally, French, Finnish and Quechua had a stronger LC effect in intra-syllabic patterns than in disyllabic sequences, even though their basic syllabic structure is CV. When we took into account only the bilabial consonants, the mean ratio of LC to CL was 2.77 for all the disyllabic words of ULSID’s languages; with labiodentals included among the labials, the ratio was 2.79. The mean values of LC/CL as a function of syllable type and its position within the word were also examined. Table 7 shows that the trend appeared not only at word onset but also elsewhere in words, in both intersyllabic and intrasyllabic patterns. Nevertheless, the mean ratio was higher for disyllabic words and CV.CV disyllabic sequences at word onset position. These results confirmed the presence in ULSID of the so-called Labial-Coronal (LC) effect (MacNeilage and Davis, 2000). According to the authors, there is a strong preference for a specific intersyllabic pattern such that the onset of the first syllable of a disyllabic CV.CV word is a labial stop consonant and the onset of the second syllable is a coronal stop consonant. This pattern is absent during babbling, but appears during the first word stage (MacNeilage, Davis, Kinney and Matyear, 1999). This trend is
so strong that infants produce LC patterns even when the words in the adult language are CL. The authors report, for example, that “soup” was pronounced with the inverse sequence “pooch” [pu:t]. In a review of seven previous studies, MacNeilage and Davis (1998) noted the predominance of LC patterns in infants from five different language communities (French, American-English, Dutch, Spanish, Czech). In MacNeilage et al. (1999), the authors observed both CVC and CVCV words produced by 21 English-speaking infants during the first-50-word stage (12-18 months). They calculated the overall occurrences of the two opposite patterns and found a ratio of LC to CL equal to 2.55. They also observed the LC effect in samples of words from ten natural languages; in this cross-linguistic study, the mean ratio of LC to CL was 2.23. An earlier study by Locke (1983) observed this effect of "anterior-to-posterior progression" in the utterances of children and also in the vocabularies of English and French.

Table 7. Mean values of LC/CL ratios for all word lengths as a function of syllable type and position within the word: (1) bilabial consonants; (2) all labial consonants (bilabials + labiodentals).

Position     Word Onset           Elsewhere
Sequences    CVC      CV.CV       CVC      CV.CV
LC/CL (1)    1.53     2.41        1.68     1.75
LC/CL (2)    1.73     2.28        1.89     1.68
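Counting the two competing patterns is straightforward. A minimal sketch, assuming dot-syllabified transcriptions and purely illustrative segment classes (ULSID’s actual feature coding is richer):

```python
LABIALS = {"p", "b", "m"}
CORONALS = {"t", "d", "n"}

def lc_cl_ratio(words):
    """Ratio of Labial-V-Coronal to Coronal-V-Labial onsets in CV.CV words."""
    lc = cl = 0
    for word in words:
        syllables = word.split(".")
        if len(syllables) != 2 or not all(syllables):
            continue  # keep only disyllabic forms
        c1, c2 = syllables[0][0], syllables[1][0]
        if c1 in LABIALS and c2 in CORONALS:
            lc += 1
        elif c1 in CORONALS and c2 in LABIALS:
            cl += 1
    return lc / cl if cl else float("inf")

print(lc_cl_ratio(["pa.ta", "ba.da", "ma.ni", "ta.pa", "ka.la"]))  # -> 3.0
```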
MacNeilage and Davis (2000) propose developmental arguments, based on the biomechanical properties of the vocal apparatus, to explain the LC effect in infants and adults. The authors place the LC preference within the context of the Frame/Content Theory. Findings from investigations in neurophysiology and clinical neurology led them to suggest that the LC effect could be a consequence of articulatory properties and, more precisely, a consequence of selecting the simplest gestures first. They argue that LC patterns begin with a less complex task than CL patterns because labials are easier to produce than coronals: labials are produced with a basic movement of the jaw, while coronals require an additional movement of the tip of the tongue. It can be assumed that velars, which involve both the
backward movement of the tongue body and the raising of the tongue dorsum toward the soft palate, are also more complex to articulate than labials. In addition, many previous studies on tendencies in babbling and early speech across languages have observed that the emergence of the velar closing phase followed the emergence of both labial and coronal closures (Locke, 1983; Davis and MacNeilage, 1994; Robb and Bleile, 1994; Stoel-Gammon, 1985). However, comparing three articulatory models, Vilain, Abry, Brosda and Badin (1999) showed that the frame can produce a labial or a coronal closure, the place of the closure depending on the position of the tongue at the beginning of the mandibular cycle. In that study, then, a coronal closure seemed just as easy to produce as a labial one.

In a recent study, Rochet-Capellan and Schwartz (2005a,b) proposed an original explanation of the LC effect. They analyzed the gestural overlap of the consonant closures (both occlusion and constriction) in a task of accelerated repetition of a given CVCV sequence (V = /a/, C = {/p/, /t/, /f/, /s/}) pronounced by thirty-two French subjects. The articulatory and acoustic measurements revealed that accelerating the repetition put LVCV sequences in phase with a single jaw-opening gesture; in contrast, there were two jaw cycles when producing CVLV sequences (one cycle for each syllable). At a slow speech rate (2.5 Hz), the lower lip and the tongue tip in LVCV sequences were in phase with the mandibular gestures, but acceleration to five cycles per second prompted a loss of phasing for the tongue apex. At a fast rate, the labial release coincided with the preparation (launching) phase and the coronal release with the jaw-lowering phase. Their results show that LVCV sequences involve more economical mandibular gestures than CVLV sequences, because there is stronger articulatory cohesion between the jaw oscillations and the lip and tongue movements in LVCV sequences. The authors conclude that the overlap of the two consonantal gestures is easier because, in LC patterns, the coronal closure can be prepared during the labial gesture, making LC simpler to produce than CL.

Another recent study, exploiting the Verbal Transformation Effect (Warren, 1961), is in line with this idea (Sato, Vallée, Schwartz and Rousset, 2007). The Verbal Transformation Effect refers to the perceptual changes experienced while listening to a speech form cycled in rapid and continuous repetition. Sato et al. assumed that LC percepts would be more stable (lasting longer before the next transformation) than CL percepts. They proposed two experiments using either voiced or voiceless plosive
consonants, which confirmed a greater stability and attractiveness of LC percepts. Several recorded CV syllables with C = {/p/, /t/, /b/, /d/} and V = {/a/, /i/, /o/} were paired with respect to both the vowel quality and the ±voiced consonant feature. The pairs of CV syllables were selected meticulously, taking into account their acoustic similarities (duration, intensity, formant and pitch values). The disyllabic stimuli were /pata/, /tapa/, /piti/, /tipi/, /poto/ and /topo/ for experiment A, and /bada/, /daba/, /bidi/, /dibi/, /bodo/ and /dobo/ for experiment B. For each stimulus, different lexical factors were calculated in order to rule out a possible lexical effect on verbal transformations: the disyllabic frequencies, the bigram frequencies, and the neighborhood density. The twenty-four participants heard disyllabic LVCV or CVLV sequences (such as "pata" or "tapa") repeated 300 times with no interval between repetitions. They had to report what they perceived as soon as the sequence seemed to change into another form (even if it changed into one they had heard previously). The authors estimated perceptual stability by measuring the time spent perceiving a given form before switching to another one. The results showed that the stability durations of /pV.tV/ were significantly higher than those of /tV.pV/ sequences, regardless of the stimulus, with /pV.tV/ forms on average 1.40 times more stable than /tV.pV/ forms. The same trend was observed for the patterns involving voiced plosives: the stability durations of /bV.dV/ were significantly higher than those of /dV.bV/ sequences, /bV.dV/ being on average 1.36 times more stable. Sato et al. noted that, in the French lexicon, the number of lexical entries beginning with /p/ is twice as high as the number beginning with /t/, whereas the number of lexical entries beginning with /b/ is half the number beginning with /d/. Thus, they observed a clear perceptual preference for LC forms over CL forms, even when the disyllabic patterns heard by the subjects contained voiced consonants. Although this finding cannot completely account for the LC effect in languages, it clearly indicates that in a (…)CLCLCLCLC(…) sequence, the listener more naturally segments the stream into LC chunks. This suggests a possible perceptual correlate of the LC effect in French adults.
4. Consonant sequences with nasal and plosive

Languages with similar phoneme inventories may have distinct phonotactic and distributional patterns for shaping syllables, morphemes,
words and utterances. Within syllables, it is generally accepted that speech sounds are linearly ordered according to their intrinsic sonority, with the nucleus as the most sonorous element. Although the syllabic constituents of languages can be organized in a hierarchical structure based on the famous Sonority Sequencing Principle (SSP), there is still no accepted universal scale of sonority. As pointed out by Ohala and Kawasaki-Fukumori (1997), the main reason is that there is no clear definition of sonority, not even one based on phonetic properties, such as the correspondence between jaw openness and the sonority scale suggested by Lindblom (1983). On the other hand, it is well established that the SSP is unable to account for some sound sequences within and across languages (see Blevins (1995:211) for /sp st sk/ onsets in English; Broselow (1995:177) for prenasalized stops in syllable onsets; Montreuil (2000) for negative-slope sonority in two-consonant onset clusters in Raeto-Romance and Gallo-Italic, and also in Slavic). Most of the proposed sonority scales or other sound hierarchies (from Saussure, 1916 to Steriade, 1982; Selkirk, 1984; Anderson and Ewen, 1987; Klein, 1993; Angoujard, 1997) fail to predict Nasal+Plosive sequences at the onset of syllables, and instead predict the inverse pattern Plosive+Nasal (see Clements (1990) for a detailed and well-documented discussion of the role of sonority in syllabification). As a result, the SSP cannot explain the observed combinations of nasals and plosives.

In order to account for the favoured sound sequences involving nasals and plosives, we investigated the phonetic properties of the segments in the sequences. We attempted to estimate the role of both articulatory and aerodynamic factors in preferred and non-preferred sound sequences with a plosive and a nasal in ULSID’s languages. First, the distributions of both liquids and nasals with adjacent plosives were analyzed within words and within syllables. Our choice to observe liquids, alongside nasals, adjacent to plosives was justified in that the sonority rankings of nasals and liquids are similar (irrespective of the selected sonority scale) while their articulatory features are very different. The distributions of both Plosive+Nasal and Plosive+Liquid sequences within words and within syllables, and of the inverse combinations (Nasal+Plosive and Liquid+Plosive), were surveyed using 15 syllabified lexicons from the ULSID database. The segments under analysis were described in terms of manner: Plosive (P), Fricative (F), Nasal (N), and Liquid (L) (‘Liquid’ groups voiced lateral approximants and rhotics). The number of observed syllables was compared to the number of expected syllables. For example, the number of expected NPV( ) syllables was calculated using the number of CCV( ) syllables weighted by both the
probability of occurrence of the Nasal manner and that of the Plosive manner. The same calculation was carried out for all the sequences under analysis. Thus, the number of expected NPV( ) sequences was estimated using the number of CCV( ) sequences in each syllabified lexicon. Within syllables, both complex onsets and complex codas were analyzed (complex syllables accounted for 2.7% of all ULSID’s syllables). As no complex syllables were observed in their lexicons, neither !Xòõ nor Yup’ik was considered in this analysis. Three other languages were also ruled out (Afar, Navaho and Ngizim) because the expected frequencies for a given syllable type were too low, generating artefactually large differences (when the ratio of observed to expected frequencies is calculated, a small absolute difference between low values yields a large difference in the resulting ratio). We therefore needed a sufficient number of complex syllables to compute the ratio and pursue the analyses: when a lexicon contained fewer than 30 complex syllables, it was excluded. By this criterion, 9 languages with complex onsets (Finnish, French, Kannada, Kanuri, Nyah kur, Quechua, Sora, Thai, Wa) and 5 languages with complex codas (Kanuri, Nyah kur, Quechua, Thai and Wa) were retained for the analysis. It is worth noting that Kwakw’ala had complex codas without complex onsets.

Table 8 presents the mean values of the ratios observed in tautosyllabic sequences involving either a nasal or a liquid contiguous to a plosive consonant. The results revealed that the patterns involving a liquid consonant followed the SSP: PLV sequences were mainly favored, while LPV sequences were not found in ULSID’s languages. Complex codas with falling sonority (VLP) were more widespread than VPL rhymes; the latter pattern was found only in the French lexicon, without however being favored (ratio = 0.52 < 1). We observed different patterns in the distribution of nasals. PN sequences in onset position were rare, but followed the SSP. Most of the languages studied did not have NPV sequences, except two (Kanuri and Nyah kur); moreover, both of these languages favored this type of sequence (the ratios were 3.24 and 1.65, respectively). Some NPV syllables were present in Ngizim, with 11 tokens out of its twenty CCV syllables. With respect to complex codas, VPN sequences were disfavored while VNP sequences were favored, in accordance with the SSP.
Table 8. Mean value of the observed/expected ratios for each tautosyllabic combination. Underscores indicate the position of the consonant (nasal or liquid) in the sequence. The sequences predicted by the SSP are those in the P_V and V_P columns.

          Complex onset (9 languages)    Complex coda (5 languages)
          P_V       _PV                  VP_       V_P
Nasal     0.01      0.54                 0.03      3.89
Liquid    6.93      0                    0.10      1.66
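The observed/expected computation for clusters can be sketched in the same way. The text above does not specify exactly how the manner probabilities were estimated; the version below, an assumption made for illustration, estimates them from the cluster slots themselves:

```python
def cluster_ratio(ccv_clusters, m1, m2):
    """Observed/expected ratio for an m1+m2 complex onset (e.g. Nasal+Plosive).

    ccv_clusters: one (manner, manner) pair per CCV( ) syllable token,
    with manner labels such as "P", "F", "N", "L".
    """
    n = len(ccv_clusters)
    observed = sum(1 for pair in ccv_clusters if pair == (m1, m2))
    # Probability of each manner in a cluster slot, estimated from the data.
    slots = [m for pair in ccv_clusters for m in pair]
    p1 = slots.count(m1) / len(slots)
    p2 = slots.count(m2) / len(slots)
    expected = n * p1 * p2  # CCV count weighted by both manner probabilities
    return observed / expected if expected else float("nan")

# Toy data: NP onsets occur three times more often than independence predicts.
clusters = [("P", "L")] * 40 + [("L", "P")] * 2 + [("N", "P")] * 6 + [("P", "N")] * 2
print(round(cluster_ratio(clusters, "N", "P"), 2))  # -> 3.0
```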
Analyses were extended to Nasal+Plosive and Liquid+Plosive sequences (and the opposite combinations) across syllable boundaries. In the 15-language ULSID database, Wa has a monosyllabic lexicon and !Xòõ has too few clusters, so both were excluded from the following analysis. For each combination, we calculated the mean ratio value among the 13 remaining languages. The results presented in Table 9 reveal that both VL.P and L.PV sequences were widely favored across syllable boundaries when compared to the other sequences (VP.L and P.LV). With respect to nasals, the distributional patterns across syllable boundaries revealed another clear preference: N.PV was widely preferred to P.NV in 11 of the 13 languages (the exceptions being French and Yup’ik), and a similar tendency was found between the VN.P and VP.N patterns (the respective ratio values were 2.14 and 0.21).

Table 9. Mean value of the observed/expected ratios for each combination analyzed across syllable boundaries. The underscore gives the position of the consonant (nasal or liquid).

          P._V    _.PV    VP._    V_.P
Nasals    0.27    2.34    0.21    2.14
Liquids   0.32    1.09    0.17    1.50
The ratios indicated that both nasals and liquids demonstrated a clear preference for coda position, while plosives tended to appear in onsets. This result follows the Syllable Contact Law, which predicts that sonority drops maximally across syllable boundaries (Clements, 1990). Similar trends were found for Nasal+Plosive and Liquid+Plosive sequences
across syllable boundaries, even though nasals and liquids behaved differently within syllables. We suggest that this behavior can be explained by articulatory and aerodynamic factors.

In a pilot experiment, the production patterns of the feature [+nasal] were observed for a French male speaker with an electromagnetic articulograph EMA® (Carstens) (Rossato, Badin and Bouaouni, 2003). The velum height was measured with a pellet glued to the inferior area of the velum for VCV sequences, where C = {/p/, /t/, /k/, /b/, /d/, /ɡ/, /f/, /s/, /ʃ/, /v/, /z/, /ʒ/, /m/, /n/, /l/, /ʁ/}. Each consonant was repeated in a symmetrical vocalic context for all French oral and nasal vowels. A series of articulatory measurements showed that the voiceless plosives /p t k/ were always produced with a high velum position and the voiced ones with a slightly lower position. The nasal consonants were produced with a wide range of velum heights (0.8 cm), since their production requires an open velopharyngeal port; great variation due to the vocalic context (high vowels, low vowels, or nasal vowels) was observed in these cases. Figure 2 shows the mean velum height trajectory for the following categories: voiceless plosives, voiced plosives, nasals, and liquids (other categories are omitted in this figure). The same corpus was used to record data on the same speaker with the EVA® workstation (S.Q.LAB) in order to measure intraoral pressure, oral airflow and nasal airflow. Intraoral pressure was estimated using a PVC tube placed in the subject’s mouth and connected to the pressure sensor device of the EVA® workstation. For this reason, measurements of the intraoral pressure were not available for velar stops in /u/ and /o/ contexts, or for uvular consonants in any vocalic context. Figure 3 shows that the intraoral pressure maintained a high value during the closure phase of the voiceless plosives, progressively increased during the voiced plosives, and stayed low for the nasals /n, m/ and the liquid /l/ (the other consonants were not analyzed in this study).
Figure 2. Mean trajectories of velum height during the production of each category of consonant (velum height in cm against time in s; curves for unvoiced plosives, voiced plosives, liquids and nasals).
Figure 3. Intraoral pressure curve for each category of consonant (intraoral pressure in hPa against time in s; curves for unvoiced plosives, voiced plosives, nasals and liquids).
Although these articulatory and aerodynamic data were obtained from VCV sequences, they shed light on possible coarticulation effects between adjacent nasal and plosive consonants. The articulation of Nasal+Plosive
sequences involves both closing the oral tract and lowering the velum to produce the articulation of the nasal consonant. The escape of airflow through the nose prevents an increase in intraoral pressure. To produce the following plosive, the velum must rise until the velopharyngeal port is closed. This gesture is probably facilitated by the increase in intraoral pressure due to the closure of the vocal tract. When the two contiguous consonants have the same place of articulation, velum raising and laryngeal control seem to be sufficient to produce such a sequence. On the other hand, the opposite sequences, Plosive+Nasal, start with both a high velic position and a high intraoral pressure. At the release of the plosive, the intraoral pressure drops, and the velum lowers and opens the velopharyngeal orifice, while the closure of the vocal tract during the articulation of the nasal consonant produces only slight variation in intraoral pressure. When the plosive is unreleased, the intraoral pressure stays high until the velopharyngeal port opens. Figure 2 shows that the mean trajectories of velum height rose slightly during the closure phase of the voiced and unvoiced plosives. This suggests that the high intraoral pressure increases the volume of the vocal tract, pushing the velum upward and causing a widening of the vocal tract in the velar region. Consequently, to produce a following nasal consonant, the velum must be lowered to open the velopharyngeal port despite a high intraoral pressure, which then causes the intraoral pressure to drop. This suggests that the articulatory cost of a Plosive+Nasal sequence, with or without the plosive release, should be higher than the articulatory cost of a Nasal+Plosive sequence: the latter can be produced with a simple raising of the velum. These articulatory and aerodynamic constraints on velum movements could explain the preference in ULSID’s languages for Nasal+Plosive over Plosive+Nasal sequences, regardless of the syllable boundary.
5. Conclusion

Although phoneme sequences in a language exhibit a high degree of complexity, cross-language investigations of common sound sequences indicate that physical and cognitive factors shape, at least partly, the linear organization of phonemes in words and syllables. The non-exhaustive set of observations on cross-language syllabified lexicons proposed in this paper was selected with the aim of better understanding the relationship between sensorimotor capacities and phonology. The present results suggest that
several mechanisms underlie both intra-syllabic and inter-syllabic patterns. They confirm previous findings on CV co-occurrence patterns (MacNeilage, 1998) and also on inter-syllabic patterns involving labial and coronal consonants (MacNeilage and Davis, 2000). In both CV and VC sequences, consonants and vowels often share the same place of articulation. According to MacNeilage’s Frame/Content Theory (1998), these CV patterns stem from the syllabic frame, the basic raising-lowering movement of the jaw. The same trend is observed for the reverse syllabic patterns, with a stronger effect of the frame across nuclei and codas. This indicates that the frame shapes a greater part of languages’ syllabic inventories than MacNeilage and Davis’s (2000) findings suggested. Although only two of these patterns (CV and VC) are “pure frames”, the results reveal that gesture economy plays a role in shaping the preferred intra-syllabic patterns.

Likewise, our analyses showed a more extensive LC effect compared to previous work (MacNeilage and Davis, 2000). We found LC patterns in 15 out of 17 languages, not only in disyllabic words or CVC syllables at the beginning of words, but also in CVC(V) sequences. Contrary to what MacNeilage and Davis claim about the frequency of LC patterns in sound sequences (a preference for “simple first”), the results of recent experimental studies conducted in our laboratory suggest i) a perceptual correlate of the LC effect: listening to a (…)CLCLC(…) sequence, French subjects tend to segment it into LC chunks (Sato, Vallée, Schwartz and Rousset, 2007); and ii) a greater ease of coarticulation in labial-coronal structures compared with coronal-labial ones (Rochet-Capellan and Schwartz, 2005a,b). In this speeded speech production experiment, disyllabic LVCV patterns were produced with a single jaw-opening gesture at a fast speech rate, while CVLV sequences were still produced with two jaw frames. At a normal rate, in the production of both LVCV and CVLV patterns, the lower lip and the tongue apex were each in phase with the onset of the jaw-opening gesture. These results indicate that a CV syllable can be out of phase with the jaw-opening gesture (the frame), which becomes observable under an increased speech rate paradigm, syllabic cycles being in phase with mandibular cycles before and after acceleration (both cycles had a similar duration). According to Rochet-Capellan and Schwartz, the explanatory reason for the LC effect is the easier overlap of the two consonant gestures in LVCV compared to CVLV sequences.

Among the aspects of the sound organization of lexical units, sound sequences involving contiguous nasals and plosives were also examined. A series
of articulatory measurements suggests that the preferred sequences avoid a higher articulatory cost and obey strong aerodynamic constraints on velum movements.

The analyses of preferred sound combinations across languages provide insight into the grounds of the structure of sound sequences in languages. The explanation of these preferred patterns by non-phonological principles stresses the importance of the link between language sound structures and physical factors such as aerodynamic principles, besides other characteristics of the human speech production and perception systems. Our findings shed light on how sound sequences are structured in language with respect to articulatory factors and aerodynamic constraints. They also contribute to the understanding of how, and which, phonological and phonetic patterns form the basis of syllable perception.
Acknowledgements

We are particularly grateful to Ian Maddieson for letting us use ULSID, to Mathieu Maupeu for programming the ULSID interface, and to Thuy Hien Tran for the Vietnamese lexicon. We thank Barbara Davis and Peter MacNeilage, who provided helpful discussion of the findings presented in Sections 2 and 3. The research presented in Section 4 was funded by the GIP ANR AAP project “Dynamique de la nasalité. Émergence et phonologisation des voyelles nasales”. The research in Section 3 was funded by the CNRS-SHS Project Complex Systems in Human and Social Sciences “Pati, papa? Modélisation de l’émergence d’un langage articulé dans une société d’agents sensori-moteurs en interaction”.
References

Anderson, J. M., & Ewen, C. 1987 Principles of Dependency Phonology. Cambridge: Cambridge University Press.
Angoujard, J.-P. 1997 Théorie de la syllabe. Rythme et qualité. Gap: CNRS Éditions.
Bagemihl, B. 1995 Language Games and Related Areas. In J.A. Goldsmith (ed.), Handbook of Phonological Theory: 697-712. Oxford: Blackwell Publishers.
Berlitz (ed.) 1981 Ordbok Fransk-Svensk / Dictionnaire Suédois-Français. Oxford: Berlitz Publishing Company Ltd.
Blevins, J. 1995 The Syllable in Phonological Theory. In J.A. Goldsmith (ed.), Handbook of Phonological Theory: 206-235. Oxford: Blackwell Publishers.
Boë, L.-J., Abry, C., Cathiard, M., Schwartz, J.-L., Badin, P., & Vallée, N. 2005 Comment les exceptions des handicaps révèlent les universaux phonologiques bimodaux : contraintes audiovisuelles des systèmes consonantiques des langues du monde. Faits de Langues 25: 175-189.
Broselow, E. 1995 Skeletal Positions and Moras. In J.A. Goldsmith (ed.), Handbook of Phonological Theory: 175-205. Oxford: Blackwell Publishers.
Carreiras, M., Álvarez, C. J., & De Vega, M. 1993 Syllable frequency and visual word recognition in Spanish. Journal of Memory and Language 32: 766-780.
Clements, G. N. 1990 The role of the sonority cycle in core syllabification. In J. Kingston & M. Beckman (eds.), Papers in Laboratory Phonology 1: Between the Grammar and Physics of Speech: 283-333. New York: Cambridge University Press.
Content, A., Kearns, R. A., & Frauenfelder, U. H. 2001 Boundaries versus Onsets in Syllabic Segmentation. Journal of Memory and Language 45: 177-199.
Davis, B., & MacNeilage, P. F. 1994 Organization of Babbling: A Case Study. Language and Speech 37 (4): 341-355.
De Boer, B. 2000 Emergence of sound systems through self-organization. In M. Studdert-Kennedy, J. Hurford & C. Knight (eds.), The Emergence of Language: Social Function and the Origins of Linguistic Form. Cambridge: Cambridge University Press.
Derwing, B. L. 1992 A 'Pause-Break' task for eliciting syllable boundary judgments from literate and illiterate speakers: preliminary results for five diverse languages. Language and Speech 35 (1-2): 219-235.
Janson, T. 1986 Cross-linguistic trends in the frequency of CV sequences. Phonology Yearbook 3: 179-195.
Kandel, S., Álvarez, C., & Vallée, N. 2006 Syllables as processing units in handwriting production. Journal of Experimental Psychology: Human Perception and Performance 32 (1): 18-31.
Kawasaki, H. 1982 An acoustical basis for universal constraints on sound sequences. PhD thesis, University of California.
Keating, P. A. 1990 Coronal places of articulation. UCLA Working Papers in Phonetics 74: 35-60.
Kenstowicz, M. 1994 Phonology in Generative Grammar. Cambridge, MA: Blackwell.
Kingston, J. 2007 The phonetics-phonology interface. In P. de Lacy (ed.), Handbook of Phonology: 435-456. Cambridge, UK: Cambridge University Press.
Kingston, J., & Diehl, R. 1994 Phonetic knowledge. Language 70: 419-454.
Klein, M. 1993 La syllabe comme interface de la production et de la perception phoniques. In B. Laks & M. Plénat (eds.), De Natura Sonorum : essais de phonologie: 99-141. Saint-Denis: Presses Universitaires de Vincennes.
Krakow, R. 1999 Physiological organization of syllables: A review. Journal of Phonetics 27: 23-54.
Liljencrants, J., & Lindblom, B. 1972 Numerical simulation of vowel quality systems: The role of perceptual contrast. Language 48: 839-862.
Lindblom, B. 1983 On the Teleological Nature of Speech Processes. Speech Communication 2: 155-158.
Lindblom, B. 1986 Phonetic Universals in Vowel Systems. In J.J. Ohala (ed.), Experimental Phonology: 13-44. New York: Academic Press.
Lindblom, B. 2000 The Interplay between Phonetic Emergents and the Evolutionary Adaptations of Sound Patterns. Phonetica 57: 297-314.
Lindblom, B., & Maddieson, I. 1988 Phonetic universals in consonant systems. In L.H. Hyman & C.N. Li (eds.), Language, Speech and Mind: 62-78. London and New York: Routledge.
Locke, J. L. 1983 Phonological Acquisition and Change. New York: Academic Press.
MacNeilage, P. F. 1998 The Frame/Content Theory of Evolution of Speech Production. Behavioral and Brain Sciences 21: 499-511.
MacNeilage, P. F., & Davis, B. 1990 Acquisition of speech production: Frame then content. In M. Jeannerod (ed.), Attention and Performance XIII: Motor Representation and Control: 453-475. Hillsdale, NJ: Lawrence Erlbaum.
MacNeilage, P. F., & Davis, B. 1998 Evolution of speech: The relation between phylogeny and ontogeny. In C. Knight, J. R. Hurford & M. Studdert-Kennedy (eds.), The Evolutionary Emergence of Language: Social Function and the Origin of Linguistic Form: 146-160. Cambridge: Cambridge University Press.
MacNeilage, P. F., & Davis, B. 2000 On the Origin of Internal Structure of Word Forms. Science 288: 527-531.
MacNeilage, P. F., Davis, B. L., Matyear, C. M., & Kinney, A. 1999 Origin of speech output complexity in infants and in languages. Psychological Science 10 (5): 459-460.
Maddieson, I. 1984 Patterns of Sounds. Cambridge: Cambridge University Press.
Maddieson, I. 1986 The Size and Structure of Phonological Inventories: Analysis of UPSID. In J.J. Ohala (ed.), Experimental Phonology: 105-123. New York: Academic Press.
Maddieson, I. 1993 The structure of segment sequences. UCLA Working Papers in Phonetics 83: 1-8.
Maddieson, I., & Precoda, K. 1989 Updating UPSID. UCLA Working Papers in Phonetics 74: 104-111.
Maddieson, I., & Precoda, K. 1992 Syllable structure and phonetic models. Phonology 9: 45-60.
Mawass, K. 1997 Synthèse articulatoire des consonnes fricatives du français. Thèse de doctorat, INP Grenoble, France.
Mawass, K., Badin, P., & Bailly, G. 2000 Synthesis of French fricatives by audio-video to articulatory inversion. Acta Acustica 86: 136-146.
McGowan, R. S., Koenig, L. L., & Löfqvist, A. 1995 Vocal tract aerodynamics in [aCa] utterances: Simulations. Speech Communication 16: 67-88.
Montreuil, J.-P. 2000 Sonority and derived clusters in Raeto-Romance and Gallo-Italic. In L. Repetti (ed.), Phonological Theory and the Dialects of Italy. Amsterdam & Philadelphia: John Benjamins Publishing Company.
Morais, J., Cary, L., Alegria, J., & Bertelson, P. 1979 Does awareness of speech as a sequence of phones arise spontaneously? Cognition 7: 323-331.
Ohala, J. J. 1983 The origin of sound patterns in vocal tract constraints. In P.F. MacNeilage (ed.), The Production of Speech: 189-216. New York: Springer Verlag.
Ohala, J. J., & Jaeger, J. J. (eds.) 1986 Experimental Phonology. Orlando: Academic Press.
Ohala, J. J., & Kawasaki, H. 1984 Prosodic phonology and phonetics. Phonology Yearbook 1: 113-128.
Ohala, J.J., & Kawasaki-Fukumori, H. 1997. Alternatives to the sonority hierarchy for explaining segmental sequential constraints. In S. Eliasson & E.H. Jahr (eds.), Language and Its Ecology: Essays in Memory of Einar Haugen (Trends in Linguistics. Studies and Monographs 100): 343-365. Berlin: Mouton de Gruyter.
Pérennou, G., & de Calmès, M. 2002. BDLEX 50000. French lexical database: Lexical resources V2.1.2. IRIT, Toulouse: ELRA/ELDA.
Plénat, M. 1995. Une approche prosodique de la morphologie du verlan. Lingua 95: 97-129.
Redford, M.A. 1999. An Articulatory Basis for the Syllable. PhD thesis, The University of Texas, Austin.
Redford, M., & Diehl, R. 1999. The relative perceptual distinctiveness of initial and final consonants in CVC structures. Journal of the Acoustical Society of America 106: 1555-1565.
Robb, M., & Bleile, K. 1994. Consonant inventories of young children from 8 to 25 months. Clinical Linguistics and Phonetics 8: 295-320.
Rochet-Capellan, A., & Schwartz, J.-L. 2005a. The labial-coronal effect and CVCV stability during reiterant speech production: An acoustic analysis. Proceedings of INTERSPEECH 2005: 1009-1012. Lisbon.
Rochet-Capellan, A., & Schwartz, J.-L. 2005b. The labial-coronal effect and CVCV stability during reiterant speech production: An articulatory analysis. Proceedings of INTERSPEECH 2005: 1013-1016. Lisbon.
Rossato, S., Badin, P., & Bouaouni, F. 2003. Velar movements in French: An articulatory and acoustical analysis of coarticulation. In M.J. Solé, D. Recasens & J. Romero (eds.), Proceedings of the 15th International Congress of Phonetic Sciences: 3141-3145. Barcelona.
Rousset, I. 2004. Des structures syllabiques et lexicales des langues du monde : données, typologies, tendances universelles et contraintes substantielles. Thèse de doctorat, Université Stendhal, Grenoble, France.
Sato, M., Vallée, N., Schwartz, J.-L., & Rousset, I. 2007. A perceptual correlate of the labial-coronal effect. Journal of Speech, Language and Hearing Research 50: 1466-1480.
Saussure, F. de 1916. Cours de linguistique générale. Paris: Payot.
Schwartz, J.-L., Boë, L.-J., Vallée, N., & Abry, C. 1997. The Dispersion-Focalization Theory of vowel systems. Journal of Phonetics 25 (3): 255-286.
Segui, J., Dupoux, E., & Mehler, J. 1990. The role of the syllable in speech segmentation, phoneme identification and lexical access. In G. Altmann (ed.), Cognitive Models of Speech Processing: 263-280. Cambridge, MA: MIT Press.
Selkirk, E. 1982. Syllables. In H. van der Hulst & N. Smith (eds.), The Structure of Phonological Representations (Part 2): 337-383. Dordrecht: Foris.
Selkirk, E. 1984. On the major class features and syllable theory. In M. Aronoff & R.T. Oehrle (eds.), Language Sound Structure. Cambridge, MA: MIT Press.
Stefanuto, M., & Vallée, N. 1999. Consonant systems: From universal trends to ontogenesis. Proceedings of the 14th International Congress of Phonetic Sciences (3): 1973-1976. San Francisco.
Steriade, D. 1982. Greek Prosodies and the Nature of Syllabification. PhD thesis, MIT.
Stevens, K.N. 1989. On the quantal nature of speech. Journal of Phonetics 17 (1): 3-46.
Stevens, K.N. 2003. Acoustic and perceptual evidence for universal phonological features. In M.J. Solé, D. Recasens & J. Romero (eds.), Proceedings of the 15th International Congress of Phonetic Sciences: 33-38. Barcelona.
Stevens, K.N., & Keyser, S.J. 1989. Primary features and their enhancement in consonants. Language 65 (1): 81-106.
Stoel-Gammon, C. 1985. Phonetic inventories, 15-24 months: A longitudinal study. Journal of Speech and Hearing Research 28: 505-512.
Treiman, R. 1983. The structure of spoken syllables: Evidence from novel word games. Cognition 15: 49-74.
Treiman, R. 1989. The internal structure of the syllable. In G.N. Carlson & M.K. Tanenhaus (eds.), Linguistic Structure in Language Processing: 27-52. Dordrecht: Kluwer.
Treiman, R., & Kessler, B. 1995. In defense of onset-rime syllable structure for English. Language and Speech 38 (2): 127-142.
Trubetzkoy, N.S. 1939. Grundzüge der Phonologie. Travaux du Cercle Linguistique de Prague 7. Translated by J. Cantineau, 1970, as Principes de phonologie. Paris: Klincksieck.
Vallée, N., Boë, L.-J., Schwartz, J.-L., Badin, P., & Abry, C. 2002. The weight of substance in phonological structure tendencies of the world's languages. ZAS Papers in Linguistics 28: 145-168. Berlin.
Vallée, N., Schwartz, J.-L., & Escudier, P. 1999. Phase spaces of vowel systems: A typology in the light of the Dispersion-Focalization Theory (DFT). Proceedings of the 14th International Congress of Phonetic Sciences: 333-336. San Francisco.
Vilain, A., Abry, C., Brosda, S., & Badin, P. 1999. From idiosyncratic pure frames to variegated babbling: Evidence from articulatory modelling. Proceedings of the 14th International Congress of Phonetic Sciences: 1973-1976. San Francisco.
Warren, R.M. 1961. Illusory changes of distinct speech upon repetition – the verbal transformation effect. British Journal of Psychology 52: 249-258.
Structural complexity of phonological systems

Christophe Coupé, Egidio Marsico and François Pellegrino

1. Introduction

In the linguistic tradition, including phonology, complexity has often been evoked when looking for explanatory arguments (a given phenomenon is rarer because it is more complex than another), when looking for a balance of complexity within subsystems of a language, or when directly comparing and ranking several languages along a linguistic dimension (for a review, see Chitoran and Cohn, and Pellegrino et al., this volume). In this perspective, the concept of complexity is intrinsically relative and necessarily amounts to judging something as more or less complex than something else with regard to a particular property, or even globally. Thus, anyone involved in the enterprise of evaluating phonological complexity faces the tricky issue of first defining a set of (phonological) properties and, for each property, a scale of complexity. Then, and only then, can one start comparing the phonological complexity of the chosen phonological elements. In that perspective, "to be complex" or not is a (possibly gradient) quality assigned to a particular set of elements.
This task is anything but straightforward. While choosing the set of properties can be quite simple, characterising them with a scale of complexity is much trickier. Moreover, as soon as one tries to combine several properties of an element to evaluate its overall complexity, the issue of weighting these different dimensions can easily lead to a dead end. In this regard, Maddieson (this volume) is very insightful and presents an excellent summary of where to find and how to define phonological complexity, as well as of the limits of this notion.
Interestingly, an alternative conception of complexity has developed over the last half-century, stemming from cybernetics, systems theory and systems dynamics (see e.g. Abraham, 2001 for an epistemological view). First found in statistical physics, biology and computer science, it has rapidly proven relevant within the humanities and social sciences, and it is now definitively associated with the notion of "complex system". In that framework, a system is or is not complex according to whether its structure and behaviour satisfy particular characteristics.
The picture is thus substantially different from the "arithmetic" view of complexity: one no longer needs to look for the very dimensions on which to compute complexity, more or less objectively, but rather "just" to check whether a system fulfils some properties known a priori. This way, complexity is no longer a relative notion. To illustrate what the properties of complex systems can be, we refer to Steels (1997): "a complex system consists of a set of interacting elements where the behaviour of the total is an indirect, non-hierarchical consequence of the behavior of the different parts […]. In complex systems, global coherence is reached despite purely local nonlinear interactions. There is no central control source".
A system can thus be said to be complex if: i) it is structured in different levels; ii) the properties of the global level (the systemic ones) differ from those of the elements of the basic level; iii) the systemic properties cannot be derived linearly from the basic ones. Seeds of this new paradigm can be found in Warren Weaver’s seminal article (1948) where he emphasized the understanding of organized complexity as one of the key issues to be addressed by modern science. Lying somewhere between the simple problems that could be solved with pre-20th century science, and the disorganized complexity that was handled with new statistical and probabilistic tools in the first half of the 20th century, this complexity involves dealing with a number of factors that do not behave independently, but interact into what Weaver called an “organic whole”. Rather than the basic number of factors or constituents in the system (that would be low for simple problems and potentially high in the case of disorganized complexity), it is the nature of their interrelations that actually matters. The step forward lies precisely in this differentiation of levels where the elements and the properties of each level may differ; and what matters is the way the structure of the systemic level emerges from interactions at the basic level. This view, stemming from the science of complex systems, leads to modifying the way phonological complexity is addressed: we no longer intend to compare the overall complexity of phonological systems in terms of which one is more complex than the others. Instead, we aim at characterising their structure. Explaining why there are so many different structures seems now even more crucial than knowing if one is more complex than another. After all, all languages seem to work with the same efficiency; no
one has ever reported a language with communicative disabilities, or non-impaired children failing to learn a particular language. For all we know, all languages are functionally equal (and all complex enough), and yet, as Ferguson (1978, p. 9) wrote: "As soon as human beings began to make systematic observations about one another's languages, they were probably impressed by the paradox that all languages are in some fundamental sense one and the same, and yet they are also strikingly different from one another".
Indeed, typological research has shown that although certain types of linguistic structures are clearly more frequent than others, even the uncommon ones can be relatively numerous and very different. This coexistence of numerous viable types of linguistic elements and structures, although unevenly distributed, reveals that language is a poorly constrained system, or at least one presenting numerous degrees of freedom.
In this contribution, we develop a study of the structural complexity of the phonological systems of the UPSID database1 in line with these statements. To set the stage, let us just look at the variation present in the languages of the database. They have from 11 to 141 segments, from 3 to 28 vowels, from 6 to 95 consonants, and from 6 to 51 distinctive features. This has to do with the variation of types, but discrepancies are even wider when one looks at tokens. To give a few examples, some segments are present in only one language whereas others can cover up to 90% of the sample; only one language has 28 vowels but more than 20% have five; stop consonants are present in all languages; etc.
These two sources of data (types and tokens) offer different kinds of information. Looking at types raises issues regarding the set of possible phonological elements (be they features, segments or systems), and at first glance the observed diversity could push toward considering phonological systems as simple sets of unorganized segments. However, when compared to the theoretical number of possible combinations of features and segments, the number of attested types is relatively low, showing instead that phonological systems are not randomly composed. Moreover, when looking at tokens, the uneven distribution of types among languages reveals that some systems prevail. Consequently, we need to understand what parameters make one system more widespread than another. This also means, from a methodological point of view, that frequencies of distribution are not an explanation per se and thus should not be considered as inputs to a model, but rather as what is to be explained. They are the emergent properties of an underlyingly organized structure.
The notion of 'emergence' is a key concept of the dynamical complex system framework. As mentioned before, the different structures of the systems are considered as emerging from the specific interactions of their elementary units. To some extent, then, a system can be seen as the reflection of the constraints at work. The citation below, from Björn Lindblom, illustrates this from the diachronic perspective: "The new form [i.e. the new pronunciation that yields a potential sound change] gets tested implicitly on a number of dimensions: 'articulatory ease', 'perceptual adequacy', 'social value' and 'systemic compatibility'. If the change facilitates articulation and perception, carries social prestige and conforms with lexical and phonological structure, its probability of acceptance goes up. If the change violates the criteria, it is likely to be rejected." (Lindblom, 1998: 245).
Again, the notion of "systemic compatibility" pushes forward the idea that the whole (the system) is more than the sum of its parts. Following this line of thinking, in a previous paper (Marsico et al., 2004) we explored phonological inventories (hereafter PIs), assuming that this can lead to an (even partial) understanding of their structure. We began with a bottom-up approach in which we intended to i) establish the different levels of structuration of PIs, ii) identify the properties of each level and iii) characterize the relation(s) between the levels. We obtained reasonable results as far as points (i) and (ii) are concerned, but our approach showed its limits with point (iii), especially when dealing with the systemic level.
The main index we used to monitor the systemic behaviour of PIs deals with the notion of redundancy. We wanted to evaluate the long-standing idea of PIs as economic systems (i.e. the MUAF principle, Maximal Use of Available Features, first introduced by Ohala, 1980). Our redundancy measure evaluates the average distance between each segment of a PI and its nearest neighbour. Although the quantitative results seem to show that PIs are indeed based on a principle of economy favouring systems with minimal phonological oppositions (i.e. based on only one feature), the qualitative analysis of these results revealed that our measure is not really a systemic one. As a matter of fact, our redundancy index deals more with one-to-one relationships between segments than with a collective behaviour. The lowest redundancy index is obtained as long as each segment has its minimal counterpart in the system (i.e. a segment differing by only one feature), without considering any of the underlying systemic principles on which MUAF is based: maximal use of features, consistent series of segments. The lowest index can be obtained
with a system made of what Lindblom calls "a collection of 'assorted bonbons'" (Lindblom, 1998: 250). This has led us to change our perspective and to adopt a top-down approach directly based on the systemic level. We develop this approach in the remainder of this paper. Section 2 deals with a structural approach in which PIs are considered as networks of connected phonemes. In Section 3, PIs are modelled by considering the distribution of co-occurrences of phonemes, in order to define attraction and repulsion relations between them. These relations are then used to propose a synchronic measure of coherence for phonological systems, which is then diachronically extended to a measure of stability.
2. Considering phonological inventories in the light of graph theory

2.1. From a feature-based distance to phonological graphs

2.1.1. About graph theory

Mathematical graph theory, also named network theory, has had a significant impact on various scientific fields during the last decade, for two main reasons. The first is the acknowledgment of the range of this theory, which proposes a set of tools and generic concepts that can be applied to a wide range of questions. The second is linked to the theoretical progress made in the understanding of the properties of networks half-way between regular and random networks (e.g. Erdős and Rényi, 1960). While the detailed analyses of these two specific kinds of networks go back several decades, those of intermediate networks are much more recent, and illustrate the difficulty of apprehending Weaver's "organized complexity". Small-world and scale-free networks are by far the most cited today (e.g. Watts and Strogatz, 1998), since they are very commonly encountered in the study of non-living, living and social phenomena. Several concepts borrowed from graph theory (like the notions of shortest path, robustness, aggregation, hub, or resilience) have led to substantial breakthroughs in a wide range of applications: the functionality and robustness of internet networks, the understanding of the interactions between proteins or within complex ecosystems, or the propagation of epidemics (Dorogovtsev & Mendes, 2001; Pastor-Satorras & Vespignani, 2001). The same holds in linguistics, where scientists have studied the properties of lexical,
syllabic or phonological graphs (Cancho & Solé, 2001; Cancho et al., 2004; Dorogovtsev & Mendes, 2001; Solé, 2004).
In general, a graph is defined by a set of nodes and a set of connections. The way nodes are connected potentially leads to graphs of different types, but for which a common set of properties may be calculated. While some of these properties depend on the size of the network, others may be invariant for a given type of network, regardless of its size.
In our approach, each phonological system is considered as a set of two networks, one for the vowel segments and one for the consonants (diphthongs are not considered so far). For each graph, the segments are the nodes and the connections are derived from phonetic-phonological relations between them, using the algorithm described in the next section.

2.1.2. From phonemes to feature-based phonemic distances

One way of quantifying the relation between any two phonemes is to rely on the features that compose them. In this approach, the degree of interaction corresponds to the distance in terms of features, where these features are compared within the natural classes they belong to. The following examples illustrate this calculation.

                 /i/          /u/
Height           high         high
Backness         front        back
Lip Rounding     unrounded    rounded
=> Distance = 2

                 /o/          /õ:/
Height           high-mid     high-mid
Backness         back         back
Lip Rounding     rounded      rounded
Length           short        long
Nasality         oral         nasal
=> Distance = 2

                 /p/          /v/
Place            labial       labio-dental
Manner           plosive      fricative
Voicing          unvoiced     voiced
=> Distance = 3
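As a minimal sketch of this feature-count distance (our own reading, with illustrative feature names and values rather than the exact coding used here):

```python
# A minimal sketch of the feature-count distance illustrated above. Phonemes
# are dicts mapping feature names to values; the distance is the number of
# features on which two phonemes disagree. Names and values here are
# illustrative, not the exact coding used by the authors.

def feature_distance(p1, p2):
    features = set(p1) | set(p2)
    return sum(1 for f in features if p1.get(f) != p2.get(f))

i = {"height": "high", "backness": "front", "rounding": "unrounded"}
u = {"height": "high", "backness": "back", "rounding": "rounded"}
assert feature_distance(i, u) == 2  # they differ in backness and rounding
```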
This rough distance could certainly be refined by taking the shared features into account as well, but the main problem would remain: the nature of the relation between phonemes is hard to establish a priori, first because of the lack of common ground for their internal description (see the first part of this volume, on the phonological primitives), and second because the principles explaining the relations between phonological segments are still a controversial and open issue. The proposed methodology by no means aims at being the ultimate formalism, but it provides a reasonably adequate balance between the need for some phonetic rationale and the possibility of being consistently applied to any phonological system.

2.1.3. Construction of phonological graphs

With a quantification of the distance between phonemes, we can now turn to the construction of a graph in which all the phonemes of a PI are connected and which fulfils the following principles:
1) There must be a path, direct or indirect, between any two phonemes;
2) This path must be minimal, in a way compatible with the notion of economy or parsimony.
The first principle is consistent with the view of PIs as systems promoted here; no phoneme is isolated within a PI, and consequently each phoneme is related to at least one of the other phonemes of the system. This principle stems from the traditional idea of opposition between phonemes. The second principle aims at selecting, for each phoneme, the connections occurring within its neighbourhood (in terms of phonetic similarity), since we consider that long-range connections are meaningless. The neighbourhood is not defined using an a priori distance (for instance a hard threshold of 3 between segments) but by selecting the path that preserves a minimal cost, as illustrated below2.
Let us concentrate on the potential paths linking /o:/ and /a/ in the five-vowel system given in Figure 1. The direct path (based on the feature distance between the two phonemes) is 4. Besides, there are several indirect paths, such as /o:/ => /e:/ => /a/ or /o:/ => /u/ => /a/, or even /o:/ => /u/ => /i/ => /a/. This last one is especially interesting because the biggest "jump" between two nodes only involves a distance of 2 (in fact all the jumps in this path are of a distance of 2). In our approach, this path is then the least costly since, step by step, it involves skipping from neighbour to neighbour. In this view, the number of jumps doesn't matter; what counts
is their size. For this reason, the direct path (/o:/ => /a/, distance = 4) has been removed from the network in favor of the indirect one.

[Figure 1 here illustrates the three steps of the algorithm on a five-vowel system /i, e:, a, o:, u/. Step 1: the direct phonetic distance is computed for each pair of phonemes, so that each node is linked to every other node, with edges labelled by the phonetic distances. Step 2: pairs of phonemes for which an indirect path requires smaller "jumps" than the direct one are identified (costly paths shown as dotted lines). Step 3: costly direct paths are suppressed; the resulting network only keeps the least costly connections, direct or indirect.]
Figure 1. Description of the algorithm of construction of phonological graphs
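Under our reading of this pruning criterion, a direct edge survives only if no indirect path links the same pair of phonemes through strictly smaller maximal jumps; this is a minimax (bottleneck) condition, sketched below with a Floyd-Warshall-style computation. The distance values partly follow the worked example above (/o:/-/a/ = 4 directly, but reachable through jumps of 2) and are otherwise illustrative.

```python
# A sketch of the graph construction under a minimax (bottleneck) reading:
# start from the complete graph of pairwise feature distances, then drop a
# direct edge whenever some indirect path connects the same pair through
# strictly smaller "jumps".
from itertools import combinations

def prune_graph(nodes, dist):
    # b[x][y]: smallest achievable maximal jump over all paths from x to y,
    # computed with a Floyd-Warshall-style update.
    b = {x: {y: (0 if x == y else dist[frozenset((x, y))]) for y in nodes}
         for x in nodes}
    for k in nodes:
        for x in nodes:
            for y in nodes:
                b[x][y] = min(b[x][y], max(b[x][k], b[k][y]))
    # Keep a direct edge only if no indirect path does strictly better.
    return {(x, y): dist[frozenset((x, y))]
            for x, y in combinations(nodes, 2)
            if dist[frozenset((x, y))] <= b[x][y]}

vowels = ["i", "e:", "a", "o:", "u"]
dist = {frozenset(p): d for p, d in [
    (("i", "u"), 2), (("i", "e:"), 2), (("i", "a"), 2), (("i", "o:"), 4),
    (("u", "e:"), 4), (("u", "o:"), 2), (("u", "a"), 3),
    (("e:", "o:"), 4), (("e:", "a"), 3), (("o:", "a"), 4)]}
pruned = prune_graph(vowels, dist)
assert ("a", "o:") not in pruned  # the costly direct path has been removed
```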
The inspection of the various networks or graphs built with this approach reveals properties close to the classical ones in phonology in terms of serial or derivative structures. The next figure illustrates this point with the most frequent five vowel system in our data on the left, and a ten vowel system on the right, composed with the same five vowels plus their nasal counterparts. A layered structure is visible: the sub-network consisting of the vowels /i, e, a, o, u/ mirrors the one composed of the nasal counterparts, and is connected to it in a regular fashion.
Figure 2. Two examples of phonological graphs
2.2. Measuring the structural complexity of phonological inventories

The previous step explained how we built phonological graphs from PIs; we will now show how we can compare these PIs in terms of structural complexity, using a specific measure relying on the corresponding graphs. The interest of this approach lies in the fact that it is anchored outside phonology and linguistics, and is not the result of an ad hoc measure based on language comparison. As such, it makes it possible to compare the structural complexity of PIs with as little a priori bias as possible. However, estimating the complexity of a graph is not a straightforward matter. Several measures exist (Neel & Orrison, 2006; Jukna, 2006; Bonchev & Buck, 2005) and, as always, choosing one over the others seems to depend on implicit considerations.

2.2.1. The notion of "off-diagonal complexity"

Among the various possible measures found in the literature, our choice fell on the off-diagonal complexity proposed by Claussen (2004). This measure offers different characteristics that parallel simple intuitions linguists have about PIs. Indeed, this measure:
- does not explicitly take graph size into account (i.e. its number of nodes or connections), and consequently does not postulate that a large PI will be more complex than a small one;
- is sensitive to the presence of hierarchical sub-structures in the network, as happens for example when a whole primary system is
contrasted by a secondary feature (see the ten-vowel system in Figure 2 above, or the system of Chipewyan below);
- is minimal for regular graphs and maximal for scale-free graphs, thus providing a benchmark for the structural complexity of PIs (for which a scale-free structure is very unlikely).
The calculation of the off-diagonal complexity follows several steps:
1. Calculation of the degree of each node by counting its connections;
2. Construction of a matrix M defined by M(k1, k2) = number of connections existing between nodes of degree k1 and nodes of degree k2;
3. Calculation of the entropy C of the distribution of the normalized sums m_i of the values of the minor diagonals of M, with the following formula:

$$C = -\sum_{i=0}^{k_{\max}} m_i \log m_i,$$

where $k_{\max}$ is the maximal degree of a phoneme in the graph.
Such a measure can seem complicated, but it is actually able to detect the structural regularities existing at the level of the relations between nodes. Figure 3 gives an example.
Figure 3. The different steps of the calculation of the off-diagonal complexity.
In (a) we have the initial graph, with 19 nodes and 18 connections. In (b) the degree of each node has been added. (c) presents the corresponding matrix M and the sums of the values of its minor diagonals (4, 1, 1, 1, 5, 6 and 0). The resulting off-diagonal complexity is thus:

$$C = -\left( \tfrac{4}{18}\log\tfrac{4}{18} + \tfrac{1}{18}\log\tfrac{1}{18} + \tfrac{1}{18}\log\tfrac{1}{18} + \tfrac{1}{18}\log\tfrac{1}{18} + \tfrac{5}{18}\log\tfrac{5}{18} + \tfrac{6}{18}\log\tfrac{6}{18} + 0 \right) = 1.538$$
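For an unweighted graph, this quantity is straightforward to compute; the sketch below is a minimal implementation of the three steps above (note that reproducing the value 1.538 requires the natural logarithm):

```python
# A sketch of the off-diagonal complexity for an unweighted graph given as a
# list of (node, node) edges. Grouping edges by the degree difference of
# their endpoints is equivalent to summing the minor diagonals of M.
from collections import Counter
from math import log

def offdiagonal_complexity(edges):
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # diagonal_sums[i]: number of edges whose endpoint degrees differ by i.
    diagonal_sums = Counter(abs(degree[a] - degree[b]) for a, b in edges)
    total = sum(diagonal_sums.values())
    return -sum((n / total) * log(n / total)
                for n in diagonal_sums.values() if n > 0)

# A cycle is regular (all degrees equal), so its complexity is minimal (0).
assert offdiagonal_complexity([(1, 2), (2, 3), (3, 4), (4, 1)]) == 0.0
```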
As the preceding graphs have shown, off-diagonal complexity can only be calculated for non-valued graphs, i.e. graphs whose connections have no intrinsic value or weight. This is a serious limitation since, in our approach, the connections stand for distances between phonemes and are thus naturally valued. Because Claussen's measure cannot take this information into account, the only use we make of the distances is in pruning the full graph by removing the costly connections. Figure 4 gives the off-diagonal complexity of several relatively simple PIs, whereas Figure 5 illustrates the possibility of applying it to a much more complicated system.
[Figure 4 here shows four simple vowel systems: (1) the five-vowel system /i, e, a, o, u/, C = 0.64; (2) a five-vowel system with long vowels, C = 0.69; (3) the seven-vowel system /i, e, ɛ, a, ɔ, o, u/, C = 0.69; (4) a seven-vowel system including length contrasts, C = 0.97.]
Figure 4. Simple vowel systems and the values of their off-diagonal complexity.
Figure 5. Off-diagonal complexity of the Chipewyan 14 vowel system; C=0.89.
These few examples are clear evidence of the absence of a direct relation between the number of phonemes and the value of the off-diagonal complexity. Figure 4 compares systems of the same size (4.1 vs. 4.2 for 5-vowel systems, and 4.3 vs. 4.4 for 7-vowel systems) but with different complexity, and systems with the same complexity but different cardinality (4.2 vs. 4.3). The PI in 4.1, with only primary phonemes (or "basic" ones, according to Marsico et al. (2004)), is less complex than the one in 4.2, which has the same number of vowels but a secondary non-contrastive feature (length). The latter system is as complex as the PI in 4.3, which has two more vowels, but all primary ones. Chipewyan (Figure 5) presents a smaller complexity than the PI in 4.4 despite having twice as many vowels, due to its more regular structure.

2.3. Comparisons between phonological inventories from UPSID and random ones regarding off-diagonal complexity

The table below gives the off-diagonal complexity of the vocalic and consonantal systems3 of the whole set of languages from UPSID.

UPSID        Vowel systems   Consonant systems
C mean       0.794           1.670
C min        0               0
C max        1.700           2.379
Std. Dev.    0.313           0.325
No correlation between vocalic and consonantal complexity was found (r² = 0.0006). Thus, there is no compensation between the structural complexity of vowels and consonants (no negative correlation), nor any parallel behaviour between them, such as the smaller the one the smaller the other (no positive correlation). These results confirm, with a different measure, those described in Maddieson (2006). To assess whether the off-diagonal complexity was really capturing meaningful information on PIs, we compared the 451 UPSID PIs with a set of 451 generated PIs. These PIs were randomly composed by picking phonemes (from the whole set of existing phonemes) while respecting the distribution of PI sizes in UPSID. Thus, every UPSID system was matched with a random one of the same size, but whose content did not obey any linguistic motivation. Our hypothesis was that if random and actual systems lead to similar distributions of structural complexity, the off-diagonal complexity is pointless.
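A sketch of this protocol follows; the container names and the `complexity` function, which stands for the whole graph-construction plus off-diagonal pipeline, are our own assumptions:

```python
# A sketch of the comparison with size-matched random inventories.
# `systems` is assumed to be a list of sets of phonemes (one per language),
# `segment_pool` the set of all attested segments, and `complexity` a
# function from a phoneme set to its off-diagonal complexity.
import random
from scipy.stats import ttest_rel

def compare_with_random(systems, segment_pool, complexity):
    real = [complexity(s) for s in systems]
    rand = [complexity(set(random.sample(sorted(segment_pool), len(s))))
            for s in systems]
    # Each real system is paired with a random one of the same size, hence
    # a paired t-test (df = 450 for the 451 UPSID languages).
    return ttest_rel(real, rand)
```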
The table below gives the distribution for the random systems and is to be compared with the previous one.

RANDOM       Vowel systems   Consonant systems
C mean       1.071           1.965
C min        0               1.045
C max        2.106           2.788
Std. Dev.    0.470           0.316
On average, the complexity of random systems is significantly higher than that of real systems, both for vocalic (t(450) = -10.41; p < 0.001) and consonantal systems (t(450) = -13.85; p < 0.001). These results support the idea that this measure of complexity does capture part of the organization of PIs: random systems are more complex, i.e. they are less structured, than real ones. On the other hand, there is a large overlap in the ranges of variation of complexity, especially for vowel systems, where both random and real systems reach a minimal value of zero. A possible interpretation is that the off-diagonal complexity is not discriminative enough, due to a limited number of observed structures for the vowel systems.
In order to further evaluate the performance of the algorithm, we considered the possible variations in terms of complexity among the main linguistic groups to which the UPSID languages belong. Following Maddieson (2006), we separated our sample into the 6 major geographical areas presented in the next table, along with the total number of languages per area and the average vocalic and consonantal complexity values. Two one-factor ANOVAs, independently run on vocalic and consonantal systems, reveal significant differences among the groups (F(5) = 6.02; p < 0.001 and F(5) = 23.25; p < 0.001, respectively). Post-hoc Scheffé tests furthermore reveal that the structural complexity of the vocalic systems of the area "Australia & New Guinea" differs significantly from that of the areas "Europe, South and West Asia" and "East and South-East Asia". Regarding consonantal systems, several areas, when considered in pairs, show significant differences. For example, the "Africa" area presents a complexity significantly greater than any other, except for the "Europe, South and West Asia" area, which presents a very close average value, as shown in the next table:
Area                          Number of    Structural complexity   Structural complexity
                              languages    of vocalic systems      of consonantal systems
Europe, South and West Asia   71           0.90                    1.83
East and South-East Asia      108          0.87                    1.61
Africa                        74           0.79                    1.84
North America                 68           0.73                    1.67
South and Central America     66           0.81                    1.51
Australia and New Guinea      64           0.65                    1.45
2.4. Conclusion

As the previous paragraphs have shown, the off-diagonal complexity seems a promising measure for analyzing the structure of PIs. However, although it coincides rather well with linguists' intuitions when applied to specific systems, when the whole set of PIs from UPSID is considered the distribution of complexity values is very compact, thus limiting comparisons between systems. This is due to the fact that Claussen's measure is better adapted to bigger graphs with more diverse internal structures. In our data, the limited typological structural variation is therefore a problem. One possible improvement would be to take into account the weights of the connections of the graphs (i.e. the phonetic distances between phonemes), but this is not yet possible with this measure. Still, these results suggest the following: (i) there are differences in terms of PI structure among linguistic areas; (ii) there is no relation whatsoever between the complexities of vocalic and consonantal systems; and (iii) real PIs display a degree of regularity that random systems don't.
3. From molecules to phonemes: calculating cohesion and stability for phonological inventories

In this section, we present an alternative approach, also aiming at characterizing systemic features of PIs. We propose a measure evaluating the cohesion of PIs, borrowing the concept of energy so familiar in statistical physics. Unlike the measure introduced in the previous section, which was based on the topology of the systems, this one focuses more on the phonemes and their very interactions within systems. We first present a measure of the interactions between phonemes considered two by two, and then an extension of this measure to the evaluation of the overall cohesion of PIs. Last, an evolutionary model of PIs is presented and commented on. For reasons of clarity, we only describe here the case of vocalic systems, but our approach also applies to full systems, without separating vowels from consonants as done in Section 2.

3.1. On the notion of attraction and repulsion between phonemes

For the topological approach combined with the off-diagonal complexity, the degree of relationship between phonemes was evaluated using a simple feature-based distance. Here, we propose to measure the interaction between two given segments using their patterns of co-occurrence among the languages of the UPSID database. This approach is based on the assumption that if two segments recurrently appear, or fail to appear, together in PIs, an underlying constraint is probably responsible for this pattern. The study of this kind of regularity in the PIs of UPSID has been partially addressed by Clements (2003), who found convincing arguments in favour of the feature economy theory. To do so, Clements studied contingency matrices (see the example below) in order to see whether phonetically close segments have a tendency to attract (to be present in the same PI) or repel each other (to not appear together). He used a χ² test to ensure that only significant interactions are considered, given the intrinsic frequency of each phoneme.
Clements' approach can be continued in two directions. First, the χ² test is limited when rare events are at play – a problem Clements did not have to deal with in his study. A solution is to apply the exact Fisher test instead, which can be used with any number of occurrences; the χ² test is actually just an approximation of the Fisher test, less costly in terms of calculation, but requiring stronger hypotheses to be met.
A second improvement consists in not only considering the cooccurrence of two phonemes A & B, but more generally the four possibilities A & B, !A & B, A & !B, !A & !B where "!" stands for the absence of a phoneme. This allows for capturing a larger set of possibly relevant phenomena than if only considering the case where the two segments are present at the same time. The table below gives the contingency matrix for the phonemes /a/ and /ã/ in UPSID:
         /ã/    !/ã/
/a/       82    310
!/a/       1     58
As we can see, only one language has /ã/ without /a/, whereas 310 others present the reverse situation, /a/ without /ã/. If we only calculate the statistical significance of the co-occurrence of /a/ and /ã/, we are bound to find a rather weak interaction because /a/ has a high intrinsic frequency (it is present in 392 languages out of 451). Nevertheless, the contingency matrix is highly asymmetrical. Taking all four possibilities into account allows us to measure not only the direct interaction between two phonemes of a system, but also the impact of the presence of one of the two when the other is absent: is the system indifferent, or is it going to evolve to "recruit" the missing one or "get rid" of the other? Other pairs of oral vowels and their nasal counterparts follow the same pattern. This particular distribution may be linked to the mechanism of transphonologisation, by which nasal vowels are derived from their oral counterparts by extension of the nasal feature of an adjacent consonant. The nasal vowel cannot appear without the oral one, and the rare cases where it does only arise because the oral vowel disappears afterwards (usually by quality change4). This example clearly illustrates that PIs can reflect diachronic processes, although these are only implicitly expressed. The approach we are following amounts to a binarisation of PIs, as they are now described not only by the set of phonemes they contain but also by the set of all the others they don't contain. For example, a system with 5 vowels (out of the 180 possible in UPSID) will be described by the presence of these 5 vowels AND by the absence of the 175 others.
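A sketch of the counting and testing step, assuming `inventories` is a list of sets of phoneme symbols (one per language); scipy's fisher_exact returns the odds ratio and the exact p-value of a 2×2 table:

```python
# A sketch of the pairwise co-occurrence test.
from scipy.stats import fisher_exact

def contingency(inventories, a, b):
    # Count the four presence/absence combinations for phonemes a and b.
    def n(a_present, b_present):
        return sum((a in inv) == a_present and (b in inv) == b_present
                   for inv in inventories)
    return [[n(True, True), n(True, False)],
            [n(False, True), n(False, False)]]

# For /a/ and /ã/ in UPSID, the table above is [[82, 310], [1, 58]]; a small
# p-value indicates a co-occurrence pattern departing from independence.
odds_ratio, p_value = fisher_exact([[82, 310], [1, 58]])
```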
To quantify the interaction between two phonemes, we took the logarithm of the exact Fisher test. Since the logarithm of probabilities only provides negative values, it is also necessary to evaluate the direction of the interaction: when two phonemes appear together more often than their respective frequencies would predict if they were independent, a "+" sign is given to their interaction, whereas a "-" sign is attributed when the frequency of co-occurrence is lower than expected under the independence hypothesis. Finally, values have been normalized between -1 and +1, and give the strength of the interaction I. Figure 6 below presents some of the strongest interactions found in UPSID PIs.
Figure 6. Dashed lines represent the strongest repelling interactions, and solid ones the strongest attracting interactions.
The relations illustrated in Figure 6 reveal that systems have a tendency to harmonize front and back vowels for a given height: /i/ and /u/ attract each other, like /e/ and /o/ or /ɛ/ and /ɔ/. Another positive interaction groups together the three nasal vowels derived from /i, a, u/. Negative interactions are relevant as well. We notice that the three low vowels repel each other, and so do the pairs /i - ɪ/ and /u - ʊ/. The interactions /i - ʊ/ and /ɪ - u/ may be a reflection of the harmony between /i/ and /u/. Perhaps counterintuitively, the strongest interactions do not involve /a/ with /i/ or /u/, even though these three segments are present together in a vast majority of languages. This can be explained by the fact that these segments are all very frequent (considered independently), so frequent that the Fisher test does not recognize their interactions as significant. The same comment applies when a pair of segments involves an extremely rare secondary feature (like breathy voice or creaky voice, for example); the test is then not powerful enough to assign strong interactions. This limitation prevents saying anything about relations between the most
frequent or the rarest segments. This can seem odd, especially for the very frequent segments (/a/, /i/ and /u/ for instance), for which several theories have proposed explanations of their frequency explicitly based on their interaction (in the line of the maximal or adaptive dispersion theory, Liljencrants & Lindblom, 1972). However, it guarantees that only the information present in the database (and statistically assessed) is considered, without any theoretical a priori. Thus, this approach proposes a theory-neutral point of view that is worth being further explored as a way to access additional information on PIs.
In order to take into account both the interaction and the intrinsic information relative to phoneme frequency in PIs, we also calculate the exact Fisher test for the frequency of distribution of a particular segment, compared to a theoretical frequency of 50%. Segments that are present in less than 50% of the languages are given a negative intrinsic value, and those present in more than 50% a positive one. These values are obtained by a transformation of the result of the Fisher test similar to the normalization used for the interactions. This intrinsic value V is linked to the frequency of phonemes through a nonlinear relation that takes the sampling effect into account.
In the current approach, we only consider pairs of segments, but it is theoretically possible to deal with interactions of n-tuples with n > 2. Nevertheless, the size of the UPSID database would dramatically limit the number of triplets of phonemes for which significant interactions could be assessed.

3.2. From pairs of segments to the whole system

We have defined, on the one hand, the intrinsic value V for individual segments and, on the other hand, the interaction forces I for pairs of phonemes. Since the exact Fisher test, when applied to the interactions, neutralizes the weight of the intrinsic frequency of the segments, these two measures are statistically independent and can thus be combined into a global measure of cohesion. We define this measure as C:
(1)    $$C(S_v) = \sum_{P_i} V(P_i) + \sum_{\substack{P_i, P_j \\ i \neq j}} I'(P_i, P_j)$$
In this equation, $S_v$ is a vocalic system, $P_i$ and $P_j$ are vowels, and $I'(P_i, P_j) = I(P_i, P_j)$ if $P_i$ and $P_j$ are both present, $I(!P_i, P_j)$ if $P_j$ is present
and $P_i$ absent, etc. This way, we integrate, for each pair of segments of a system, the relevant combination among the four possible ones (not only present-present). Besides the fact that this doesn't discard useful information, it makes the global cohesion independent of the size of the PI. One potential drawback, though, is the smoothing of the values of C, and thus the resulting small range of variation between PIs. However, the study of the PIs' distribution remains relevant.
Our approach echoes Pablo Jensen's in a recent economic study of the interactions between retail stores (Jensen, 2006). In his work, the interactions between the various stores, positive or negative, are calculated on the basis of the frequency of their co-occurrences in a close neighbourhood (which plays a role similar to PIs in our approach). All the interactions are then summed to calculate an energy value – corresponding to our value I – characterizing the organization of an economic and geographic space. This measure can also be calculated for any new potential store, in order to evaluate its fitness in the anticipated location.
Our approach also echoes previous research on the maximization of perceptual distance, by replacing phoneme-to-phoneme perceptual similarity with synchronic phonemic interactions measured from UPSID. On the one hand, this definitely limits the explanatory power, since the interactions revealed probably result from several factors without these being really identified. On the other hand, it enables us to examine the phonological system as a whole, and not only the primary vowels for instance, since all possible pairs of phonemes can be considered. Moreover, it provides a way to reveal interactions that would have been ignored in other, more traditional approaches. Still, an important drawback remains, since a kind of circularity is present: we a priori use the frequency of distribution of segments to produce results on the same inventories.
The concept of cohesion, defined as above, may intuitively be connected with a kind of global fitness of PIs: a system consisting of a set of antagonistic phonemes that tend to repel each other would be ill-fitted; vice versa, a well-fitted system would consist of phonemes that go well with each other. Yet this approach strongly relies on the implicit postulate that summing, over the inventory, the one-by-one interactions within each pair of phonemes is able to capture the complexity of the whole system. We thus hypothesise that we are dealing with a nonlinear second-order complexity, and not a higher-order one. If, based on this hypothesis, we obtain good results, for example in the prediction of the evolution of PIs, then it would seem reasonable to say that PIs are of a relatively "small" complexity compared to other systems with complexities of higher order. More explicitly, it would indicate that the model based on second-order interactions is a good approximation. On the contrary, if no valid result is reached, higher-order complexity (involving patterns of interactions with 3 or more segments) might be assumed.
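A sketch of equation (1) follows, assuming two precomputed tables whose names are ours: V maps each phoneme to its intrinsic value, and I_PRIME maps a pair of phonemes, each with a presence flag, to the normalized interaction I' for the relevant one of the four combinations. We read the intrinsic sum as running over the phonemes present in the system, while the interaction sum runs over all pairs of possible phonemes, present or absent:

```python
# A sketch of equation (1). V and I_PRIME are assumed precomputed from the
# database: V[p] is the intrinsic value of phoneme p, and
# I_PRIME[(p, p_present, q, q_present)] the normalized interaction I' for
# the relevant presence/absence combination. Both table names are ours.
from itertools import combinations

def cohesion(system, all_phonemes, V, I_PRIME):
    # Intrinsic part, read here as a sum over the phonemes of the system.
    c = sum(V[p] for p in system)
    # Interaction part: every pair of possible phonemes contributes through
    # the relevant one of the four combinations (present-present,
    # present-absent, absent-present, absent-absent).
    for p, q in combinations(sorted(all_phonemes), 2):
        c += I_PRIME[(p, p in system, q, q in system)]
    return c
```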
3.3. Cohesion of the UPSID phonological inventories

The presentation of the results starts with a comparison of the vowel systems of the 451 languages from UPSID with random systems, distinguishing the contribution of the intrinsic value V from the impact of the interactions I (Figures 7 to 9).
Figure 7. Intrinsic values V for UPSID vowel systems (in dark) and random systems (in grey). Standard Deviation bars are displayed for the distribution of random systems.
Figure 7 shows that the intrinsic values V are higher for real systems than for random ones. Besides, V tends to decrease when the size of the system increases, corresponding to the appearance of rarer segments in the system. Looking at the maximal values of V reached for given sizes, one can observe the following hierarchy, where S(n) is the system of size n with maximal intrinsic value: V{S(5)} > V{S(6)} > V{S(7)} > V{S(4)} > V{S(3)}.
Figure 8 deals with the interaction forces. I is higher, on average, for real systems than for random systems, although the distributions overlap. The overlap decreases for larger inventories, since I increases for a significant proportion of real systems while, on average, it monotonically decreases for random systems.
Figure 8. Interaction values I for UPSID vs. random vowel systems. Color code is the same as for Figure 7.
A plausible explanation is that the more the size of the system increases, the more likely it is to contain phonemes with a low intrinsic value V; however, the recruited phonemes have a tendency to interact positively with each other (high I).
Figure 9 represents the global measure of cohesion C. At first sight, the results are similar to those for the intrinsic values V; the main reason is that the range of variation of I is much smaller than that of V (this is visible by comparing the ordinate scales of Figures 7 and 8). There are, however, differences to be highlighted with respect to Figure 7: the ranking of the systems with the strongest cohesion for given sizes leads to the following order: C{S(5)} > C{S(7)} > C{S(3)} > C{S(6)}. As with the intrinsic value, the maximum is obtained for a 5-vowel system (/i, e, a, o, u/), but the rest of the hierarchy is different, suggesting that the interactions play a role in the global cohesion of a system.
Figure 9. Global cohesion value for UPSID vs. random vocalic systems. Dinka is circled.
It is worth noticing that Dinka (Nilotic family, circled on the graph), with its 13 vowels, doesn't follow the general trend of UPSID PIs but falls within the variability of random systems. This system is indeed extremely uncommon, since it contains 7 breathy-voiced and 6 creaky-voiced vowels, with no modal-voiced vowels (although we should probably take such a description of Dinka with caution). This fact decreases the intrinsic value of the system, even though its interactional strength is not especially low (3.40). As a last remark, we would mention that none of the most cohesive systems violates principles such as symmetry or the gradual filling of the vocalic space, even though these principles are not directly specified in the calculations we have presented. Among the most cohesive systems, the first to use a secondary feature is a ten-vowel system in which the five vowels /i, e, a, o, u/ are contrasted in terms of nasality.
3.4. From static measures of cohesion to evolutionary dynamics: the notion of stability

Using the measure of cohesion of PIs as a fitness measure, we can now build a relatively simple model of stochastic evolution in which various possible evolutionary trajectories are implemented and evaluated. The main driving force (and hypothesis) of this model is that a change is more likely to happen if it increases the global cohesion of the system in which it takes place. This does not imply that changes decreasing the cohesion of a PI are impossible, for example under the influence of social constraints, but such changes are less likely to happen and are consequently rarer in the simulations. The evolutionary algorithm proceeds as follows:
* For a given PI S, 100 new systems are built, differing from S by 0, 1 or more segments. These systems represent possible evolutions from S to a neighbouring system.
* The probability of each potential evolution is calculated by comparing its global cohesion with that of S, and then normalizing the differences in order to obtain a set of probabilities ranging from 0 to 1. The changes leading to an increase in cohesion have the highest probabilities.
* A system among the 100 is randomly chosen according to this distribution of probabilities. This system is considered as the new state of system S.
Several mechanisms have been tested for exploring the energetic landscape of a given PI S, as well as for normalizing the differences between initial and final cohesion. They all lead to comparable results.
Our model can test several evolutionary routes and then estimate the stability of a system as a function of its cohesion compared to that of its neighbouring systems. The stability is evaluated as follows: given a particular system, we consider 500 independent evolutionary hypotheses (like the one described above) and evaluate the percentage of evolutions that maintain the system in its initial state (no change). The more cohesive the system S compared to its neighbours, the more likely the continuation of this state, and thus the higher the stability. Vice versa, a system surrounded by more cohesive systems is unstable and very likely to change.
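A sketch of one evolutionary step and of the stability estimate follows, under our reading of the algorithm; `mutate` (returning a neighbouring system that differs by 0, 1 or more segments) and a one-argument `cohesion` wrapper around the measure of Section 3.2 are assumed to exist:

```python
# A sketch of the stochastic evolution model described above.
import random

def evolve_once(system, n_candidates=100):
    candidates = [mutate(system) for _ in range(n_candidates)]
    gains = [cohesion(c) - cohesion(system) for c in candidates]
    # Turn cohesion differences into selection probabilities: increases in
    # cohesion become most likely, but decreases remain possible. This
    # normalization is one choice among those reported to give comparable
    # results.
    lowest = min(gains)
    weights = [g - lowest + 1e-9 for g in gains]
    return random.choices(candidates, weights=weights, k=1)[0]

def stability(system, n_runs=500):
    # Share of independent one-step evolutions leaving the system unchanged.
    return sum(evolve_once(system) == system for _ in range(n_runs)) / n_runs
```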
Figure 10 presents several indicators derived from the stability simulation for UPSID vowel systems. For a given size of system (X axis), the graph displays the stability of the least stable system (diamonds), that of the most stable one (triangles), and the average stability of all the systems of that size (squares).
Figure 10. Stability of UPSID vowel systems sorted by increasing size. Numbers along the top curve give the size of the corresponding system.
Interestingly, whereas the maximal global cohesion decreases as soon as systems reach 7 or 8 segments, maximal stability is still high even for 12 or 14 vowels. Thus, the simulated evolutions reveal that large systems can play the role of stable attractors even if their cohesion is not particularly high; what matters is that it is higher than that of their neighbours. Another interesting result is the change of mode that operates at sizes close to 9 vowels: for smaller systems, odd numbers of vowels are the most stable (empty triangles in the graph), whereas for larger systems, stability comes with an even number of segments (full triangles). This particularly salient effect is worth linking to the change of organization observable in the contents of phonological systems precisely around 9 vowels (Vallée, 1994). Below this threshold, we mostly find primary vowels in systems, whereas above it, systems tend to reorganize into series, contrasting a primary set of vowels with a secondary feature. In this regard, Kolokuma Ijo (spoken in Nigeria) is a good example, as it has 18 vowels: 9 different qualities
and their 9 nasal counterparts; it turns out to be the most stable 18-vowel system in UPSID (around 63%).
4. Conclusions and perspectives

Phonological systems, because of their variety and their structure, constitute an archetype of complex organized systems. They reflect physiological, cognitive and linguistic constraints together, as well as socio-linguistic ones linked to the interactions between speakers. Our understanding of these constraints, of their interactions and of their impact on the evolution of the systems themselves is still limited, due to their complexity. In this picture, the science of complexity provides particularly powerful tools to shed new light on the issues at hand, especially to better understand the connections between the microscopic level (the phonological constituents) and the macroscopic level (each system, considered as a whole). However, the development of these tools and their adaptation to linguistics are not straightforward, and even if the first results are promising, we must keep this difficulty in mind.
The different approaches developed in this paper aim at extracting the intrinsic information hidden in a typological database of phonological inventories, avoiding as much as possible the traditional a priori assumptions of linguistic theories. More precisely, we paid attention to two factors of the complexity of PIs: structural complexity and interactional complexity. The issue of the hierarchy of phoneme complexity was partially addressed in a previous paper (Marsico et al., 2004), by correlating the frequency of occurrence of phonemes in PIs with their capacity to generate new phonemes derived by the addition of secondary features.
The methodology used to evaluate structural complexity comes from graph theory; it tries to take into account some regular patterns of organisation potentially important in phonological systems (such as principles of economy or symmetry) and to dissociate the effects of the topology of a system from those of its size. These metrics shed light on the fact that the systems of the languages of the world are globally more structured than randomly constituted ones, and that significant topological differences exist between linguistic areas. Furthermore, the complexity of consonantal systems is higher than that of vocalic systems. This last result raises a recurrent question relative to phonological systems: can we apply the same analysis to consonantal and vocalic spaces?
For a while, the structural differences seemed irreducible (discontinuity vs. continuity, different acoustic cues and articulatory gestures). However, we think, following Lindblom and Maddieson (1988), that a common theory is possible, at least concerning the main internal principles structuring these spaces, with a balance between a perceptual principle of sufficient contrast and an articulatory principle of least effort or economy.
Regarding the interactional complexity of PIs (and their intrinsic complexity as well), we used a methodology directly inspired by the interaction forces within physical systems and by the calculation of the resulting energy of the system. We have thus calculated indices of intrinsic value (linked to the identity of the different phonemes of a PI) and of interactional force (linked to the reciprocal influences of phonemes on each other) for all the vocalic systems of our database. Here again, this approach has shown that a significant difference exists between real and random systems. Furthermore, these measures confirm that when the size of the vocalic system increases, the existing phonemes have a strong reciprocal positive influence (i.e. the interactional force increases), partially compensating for the fact that these phonemes may have a smaller intrinsic value. This compensation, similar to the positive feedback loops typical of numerous complex systems, may furthermore be a determining factor in the mechanisms of evolution of phonological systems.
To test this hypothesis, we modelled a stochastic evolution of PIs based on real systems, taking into account the global cohesion reached by the systems at each step of the evolution. This led us to estimate a stability value for each system, and it showed that even systems with a relatively small cohesion (most often because of a large number of vowels) are judged stable by the model. Moreover, stable systems with more than 9 vowels use two distinct series of vowels (oral vs. nasal, for example), thus illustrating a principle of feature economy or parsimony. We see here an emergent regularity which, if not predictable on the basis of cohesion alone, is nevertheless compatible with the fact that systems with a relatively large number of vowels are not rare (45% of UPSID languages have 8 or more vowels).
More than the results themselves, our intention is to validate the fact that an interdisciplinary approach coming from the science of complexity allows the effective extraction of relevant information from PIs. This should not hide the fact that various issues are still at stake and require further study. One of the most important is to define an approach that better validates the predictions of the evolutionary model, despite the limited size of UPSID. As a matter of fact, the frequency of distribution of
systems (which can be linked to some extent to the quality, or fitness, of their response to synchronic and diachronic constraints) cannot be properly estimated from this database. If this major problem is solved, we will be able to evaluate more precisely whether second-order complexity (the consideration of pairwise interactions at the microscopic level) gives a good approximation of the fitness of systems, or whether the relations between the micro- and macroscopic levels are even more complex. Finally, another interesting avenue is the study of the evolutionary routes themselves, so as to discover potential attractors and attested cyclic trajectories in the history of languages. In the long term, the instantiation of these elements within a multi-agent model will allow us to address the external, socio-linguistic factors of evolution as well, and to confront this approach with rich theoretical frameworks such as the one proposed in Mufwene (2001).
Notes

1. All our data come from a slightly modified version of the UPSID database (Maddieson, 1984; Maddieson & Precoda, 1990), which contains 451 languages balanced with regard to geographical distribution and genetic affiliation.
2. This approach was selected from among several potential methods because it preserves interesting properties in terms of structure (see below).
3. Since the sets of features describing vowels and consonants are disjoint, we applied the algorithm separately to the two sub-systems.
4. The only language in UPSID presenting that situation is Kashmiri, with an opposition between /ɒ/ and /ã/.
References

Abraham, R. 2001. The genesis of complexity. Unpublished ms., available at http://www.ralphabraham.org/articles/MS%23108.Complex/complex.pdf (consulted in December 2007).
Bonchev, D. and Buck, G. A. 2005. Quantitative measures of network complexity. In Bonchev, D. and Rouvray, D. (eds.), Complexity in Chemistry, Biology and Ecology. New York: Springer.
Cancho, R. F. i. and Solé, R. V. 2001. The small world of human language. Santa Fe Institute Working Paper 01-03-016.
Cancho, R. F. i., Solé, R. V. and Köhler, R. 2004. Patterns in syntactic dependency networks. Physical Review E 69: 051915.
Claussen, J. C. 2004. Off-diagonal complexity: A computationally quick complexity measure for graphs and networks. arXiv preprint q-bio.MN/0410024.
Clements, G. N. 2003. Feature economy as a phonological universal. In Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, pp. 371-374.
Dorogovtsev, S. N. and Mendes, J. F. F. 2001. Language as an evolving word web. Proceedings of the Royal Society of London, Series B, Biological Sciences 268: 2603-2606.
Erdős, P. and Rényi, A. 1960. The evolution of random graphs. Magyar Tud. Akad. Mat. Kutató Int. Közl. 5: 17-61.
Ferguson, C. A. 1978. Historical background of universals research. In Greenberg, J. H., Ferguson, C. A. and Moravcsik, E. A. (eds.), Universals of Human Language, vol. 1, pp. 61-93. Stanford, CA: Stanford University Press.
Jensen, P. 2006. Network-based predictions of retail store commercial categories and optimal locations. Physical Review E 74: 035101.
Jukna, S. 2006. On graph complexity. Combinatorics, Probability & Computing 15: 1-22.
Liljencrants, J. and Lindblom, B. 1972. Numerical simulation of vowel quality systems: the role of perceptual contrast. Language 48: 839-862.
Lindblom, B. 1986. Phonetic universals in vowel systems. In Ohala, J. and Jaeger, J. (eds.), Experimental Phonology, pp. 13-44. Orlando: Academic Press.
Lindblom, B. 1998. Systemic constraints and adaptive change in the formation of sound structure. In Hurford, J. R., Studdert-Kennedy, M. and Knight, C. (eds.), Approaches to the Evolution of Language, pp. 242-264. Cambridge: Cambridge University Press.
Lindblom, B. and Maddieson, I. 1988. Phonetic universals in consonant systems. In Li, C. and Hyman, L. (eds.), Language, Speech and Mind, pp. 62-78. London: Routledge.
Maddieson, I. 1984. Patterns of Sounds. Cambridge: Cambridge University Press.
Maddieson, I. 2006. Correlating phonological complexity: data and validation. Linguistic Typology 10.1: 106-123.
Maddieson, I. and Precoda, K. 1990. Updating UPSID. UCLA Working Papers in Phonetics 74: 104-111.
Marsico, E., Maddieson, I., Coupé, C. and Pellegrino, F. 2004. Investigating the hidden structure of phonological systems. In Proceedings of the 30th Meeting of the Berkeley Linguistics Society, Berkeley, pp. 256-267.
Mufwene, S. S. 2001. The Ecology of Language Evolution. Cambridge: Cambridge University Press.
Neel, D. L. and Orrison, M. E. 2006. The linear complexity of a graph. The Electronic Journal of Combinatorics 13.
Ohala, J. J. 1980. Moderator's summary of symposium on 'Phonetic universals in phonological systems and their explanation'. In Proceedings of the 9th International Congress of Phonetic Sciences, Vol. 3, pp. 181-194. Copenhagen: Institute of Phonetics.
Pastor-Satorras, R. and Vespignani, A. 2001. Epidemic spreading in scale-free networks. Physical Review Letters 86: 3200-3203.
Schwartz, J. L., Boë, L. J., Vallée, N. and Abry, C. 1997. The Dispersion-Focalization Theory of vowel systems. Journal of Phonetics 25.3: 255-286.
Solé, R. V. 2004. Scaling laws in language evolution. In Cioffi, C. (ed.), Power Laws in the Social Sciences. Cambridge: Cambridge University Press.
Steels, L. 1997. The synthetic modeling of language origin. Evolution of Communication 1: 1-34.
Vallée, N. 1994. Systèmes vocaliques : de la typologie aux prédictions. Thèse de doctorat, Grenoble: Université Stendhal.
Watts, D. J. and Strogatz, S. H. 1998. Collective dynamics of 'small-world' networks. Nature 393: 440-442.
Weaver, W. 1948. Science and complexity. American Scientist 36: 536.
Scale-free networks in phonological and orthographic wordform lexicons

Christopher T. Kello and Brandon C. Beltz

1. Competing constraints on language use

Languages are constrained by the physical, perceptual, and cognitive properties of human communication systems. For instance, there are upper bounds on the amount of time available for communication. These bounds constrain the lengths of phonological and orthographic codes so that communication can proceed apace. There are also constraints on the amount of linguistic information that can be condensed into a given span of perception or production (Liberman et al., 1967). These constraints place lower bounds on the amount of speech activity needed for phonological and orthographic codes.

Constraints on languages often work in opposition to one another, perhaps the most famously proposed example being Zipf's principle of least effort (Zipf, 1949). On the one hand, memory constraints produce a tendency towards using fewer words, so as to reduce the memory effort needed to store and access them. A vocabulary that requires minimal memory effort on the part of the speaker is one that uses a single word for all purposes. On the other hand, ambiguity constraints produce a tendency towards using more words, so as to reduce the number of meanings per word and thereby the effort needed to disambiguate word meanings. A vocabulary that requires minimal disambiguation effort on the part of the listener is one that uses a different word for every distinct concept. The principle of least effort states that natural languages are constrained to minimize both speakers' and listeners' efforts, and that only by balancing them can effective communication be achieved.

It is generally accepted that language usage must strike a balance between these two kinds of effort. However, Zipf controversially claimed that the principle of least effort is responsible for a particular kind of scaling law (also known as a power law) that appears to hold of word usage throughout the world. The scaling law states that the probability of using a given word W in language L is approximately inversely proportional to its frequency rank,
P(W_L) ≈ r^α

where α ≈ –1. For instance, the highest ranked word in English (THE) is about twice as likely to occur as the second highest ranked word (OF), which is in turn about twice as likely as the fourth highest ranked word, and so on. This scaling law in the distribution of word frequencies means that a few words are used very often and most words are used rarely. This dichotomy creates a combination (balance) of high frequency words requiring little memory effort (because they are general-purpose words used often in many different contexts) and low frequency words requiring little disambiguation effort (because they are specialized words with particular meanings and contexts).

The connection between word frequency and word meaning is evident, for instance, in the fact that closed-class words tend to be the most frequent words of their language, and also appear in the most general contexts (e.g., the English word THE may be followed by virtually any noun, adjective, or adverb, albeit some words follow more frequently than others). Rare words are often from highly specialized domains and therefore appear in very particular contexts (e.g., terms specific to a given profession). Zipf's law transparently corresponds to a continuous balance across the frequency range, from minimizing memory effort in the few frequent, context-general words, to minimizing disambiguation effort in the many rare, context-specific words (see Morton, 1969). This balance is present at all measurable scales because the function between word frequency and frequency rank is the same regardless of the scale at which these variables are measured (i.e., the relation is invariant over multiplication by a common factor).

The idea that Zipf's principle of least effort leads to this scaling law makes some intuitive sense, but Zipf never gave a rigorous proof of it. More problematically, other candidate hypotheses came to light that appeared to provide simpler explanations. Mandelbrot (1953), Miller (1957), and Li (1992) each showed that scaling-law frequency distributions can be obtained from texts composed of random letter strings. Their proofs have led many researchers to discount such distributions as inevitable and therefore trivial facts of language use. However, others have pointed out that corpora composed of random strings differ in important ways from natural language corpora (Tsonis, Schultz, & Tsonis, 1997). For instance, the most frequent random strings are necessarily those of middling length, whereas in natural languages these
tend to be the shorter words. Random strings also cannot speak to the relationship between word frequency and word meaning. More generally, random strings do not have the capacity for structure that real wordforms require. Thus it appears that random strings exhibit scaling laws because string frequency has a particular relationship with string length, but this relationship is not what creates scaling laws in real word frequencies.

1.1. Criticality in language use

Spurred by the inadequacies of random string accounts, Ferrer i Cancho and Solé (2003) conducted an information theoretic analysis to investigate Zipf's hypothesized connection between the principle of least effort and scaling-law frequency distributions. The authors showed that, under fairly general assumptions, the balance of memory effort and disambiguation effort produces a scaling law in the frequency distribution of word usage. Their analysis was motivated by theories of critical phenomena developed in the area of physics known as statistical mechanics (Huang, 1963; Ma, 1976). The aim of statistical mechanics is to describe the probabilistic, ensemble (global) states of systems with many interacting components.

Ferrer i Cancho and Solé (2003) modeled communication systems by treating language users as system components and word usage as the result of component interactions. From this perspective, ensemble states correspond to distributions of word usage, and the authors focused on two kinds of distributions that often constitute opposing phases of a system's behavior. One phase is characterized by high entropy, in that the system may exhibit different behaviors with roughly equal probability (i.e., a flat probability distribution). The other is characterized by low entropy, in that some behaviors occur much more often than others (i.e., a peaked probability distribution). In this framework, the high entropy phase corresponds to minimizing disambiguation effort, in that many different words are used in order to distinguish among many different meanings (i.e., a relatively flat probability distribution of word usage). The low entropy phase corresponds to minimizing memory effort, in that only one or a few words are used for most meanings (i.e., a relatively peaked probability distribution of word usage). As explained earlier, an effective communication system is one that strikes a balance between these two opposing phases.
Theory from statistical mechanics is useful here because it has been shown that, when complex systems transition between phases of low and high entropy, the transition often occurs abruptly rather than gradually (Ma, 1976). In thermodynamic terms, low memory effort and low disambiguation effort may be two opposing phases of the communication system, with a sharp phase transition between them. Systems poised near phase transitions are said to be in critical states, and critical states are known to universally exhibit scaling laws in their behaviors, including scaling-law distributions like Zipf's law (Bak & Paczuski, 1995). Thus evidence of Zipf's law suggests that communication systems tend to be poised near critical states between phases of low memory effort and low disambiguation effort.

To investigate this hypothesis, Ferrer i Cancho and Solé (2003) built a very simple, information theoretic model of a communication system, and they optimized the model according to two opposing objectives: to minimize the entropy of word usage on the one hand (minimizing memory effort), while also minimizing the entropy of meanings per word on the other (minimizing disambiguation effort). These entropies were opposed to one another, and the model contained a parameter that governed their proportional influence on communication. Model results revealed a sharp transition between the phases of low memory effort and low disambiguation effort. Moreover, Zipf's law was obtained when communication was poised near this phase transition.

These simulation results provide a theoretically grounded explanation of Zipf's law, but one might question whether the authors have built a bridge too far: why would theories of critical phenomena developed for physical systems apply to systems of human communication? The answer is that systems in critical states exhibit general principles of behavior that hold true regardless of the particular kinds of components that comprise the system, a phenomenon known as universality in theoretical physics (Sornette, 2004). Thus interacting atoms, interacting words, or interacting people may all share certain principles of emergent behavior.
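For readers who wish to check the rank-frequency scaling discussed above on their own data, the following is a minimal sketch (ours, not the authors'; Python is assumed, and the `tokens` variable is a hypothetical stand-in for any tokenized corpus). It fits a line to log frequency versus log rank; for large natural-language corpora the slope typically comes out near –1, though ordinary least squares on rank data is only a rough estimator.

```python
from collections import Counter
import math

def zipf_exponent(tokens):
    """Rough estimate of the Zipf exponent alpha from a token list:
    slope of a least-squares line fit to log(frequency) vs. log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy illustration only; reliable estimates need a large corpus.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
print(zipf_exponent(tokens))
```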
2. Competing constraints on wordform lexicons

If principles of criticality are general to language systems, then scaling laws analogous to Zipf's law should be found wherever a language system is poised near a phase transition between low and high entropy. In the present study, we adopt and adapt Ferrer i Cancho and Solé's (2003) information
theoretic analysis to investigate an analogously hypothesized phase transition in language systems. The language domain that we focus on is wordform lexicons. For the sake of simplicity, let us represent wordforms as linear strings of phonemes or letters. The appearances of words in speech or text can be coarsely represented as such strings, in which case a wordform lexicon consists of all strings that appear as wordforms in a given language (token information about individual appearances is discarded). Language users must know their wordform lexicons to communicate, and thus communication constraints should apply to lexicon structure, just as they apply to word usage (the latter being defined in terms of token information instead of lexicon structure).

We investigated two competing constraints on lexicon structure that are analogous to the ambiguity and memorability constraints hypothesized for Zipf's law, namely, distinctiveness and efficiency constraints. On the one hand, the efficiency of lexicon structure should be maximal in order to minimize the resources needed to represent it. Analogous to Zipf's law, a maximally efficient lexicon is one that uses the fewest letter strings necessary to distinguish among all wordforms. This means that letter strings are reused across wordforms as much as possible. If one allows homophones or homographs to occur without limit (i.e., using the same wordforms to represent multiple word meanings, as in /miːt/→MEAT or MEET for homophones, and WIND→/wɪnd/ or /wɑɪnd/ for homographs), then a maximally, indeed overly, efficient lexicon would use only one letter string to code all words.

On the other hand, the mutual distinctiveness of wordforms in a lexicon should be maximal in order to minimize the chance of confusing them with each other during communication. A distinctive lexicon is one whose wordforms use unique letter strings, with minimal substring overlap across wordforms. For instance, the English orthographic wordform YACHT is distinctive because substrings like YACH, ACHT, YAC, and CHT are not themselves English wordforms (note that substrings are position-independent). By contrast, the wordform FAIRED is less distinctive because FAIR, AIR, AIRED, IRE, and RED are all wordforms themselves. Note that a wordform like FARED would also be less distinctive (e.g., FAR, ARE, RED, FARE), even though FARED is not a substring of FAIRED.

We define these competing constraints in terms of all substrings (i.e., wordforms of all sizes) because there does not appear to be any privileged scale of substring analysis. One can see this in the fact that, collectively
speaking, the languages of the world use all scales of substrings to express their phonological, orthographic, and morphological structures. In English, for instance, some inflectional morphemes are expressed as single letters (e.g., -s for pluralization), whereas others conveying whole word meanings are expressed by strings as large as the wordforms themselves. Between these extremes one can find morphological structures expressed as substrings at any given scale, in any given position.

Because distinctiveness and efficiency constraints are defined over all substrings, an analysis of any given language will include substrings that are not linguistically relevant to the wordforms containing them. For instance, the wordform RED does not correspond to a linguistic unit in the wordform FAIRED, yet it is included below in our analysis of an English orthographic wordform lexicon. Conversely, substrings will not capture all possible morphological structures (e.g., infixes in languages like Hebrew). One-to-one correspondence between substrings and linguistic structures is not necessary for our analysis, because substrings are not meant to capture all the factors that might help to shape a wordform lexicon; this would not be feasible. Substrings are only meant to capture one facet of the hypothesized balancing act between distinctiveness and efficiency, albeit a salient one.

The face validity of our analysis can be seen in the functional importance of balancing distinctiveness and efficiency: if distinctiveness is over-emphasized, then structure will not be sufficiently shared across wordforms; if efficiency is over-emphasized, then structure will not be sufficiently heterogeneous across wordforms. Our research question is whether the need to balance these competing constraints poises wordform lexicons near a phase transition between states of low and high entropy. If so, then a scaling law is predicted to occur in the distributions of substrings that comprise wordform lexicons.

2.1. Scale-free wordform networks

We explain how a scaling law is predicted in the next section, but it is helpful to first point out that our prediction corresponds to what is commonly referred to as a scale-free network. To illustrate by contrast, note that word frequency distributions following Zipf's law do not involve any explicit connections among the words: only frequency counts are relevant to Zipf's law. Substring frequency distributions are different
because substring counts are related to the substring structure of wordform lexicons. For instance, each substring count for the English wordform RED corresponds to its connection with another English wordform (RED is a substring of FAIRED, REDUCE, PREDICT, and so on). These connections form a structure that can be formalized as a network (i.e., a directed graph) in which each node is a different wordform, and one node is linked to another whenever the former is a substring of the latter. A small piece of the network created from an English wordform lexicon is diagrammed in Fig 1.
Figure 1. Piece of English orthographic wordform network
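To make the construction concrete, here is a minimal sketch (our illustration, not the authors' code) of how the out-degrees of such a network can be computed from a lexicon given as a set of strings. Rather than testing all pairs of wordforms, it enumerates the distinct substrings of each wordform and looks them up in the lexicon, which scales with total string length rather than with the square of the lexicon size.

```python
from collections import defaultdict

def substring_outdegrees(lexicon):
    """For each wordform, count how many *other* wordforms contain it
    as a (position-independent) substring, i.e. its outgoing links."""
    lexset = set(lexicon)
    outdeg = defaultdict(int)
    for word in lexset:
        # All distinct substrings of this word (each counted once,
        # since the link matrix is binary).
        subs = {word[i:j]
                for i in range(len(word))
                for j in range(i + 1, len(word) + 1)}
        for s in subs:
            if s != word and s in lexset:
                outdeg[s] += 1   # directed link: s -> word
    return outdeg

deg = substring_outdegrees({"red", "fair", "faired", "fared", "far", "are"})
print(deg["red"])   # RED occurs inside FAIRED and FARED -> 2
```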
The inclusion of all substring relations among wordforms creates a densely interconnected network with a tree-like branching structure from shortest to longest wordforms. The shortest wordforms serve as the tree trunks; they have no incoming links because no wordforms are contained within them. The longest and most unique wordforms are at the branch tips; they have no outgoing links because they are not substrings of other wordforms. The progression from trunks to tips is highly correlated with wordform length, but not strictly tied to it: some longer but common substrings are more trunk-like than shorter but unusual ones (e.g., SING is more trunk-like than YO).

This wordform network is relevant to our research question because the links are directly related to the distinctiveness and efficiency of the wordform lexicon. In particular, distinctiveness increases as the number of incoming links decreases, and efficiency increases as the number of outgoing links increases. Thus wordform networks serve as tools for conceptualizing and analyzing the hypothesized distinctiveness and efficiency constraints on lexicon structure.

In terms of the network formalism, our predicted scaling law can be found in the counts (i.e., degrees) of outgoing links per node (i.e., the
number of times that a given wordform appears as a substring of another wordform in the lexicon). Rather than using the frequency rank distribution, as for Zipf's law, network link distributions are often expressed in terms of the cumulative probability distribution: the probability of choosing a wordform node at random whose number of outgoing links is ≥ k is predicted to be
P(≥ k) ≈ k^γ
where γ ≈ –1; a network whose link distribution follows such a scaling law is typically referred to as a scale-free network. The cumulative probability distribution is a popular means of expressing scale-free networks, in part because exponents can be more directly and reliably estimated from it (see Kirby, 2001).

Casting our predicted scaling law as a scale-free network is also potentially useful because scale-free networks have attracted a great deal of attention in recent years throughout the sciences. Many systems in nature and society can be represented as networks, and it turns out that such networks are often scale-free. For instance, scale-free network structures have been found in computer networks (Barabási, Albert, & Jeong, 2000; Albert, Jeong, & Barabási, 1999), business networks (Wasserman & Faust, 1994), social networks (Barabási, Jeong, Néda, Ravasz, Schubert, & Vicsek, 2002), and biological networks of various kinds (Jeong, Tombor, Albert, Oltvai, & Barabási, 2000; Solé, 2001). In the context of language, Steyvers and Tenenbaum (2005) found that semantic networks of words have scale-free structures when constructed using either behavioral or encyclopedic methods. They built one semantic network from word association data by linking any two word nodes for which one was given as an associate of the other (e.g., a participant might associate the word NURSE with DOCTOR). Two other networks were similarly built using encyclopedic methods, one based on a thesaurus and the other on an on-line encyclopedia. All three methods yielded semantic networks whose link distributions obeyed a scaling law.

Semantic networks have the connotation of spreading activation across the nodes via their links, and many other networks also entail transmission of information or materials among the nodes. However, it is important to clarify that our wordform networks do not come with an assumption of spreading activation or information transmission among wordforms. We employ the network formalism only for its structural properties.
2.2. Information theoretic analysis

To show how a scale-free wordform network is predicted by the balance of distinctiveness and efficiency constraints, we parallel Ferrer i Cancho and Solé's (2003) information theoretic analysis, which showed how Zipf's law can be predicted from Zipf's principle of least effort. We represent a wordform network as a binary matrix A = {a_ij}. Each row i represents a wordform w_i, where 1 ≤ i ≤ n and n is the number of words in the lexicon. Each column j also represents a wordform, numbered from 1 to n. Each a_ij = 1 if w_i is a substring of w_j (wordforms are treated as substrings of themselves, i.e., a_ij = 1 for all i = j), and a_ij = 0 otherwise. The probability that wordform w_i appears as a substring, relative to all other wordforms, is given by (all sums run from 1 to n)
P(w_i) = ∑_j a_ij / ∑_k ∑_l a_kl
The efficiency of a wordform lexicon is defined in terms of the entropy of the substring probability distribution,
H_n(w) = −∑_i P(w_i) log_n P(w_i)
H_n(w) = 0 when a single wordform is used for all words, and H_n(w) = 1 when all wordforms appear as substrings of other wordforms equally often (the upper bound is 1 because the log is base n). Thus a lexicon is efficient to the extent that it uses as few wordforms as possible, where "fewer" refers not only to the number of different wordforms, but also to the frequency with which each is used.

The distinctiveness of a wordform w_i is defined in terms of its diagnosticity, that is, the amount by which uncertainty about the identity of a word W is reduced given that it contains w_i. The negative of this amount can be quantified by the entropy over the probability distribution of wordforms conditioned on the presence of w_i,
H_n(W | w_i) = −∑_j P(w_j | w_i) log_n P(w_j | w_i)
H_n(W|w_i) = 1 when the presence of w_i provides no information about the identity of W, and H_n(W|w_i) = 0 when the presence of w_i assures the identity of W. Each conditional probability is given by
P(w_j | w_i) = a_ij / ∑_k a_ik
Finally, the overall distinctiveness of a wordform lexicon is defined as the average distinctiveness over wordforms (the average is used so that both H_n(w) and H_n(W|w) are normalized between 0 and 1),
H_n(W | w) = (1/n) ∑_i H_n(W | w_i)
The balancing of distinctiveness and efficiency now translates into the simultaneous minimization of H_n(w) and H_n(W|w). These constraints are in opposition to each other because H_n(W|w) = 1 when H_n(w) = 0, i.e., when a single wordform is used for all words. However, when wordforms appear as substrings equally often, H_n(w) = 1, there is no guarantee that substrings will be as diagnostic as possible, H_n(W|w) = 0, because wordforms may be equally "overused" as substrings. Thus these constraints are not isomorphs of each other. The balance between minimizing H_n(w) and minimizing H_n(W|w) is parameterized by 0 ≤ λ ≤ 1 in
Ω(λ) = λ H_n(w) + (1 − λ) H_n(W | w)

In their parallel analysis, Ferrer i Cancho and Solé (2003) created matrices A_λ that minimized Ω(λ) at numerous sampled values of 0 ≤ λ ≤ 1 (see also Ferrer i Cancho, 2006). They showed that at λ ≈ 0.4, a sharp transition existed in the values of their entropic measures, which were analogous to H_n(w) and H_n(W|w). Moreover, they found that the frequency of word usage was distributed according to Zipf's law at the transition point. Thus this point appears to be a phase transition exhibiting a scaling law.

Our analysis parallels Ferrer i Cancho and Solé's (2003) in order to make the same kind of scaling-law prediction, but in terms of substring structure in a wordform lexicon rather than word usage in communication. Thus our analysis predicts a scaling law in the distribution of outgoing links across wordform nodes, that is, it predicts a scale-free network. This scale-free network is predicted at the transition point between phases of lexicon distinctiveness versus lexicon efficiency. Languages are predicted to evolve towards this transition point because all lexicons need to distinguish between wordforms while simultaneously minimizing the resources needed.
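The three quantities above translate directly into code. The following is a minimal sketch (ours, with hypothetical variable names) that computes H_n(w), H_n(W|w), and Ω(λ) from a binary substring matrix given as a list of lists; it only makes the definitions computable and does not attempt Ferrer i Cancho and Solé's optimization over matrices.

```python
import math

def omega(A, lam):
    """Omega(lambda) = lam * H_n(w) + (1 - lam) * H_n(W|w) for a binary
    matrix A, where A[i][j] = 1 iff wordform i is a substring of
    wordform j (A[i][i] = 1 by convention, so every row sum is >= 1)."""
    n = len(A)
    total = sum(sum(row) for row in A)
    # Efficiency term: entropy of the substring distribution P(w_i).
    H_w = 0.0
    for row in A:
        p = sum(row) / total
        if p > 0:
            H_w -= p * math.log(p, n)   # log base n, so 0 <= H_w <= 1
    # Distinctiveness term: mean conditional entropy H_n(W | w_i).
    H_Ww = 0.0
    for row in A:
        row_sum = sum(row)
        h_i = sum(-(a / row_sum) * math.log(a / row_sum, n)
                  for a in row if a)
        H_Ww += h_i / n
    return lam * H_w + (1 - lam) * H_Ww

# Toy lexicon of three wordforms, where wordform 0 is a substring of
# the other two (plus the diagonal).
A = [[1, 1, 1],
     [0, 1, 0],
     [0, 0, 1]]
print(omega(A, 0.4))
```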
3. Empirical evidence for scale-free wordform networks

Our predicted scaling law is relatively straightforward to test. It simply requires the creation of wordform networks from real languages, and the examination of their link distributions for a scaling law. We begin with networks created from phonological and orthographic wordforms in English, and we then report the same analyses for four other languages.

3.1. English wordform networks

A total of 104,347 printed words and 91,606 phonetically transcribed words were drawn from the intersection of the Carnegie Mellon University pronunciation dictionary and the Wall Street Journal corpus. The letter strings comprised an orthographic wordform lexicon, and the phoneme strings were used to create two different phonological wordform lexicons, one with lexical stress markings on the vowels (primary, secondary, and tertiary) and one without stress markings. The frequency of wordform usage was not part of the wordform lexicons.

A wordform network was created for each of the three lexicons. Each node in each network corresponded to an individual wordform, and within each network one node was linked to another if the former wordform was a substring of the latter. For the stress-marked lexicon, one wordform was a substring of another only if both the phonemes and the stress markings of the former were contained in the latter. Each node i of a network had k_i outgoing links, where 1 ≤ k_i ≤ n and n is the total number of wordforms in the corresponding lexicon. As mentioned earlier, the predicted scaling law is usefully expressed in terms of the cumulative probability distribution, which is linear under a logarithmic transform (with intercept zero),
log P(≥ k) ≈ γ log k

This expression facilitates visualization and analysis of the data. Cumulative probability distributions for the three wordform networks are plotted on a log-log scale in Fig 2. Clear evidence for a scaling law can be seen in the negative linear relation between log P(≥ k) and log k.

The exponent of the scaling relation for each distribution was estimated by the slope of a linear regression line fit to the data between 1 ≤ log k ≤ 3. In theory, scaling laws range over all scales (i.e., the entire distribution), but
empirical observations rarely if ever achieve this ideal because of limited amounts of data and other practical limitations. These limitations typically show up in cumulative probability distributions as deviations from the scaling relation in the tails. These deviations are slight for the wordform networks plotted in Fig 2, but to avoid them, exponents were estimated from the middle of the distribution.
Figure 2. English wordform networks: cumulative probability distributions of outgoing links (log P(≥k) against log k), with estimated exponents γ ≈ –0.82 for orthographic wordforms, γ ≈ –1.05 for phonological wordforms with stress, and γ ≈ –0.90 for phonological wordforms without stress.
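A minimal sketch of the estimation procedure just described (our reconstruction, not the authors' code): tabulate the cumulative distribution P(≥k) over out-degrees and fit a least-squares line to its log-log form over the middle range 1 ≤ log₁₀ k ≤ 3.

```python
import math

def gamma_estimate(outdegrees, lo=1.0, hi=3.0):
    """Slope of log10 P(>=k) vs. log10 k over lo <= log10 k <= hi.
    `outdegrees` holds one count per wordform node (zeros allowed; they
    only affect the normalization). Assumes the degree range actually
    spans the fitted window."""
    ks = sorted(outdegrees)
    n = len(ks)
    pts = []
    for idx, k in enumerate(ks):
        if k < 1 or (idx > 0 and k == ks[idx - 1]):
            continue                     # skip zeros and repeated k values
        x = math.log10(k)
        if lo <= x <= hi:
            pts.append((x, math.log10((n - idx) / n)))  # P(>= k)
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    return sum((x - mx) * (y - my) for x, y in pts) / \
           sum((x - mx) ** 2 for x, _ in pts)

# Usage with the earlier sketch, padding in nodes that never occur
# as substrings of other wordforms (out-degree 0):
# deg = substring_outdegrees(lexicon)
# gamma = gamma_estimate([deg.get(w, 0) for w in lexicon])
```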
The estimated exponents are close to the canonical value of –1 for scale-free networks. The exponent estimate for the orthographic wordform network is slightly less negative than the others, indicating that it is slightly more densely interconnected (and likewise for the phonological network without stress markings versus with them). These differences in density are under investigation, but they may be partly due to differences in morphological transparency among the lexicons: in English, orthographic wordforms represent morphological structure more directly, e.g., SIGN is a substring of SIGNATURE in the orthographic network but not in the phonological networks. Differences aside, the results generally confirm the predicted scaling law.
3.2. Wordform networks in other languages

The same wordform network analyses were also conducted on orthographic wordform lexicons for Dutch, German, Russian, and Spanish. These particular lexicons were chosen only because they were readily analyzable and downloadable at ftp://ftp.ox.ac.uk/pub/wordlists. These languages represent a sample of the Indo-European language family. In terms of their morphological structure, they are mostly characterized as synthetic languages (i.e., languages with high morpheme-to-word ratios). Comparing these languages to English, which is more of an isolating language (i.e., one with a low morpheme-to-word ratio), provides an initial gauge of the degree to which language type influences the results of our analyses.
Figure 3. Other wordform networks: cumulative link probability distributions (axes as in Figure 2) for Dutch (178,339 wordforms, γ ≈ –1.01), German (159,102 wordforms, γ ≈ –1.06), Russian (31,801 wordforms, γ ≈ –1.07), and Spanish (85,947 wordforms, γ ≈ –0.91).
The cumulative probability distribution for each wordform network is plotted in Fig 3 with the lexicon size and the estimated scaling exponent (the axes are the same as in Fig 2). All four languages show evidence of a scaling relation in the center of their link distributions with estimated exponents near –1. Estimates varied slightly across languages, as did the amount of deviation in the tails of the distributions.
The wordform statistics of the orthographic lexicons are reported in Table 1. N is the number of wordforms analyzed, M and SD are the mean and standard deviation of wordform lengths, respectively, and γ is the estimated scaling exponent of the link distributions. Evidence for the isolating quality of English morphology is reflected in its shorter mean wordform length compared with the other, more synthetic languages (fewer and smaller morpheme combinations). The slightly less negative scaling exponent for English may be due to its isolating quality, but the cross-linguistic differences in γ may also be due to idiosyncrasies in the corpora used. For our current purposes, it is sufficient that all the languages exhibit a scaling law as predicted.

Table 1. Summary statistics for orthographic lexicons
Language   N         M      SD     γ
English    104,347   7.3    2.3    –0.82
Dutch      178,339   10.2   3.0    –1.01
German     159,102   11.9   3.5    –1.06
Russian    31,801    8.1    2.4    –1.07
Spanish    85,947    8.9    2.5    –0.91
3.3. Ruling out an artifactual explanation

Altogether, our network analyses appear to provide considerable evidence for the scaling law predicted to arise from the balance of distinctiveness and efficiency constraints on the structure of wordform lexicons. Before coming to this conclusion, however, we must first determine whether these results are an inevitable, and therefore trivial, property of wordform networks created from substring relations. In particular, it may be that lexicons composed of variable-length random letter strings also produce the predicted scaling law. This may seem possible because, even for random letter strings, shorter wordforms will tend to have more outgoing links than longer wordforms, and the longest wordforms will have no outgoing links at all. Thus variation in wordform length alone might be sufficient to create the predicted scaling law.

We tested this artifactual explanation by creating a wordform lexicon composed of random letter strings, using essentially the same method as Mandelbrot (1953), Miller (1957), and Li (1992). Each wordform
was incrementally built up by repeatedly adding a letter with probability p = 0.82, or completing the wordform with probability 1 – p = 0.18. Each letter was chosen at random with equal probability, and the completion probability was chosen so that the average wordform length would match that of our corpus of English orthographic wordforms. A total of 104,347 random wordforms were created, which is the size of our English orthographic wordform lexicon.

The cumulative probability distribution for the random wordform network is plotted in Fig 4. The graph shows that the distribution does not at all resemble the scaling relation observed for the English orthographic wordform network, whose distribution is also plotted for comparison. Instead of a scaling relation, the random wordforms yielded a tiered distribution, indicative of characteristic numbers of outgoing links per node. For instance, the majority of nodes had only one or a few outgoing links, but a second large group of nodes had 30-35 links, and hardly any nodes had between 6 and 18 links. Five other random lexicons were generated, and each one resulted in a similarly tiered distribution.
Figure 4. Artificial wordform networks: cumulative link probability distributions (axes as in Figure 2) for English orthography, random wordforms, and bigram wordforms.
The failure of random wordform lexicons to yield a scaling relation shows that our results with real lexicons were not an artifact of length variability in wordform lexicons. It therefore appears that the observed scaling relations reflect a property of the structural relations among wordforms in natural languages. To provide further support for this conclusion, we tried to recreate the scaling relation with an artificial wordform lexicon built from the bigram frequencies of English orthography. Wordforms were again
built up incrementally, except that the probability of each letter being chosen was conditioned on the previous letter, with the conditional probabilities estimated from the Wall Street Journal corpus. So, for instance, if the letter Q happened to be chosen as the first letter of a given wordform, there was a 97% chance that the second letter would be U. This method created a wordform lexicon that mimicked the bigram statistics of English wordforms.

The cumulative probability distribution for the bigram wordform network is also plotted in Fig 4. This distribution is much closer to the predicted scaling law, in that the tiers are gone and the slope of the overall descent is near –1. However, there is a "bump" over most of the center of the distribution that deviates from the nearly perfect linear relation of the English wordform network. This result indicates that the statistical structure of English wordforms did, in fact, play a role in the observed scaling relation. It also suggests, however, that not all relevant aspects of wordform structure are captured by bigram frequencies, because the scaling relation was not entirely recovered. Work is underway to determine whether more of the scaling relation can be recovered with artificial lexicons that more closely mimic the statistical structure of English.
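For readers who want to reproduce these baselines, here is a minimal sketch of the two generation procedures as we read them. Two assumptions are ours, not the text's: the first letter is added unconditionally so that wordforms are non-empty, and the hypothetical `first_probs`/`next_probs` tables would be estimated from a corpus, as the text does with the Wall Street Journal.

```python
import random
import string

def random_wordform(p_continue=0.82, alphabet=string.ascii_lowercase):
    """Monkey-typing wordform: after an initial letter, keep adding
    uniformly random letters with probability p_continue."""
    letters = [random.choice(alphabet)]
    while random.random() < p_continue:
        letters.append(random.choice(alphabet))
    return "".join(letters)

def bigram_wordform(first_probs, next_probs, p_continue=0.82):
    """Same stopping rule, but each added letter is drawn conditioned
    on the previous one. `first_probs` maps letters to probabilities;
    `next_probs` maps a letter to a dict of next-letter probabilities."""
    dist = first_probs
    letters = [random.choices(list(dist), weights=list(dist.values()))[0]]
    while random.random() < p_continue:
        dist = next_probs[letters[-1]]
        letters.append(random.choices(list(dist), weights=list(dist.values()))[0])
    return "".join(letters)

# A random lexicon of the size used in the text (the set quietly drops
# any duplicate strings that the sampling happens to produce).
lexicon = {random_wordform() for _ in range(104347)}
```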
4. Conclusions

In this chapter, theories of criticality were used to predict a heretofore unexamined scaling law in the structure of phonological and orthographic wordform lexicons. Evidence for the predicted scaling law was found in the wordforms of five different languages, and analyses of artificial lexicons showed that the scaling law is not artifactual. The law is hypothesized to emerge from the balance of two competing constraints on the evolution of wordform lexicons: lexicons must be as distinctive as possible, minimizing substring overlap among wordforms, while also being as efficient as possible, reusing substrings as much as possible. A phase transition is hypothesized at the balance of these high and low entropy phases, respectively, and empirical and theoretical work on critical phenomena predicts a scaling-law distribution near such a phase transition.

The predicted scaling law was expressed in terms of scale-free networks in which wordforms were connected whenever one was a substring of another. In general, some of these substring links reflect the linguistic structure that underlies wordforms. For instance, root morphemes
like FORM will often be substrings of their inflected and derived forms, like FORMED and FORMATION, respectively. Also, monosyllabic wordforms like /fit/ are substrings of multisyllabic wordforms like /dɪ'fit/. However, substring relations do not always respect linguistic structure, and not all linguistic structure is reflected in substring relations. For instance, LAND is a substring of BLAND even though there is no morphological relation between them, and /ɹid/ is not a substring of /ɹɛd/ even though the latter verb is the past tense of the former.

This partial correspondence between our wordform networks and linguistic structure makes their relationship unclear. Substring relations among wordforms fall within the purview of linguistics, but they do not appear to have a place in current linguistic theories. Nonetheless, the observed scaling relations are lawful and non-trivial, as we have argued, and may be universal as well. If so, then it may prove informative to investigate whether and how scale-free wordform networks can be accommodated by linguistic theory. For instance, some phonological processes may fit with our explanation of scale-free wordform networks. Processes like assimilation, elision, syncope, and apocope may generally help to make wordform lexicons more efficient by creating more overlap among wordforms, whereas processes like dissimilation, epenthesis, and prothesis may help to make wordform lexicons more distinctive by creating less overlap among wordforms.

Finally, similar ideas have been explored in Lindblom's Theory of Adaptive Dispersion (Lindblom, 1986; Lindblom, 1990) and in Ohala's Maximum Use of Available Features (Ohala, 1980). It has been proposed that the phonological contrasts of a language are chosen to simultaneously 1) maximize the distinctiveness of contrasts, and 2) minimize articulatory effort. Constraint 1 is analogous to distinctiveness as we have defined it, except that phonological contrasts are more fine-grained than substrings. Constraint 2 stands in opposition to Constraint 1, and phonological systems must strike a balance between these opposing constraints, analogous to how lexicons must strike a balance between distinctiveness and efficiency. Flemming (2004) proposed a third constraint, on maximizing the number of contrasts, which was intended to ensure lexicons of sufficient size without excessively long words. Flemming's constraint is useful for bridging the gap between phonological inventories and lexicon efficiency.

The similarities between our theory and theories like Lindblom's and Flemming's suggest possible avenues of fruitful exchange. In one direction,
for instance, Lindblom’s theory may benefit from principles of critical phenomena. In the other direction, our analysis may benefit from the inclusion of articulatory effort and distinctiveness among sounds, which clearly have important influences on the structure of wordforms. Such theoretical exchanges exemplify the kind of transdisciplinary work that is currently going on throughout the complexity sciences.
References

Albert, R., Jeong, H., & Barabási, A. L. 1999. Diameter of the World Wide Web. Nature, 401, 130.
Bak, P., & Paczuski, M. 1995. Complexity, contingency, and criticality. Proceedings of the National Academy of Sciences, 92, 6689-6696.
Barabási, A., Albert, R., & Jeong, H. 1999. Mean-field theory for scale-free random networks. Physica A, 272(1), 173-187.
Barabási, A., Albert, R., & Jeong, H. 2000. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281(1-4), 69-77.
Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., & Vicsek, T. 2002. Evolution of the social network of scientific collaborations. Physica A, 311(3-4), 590-614.
Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
Ferrer i Cancho, R. 2006. When language breaks into pieces: A conflict between communication through isolated signals and language. BioSystems, 84, 242-253.
Ferrer i Cancho, R., & Solé, R. V. 2003. Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences, 100(3), 788-791.
Flemming, E. 2004. Contrast and perceptual distinctiveness. In B. Hayes, R. Kirchner & D. Steriade (Eds.), Phonetically Based Phonology (pp. 232-276). Cambridge: Cambridge University Press.
Huang, K. 1963. Statistical Mechanics. New York: Wiley.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z., & Barabási, A. 2000. The large-scale organization of metabolic networks. Nature, 407(6804), 651-654.
Kirby, S. 2001. Spontaneous evolution of linguistic structure: An iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, 5(2), 102-110.
Li, W. 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842-1845.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. 1967. Perception of the speech code. Psychological Review, 74(6), 431-461.
Lindblom, B. 1986. Phonetic universals in vowel systems. In J. J. Ohala & J. J. Jaeger (Eds.), Experimental Phonology. Orlando, FL: Academic Press.
Lindblom, B. 1990. Phonetic content in phonology. PERILUS, 11.
Ma, S. 1976. Modern Theory of Critical Phenomena. Reading, MA: Benjamin/Cummings.
Mandelbrot, B. 1953. An informational theory of the statistical structure of language. In W. Jackson (Ed.), Communication Theory. London: Butterworths.
Miller, G. 1957. Some effects of intermittent silence. American Journal of Psychology, 70, 311-314.
Morton, J. 1969. Interaction of information in word recognition. Psychological Review, 76(2), 165-178.
Newman, M. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5), 323-351.
Ohala, J. J. 1980. Chairman's introduction to symposium on phonetic universals in phonological systems and their explanation. In Proceedings of the Ninth International Congress of Phonetic Sciences, 1979 (pp. 184-185). Copenhagen: Institute of Phonetics, University of Copenhagen.
Oudeyer, P. 2005. The self-organization of speech sounds. Journal of Theoretical Biology, 233(3), 435-439.
Solé, R. V. 2001. Complexity and fragility in ecological networks. Proceedings of the Royal Society B: Biological Sciences, 268(1480), 2039-2045.
Sornette, D. 2004. Critical Phenomena in Natural Sciences: Chaos, Fractals, Self-organization, and Disorder: Concepts and Tools (2nd ed.). Berlin/New York: Springer.
Steyvers, M., & Tenenbaum, J. B. 2005. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41-78.
Tsonis, A., Schultz, C., & Tsonis, P. 1997. Zipf's law and the structure and evolution of languages. Complexity, 2(5), 12-13.
Wasserman, S., & Faust, K. 1994. Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.
Zipf, G. K. 1949. Human Behaviour and the Principle of Least Effort. New York: Hafner.
Part 3: Phonological representations in the light of complex adaptive systems
The dynamical approach to speech perception: From fine phonetic detail to abstract phonological categories

Noël Nguyen, Sophie Wauquier, and Betty Tuller
Much research has been devoted to exploring the representations and processes employed by listeners in the perception of speech. There is in that domain a longstanding debate between two opposing approaches. Abstractionist models, on the one hand, are based on the assumption that an abstract and speaker-independent phonological representation is associated with each word in the listener's mental lexicon. In exemplar models of speech perception, on the other hand, words and frequently used grammatical constructions are represented in memory as large sets of exemplars containing fine phonetic information. In the present paper, the opposition between abstractionist and exemplar models will be discussed in light of recent experimental findings that call it into question. In keeping with recent proposals made by other researchers, we will argue that both fine phonetic detail and abstract phonological categories are likely to play an important role in speech perception. We will then outline a novel, hybrid approach that aims to move beyond the abstraction vs. exemplar dichotomy and that draws on the theory of nonlinear dynamical systems, as applied to the perception of speech by Tuller et al. (1994).
1. Two approaches to the perception of speech

In spite of the huge variability shown by the speech signal both within and across speakers, listeners are in most situations able to identify spoken words and access their meaning effortlessly. A major challenge in studies of speech perception is to understand how such variations in the pronunciation of words are dealt with by listeners in the sound-to-meaning mapping. There is still much disagreement about the nature of the processes and representations that this mapping may involve.

According to proponents of the highly influential abstractionist approach, the speech signal is converted by listeners into a set of
context-independent abstract phonological units, in such a way that variations in the acoustic instantiation of a given word are factored out at an early stage of processing (e.g., Fitzpatrick & Wheeldon, 2000; Lahiri & Reetz, 2002; Stevens, 2002). It is often hypothesized that inter-individual variations are removed prior to building up this abstract representation, by means of a speaker normalization procedure (see Johnson 2005b for a recent review). A clear demarcation is established in that framework between the surface phonetic form of a word and the underlying phonological representation associated with that word. The abstractionist approach contends that the phonological representation for each word is both unique and permanently stored in memory. Readers are referred to Cutler et al. (to appear), Eulitz & Lahiri (2004), Lahiri & Reetz (2002), Pallier et al. (2001), and Stevens (2002) for recent important papers in the abstractionist framework.

Abstractionist models such as Lahiri and Reetz's (2002) Featurally Underspecified Lexicon (FUL) model offer a representation-based solution to the speech variability problem (see discussion in Pitt & Johnson, 2003). In FUL, phonological representations in the lexicon are underspecified for certain features such as [coronal], and listeners are insensitive to surface variations shown by words for these features. In this way, assimilation of word-final coronals (such as /n/ in the word green) to the place of articulation of the following consonant (as in green bag, phonetically realized as [ɡɹiːm bæɡ]) is considered to remain consistent with the underspecified phonological representation of the carrier word. Because of this so-called no-mismatch relationship between the word's surface and underlying forms, assimilation is expected to have no disruptive effect on the identification of the word. Other abstractionist models rely on processing, rather than representations, and assume that recognizing the assimilated variant of a word entails recovering the word's unassimilated shape through phonological inference.

Although exemplar models are also representation-oriented, they differ from abstractionist models along many important dimensions. A major difference between the two approaches is that prototypical exemplar models view each exemplar as corresponding to a language chunk that is stored in memory along with all the details specific to the particular circumstances in which it has been produced or encountered. These include sensory-motor, semantic and pragmatic characteristics, but also indexical information about the speaker's identity and the situation of occurrence, to mention but a few properties. Exemplars are therefore deeply anchored in their context of occurrence in the largest possible sense, and this has drastic
implications for how spoken language may be represented in the brain. In some exemplar-based theories, this is at variance with viewing linguistic utterances as being built from a pre-defined set of context-independent phonological primitives. In a radical departure from this widely accepted combinatorial view of language, Bybee & McClelland (2005) extend sensitivity to context and non-uniformity to all levels of linguistic analysis, and go as far as to claim that "there is no analysis into units at any level or set of levels that will ever successfully and completely capture the realities of synchronic structure or provide a framework in which to capture language change".

As pointed out by Johnson (2005a), the exemplar approach to sensory memory has been well established in cognitive psychology for more than a century. Goldinger (1996, 1998) and Johnson (1997b) drew on this general theoretical framework to develop explicit exemplar-based models of speech perception (see Pierrehumbert, 2006, for a historical overview), although these models have a number of prominent precursors: Klatt's (1979) Lexical Access from Spectra model and Elman's (1990) Recurrent Neural Network, for example, were also based on the assumption that lexical representations are phonetically rich. Exemplar models today present a major alternative to the better-established abstractionist approach, with far-reaching implications not only for phonetics and phonology, but also, more generally, for our understanding of language structure and language use (e.g., Barlow & Kemmer, 2000). In their current form, exemplar models of speech perception also raise a number of important theoretical and empirical questions, and it is on some of these questions that we focus in the following section.
2. Fine phonetic detail and abstract phonological categories in speech perception

In recent years, research has increasingly focused on the listener's sensitivity to properties of the speech signal that are generically referred to as "fine phonetic detail" (hereafter, FPD). This research suggests that FPD has a significant impact on speech perception and understanding, in some circumstances at least. FPD includes allophonic variation, sometimes specific to certain words or classes of words (Pierrehumbert, 2002), as well as sociophonetic variation, broadly construed as being associated with the speaker's identity, age, gender, and social category (Foulkes & Docherty, 2006).
Fine phonetic detail is designated as such in the sense that it is to be distinguished from the local and most perceptually prominent cues associated with phonemic contrasts in the speech signal (Hawkins, to appear). It may, however, encompass substantial acoustic variations, such as those that distinguish male from female voices. FPD is therefore identified as "detail" only with respect to a specific theoretical viewpoint, namely the traditional segmental approach to speech perception and production. By a form of antiphrasis, FPD refers to phonetic properties that are judged non-essential for the identification of speech sounds within a theoretical framework whose limits the exemplar approach endeavors to demonstrate. The goal of current research on FPD is to show that FPD is important in speech perception and, therefore, that a change of theoretical perspective is warranted.

Recent studies on the role of FPD in spoken word recognition have provided evidence that perceptually relevant allophonic variation includes vowel-consonant acoustic transitions (e.g., Marslen-Wilson & Warren, 1994), within-category variations in voice onset time (Allen & Miller, 2004; Andruski et al., 1994; McMurray et al., 2002, 2003), long-domain resonance effects associated with liquids (West, 1999), and graded assimilation of place of articulation in word-final coronals (e.g., Gaskell, 2003; the studies cited here were conducted on either American or British English). To a certain extent, however, the fact that listeners are sensitive to allophonic variation was established much earlier. For example, studies conducted in the 1970s and 1980s consistently showed that coarticulation between neighboring segments provides listeners with perceptually relevant cues to segment identity (and, by extension, to word recognition). A well-known example is regressive vowel-to-vowel coarticulation in English, which allows the identity of the second vowel to be partly predicted from the acoustic cues associated with it in the first vowel (Martin & Bunnell, 1981, 1982). It has also been repeatedly demonstrated that word recognition is sensitive to the individual characteristics of the speaker's voice (e.g., Goldinger, 1996, 1998). Although allophonic and between-speaker variation are often lumped together under the generic term FPD, recent evidence suggests that these two types of phonetic variation are dealt with in different ways by listeners (Luce & McLennan, 2005).

The phonetics of conversational interaction is another area in which evidence for the role of FPD in speech perception is growing (e.g., Couper-Kuhlen & Ford, 2004). A major finding of these studies is the tendency shown by participants in a conversation to imitate each other. Imitation seems to occur at every level of the conversational exchange, including the
phonetic level (Giles et al., 1991). For example, Pardo (2006) had different talkers produce the same lexical items before, during, and after a conversational interaction, and found that perceived similarity in pronunciation between talkers increased over the course of the interaction and persisted beyond its conclusion. Phonetic imitation, or phonetic convergence, is a mechanism that may be actively employed by talkers to facilitate conversational interaction. It has also been suggested that phonetic imitation plays an important role in phonological and speech development (Goldstein, 2003) and is rooted in the ability that human neonates have to imitate facial gestures (Meltzoff & Moore, 1997). In addition, phonetic imitation has been proposed in recent work as one of the key mechanisms that underlie the emergence and evolution of speech sound systems (e.g., de Boer, 2000). The behavioral tendency shown by humans to imitate others may be connected at the brain level with the presence of mirror neurons, whose role in the production, perception and acquisition of speech now seems well established (Studdert-Kennedy, 2002; Vihman, 2002; Jarick & Jones, 2008). Crucially for the present paper, phonetic convergence demonstrates that listeners are sensitive to speaker-dependent phonetic characteristics, which influence both the dynamics of conversational interaction and, over a longer time range, the representations associated with words in memory after the interaction has ended. Such sensitivity to context in listeners has led researchers such as Tuller and her colleagues to contend that speech perception studies should focus on the listener's individual behavior in its situation of occurrence, as opposed to abstract linguistic entities (Case et al., 2003; see also Tuller, 2004).

It is important to point out that, by itself, the fact that listeners are sensitive to FPD is not inconsistent with abstractionist models of speech perception. For example, Stevens (2004) contends that in addition to what he refers to as the defining articulatory/acoustic attributes associated with distinctive features (e.g., the spectrum of the release burst of stop consonants), the phonetic implementation of these features involves so-called language-specific enhancing gestures, which strengthen the features' perceptual salience (e.g., tongue-body positioning for tongue-blade stop consonants, see Stevens, 2004). Although enhancing gestures can be regarded as FPD in some models, and hence non-essential for the identification of speech sounds, they are attributed an important role in Stevens' abstractionist model of lexical access (Stevens, 2002). Likewise, the TRACE model of speech perception (McClelland & Elman, 1986) relies on the assumption that fine-grained acoustic properties may have an impact on word
recognition, although TRACE too may be regarded as an abstractionist model (it contains an infralexical phonemic level of processing and, at the lexical level, each word is represented by a single processing unit). TRACE accounts for part of the listener's sensitivity to FPD by modelling the flow of acoustic information within the speech processing system by means of a set of continuous parameters. It is also designed to explain how fine-grained coarticulatory variation is taken into account in the on-line identification of phonemes (Elman & McClelland, 1988). Thus, the assumption that FPD has a role to play in speech perception is not specific to exemplar models and is also found in at least some abstractionist models. Exemplar models do diverge from abstractionist models, however, in assuming that in addition to being relevant to on-line speech perception and understanding, FPD is stored in long-term memory. Thus, in exemplar models, lexical representations are phonetically rich.

To our knowledge, much of the available evidence for long-term storage of FPD in the mental lexicon comes from studies of speech production. As shown by Bybee (2001, 2006a) and Pierrehumbert (2001, 2002), frequency-dependent differences in the phonetic realization of words that meet the same structural description (e.g., words underlyingly containing a schwa followed by a sonorant, such as the high-frequency word every [evɹɪ], produced with no schwa, compared with the mid-frequency word memory [memɹ̩ɪ], produced with a syllabic /r/)¹ must be learned and stored in the mental lexicon by speakers in the course of language acquisition. This is also true of sociophonetic variation, which has to be learned inasmuch as the relationship between phonetic forms and social categories is arbitrary (Foulkes & Docherty, 2006). Because these sometimes subtle patterns of phonetic variation have to be detected by the speaker, either explicitly or implicitly, before she/he is able to reproduce them, these studies lend strong albeit indirect support to the assumption that perceived FPD is stored in the lexicon. More direct evidence is available from a variety of sources. In a well-known series of experiments, Goldinger (1996, 1998) showed that prior exposure to a speaker's voice facilitates later recognition of words spoken by the same speaker compared with a different speaker. Strand (2000) found that listeners respond more slowly to nonstereotypical male and female voices than to stereotypical voices in a speeded naming task. These studies suggest that the individual characteristics of the speaker's voice, as well as the acoustic/phonetic properties associated with the speaker's gender, are retained in memory by listeners. Both Johnson (1997b) and Goldinger (1998) consider that direct storage of FPD in the lexicon allows
listeners to deal with between-speaker variations in the production of words without having to resort to a normalization procedure (but see Mitterer, 2006, for experimental counterevidence).

Little is known yet about the possible forms of exemplars stored in memory. A survey of the relevant literature indicates that exemplars are generally considered as multimodal sensory-motor representations of language chunks of various sizes (more on this later), and can therefore be characterized in a general way as being a) non-symbolic, b) parametric, c) in a relationship of intrinsic similarity with the input speech signal, and d) abstract, to a certain extent, since the auditory trace of speech fades away after about 400 ms (Pardo & Remez, 2007). Because of the limits of our current knowledge in that domain, exemplar models of speech production and perception can be, in a paradoxical way, far more abstract than they purport to be. In these models, exemplars are sometimes represented in a highly schematized form which bears little resemblance to the fine-grained acoustic structure of speech. Much research still needs to be done to characterize the representation of exemplars in the listener's brain. Exemplar theories also differ among themselves in important ways. For example, Hintzman's (1986) rationalist approach to memory may be incompatible with the neo-empiricist viewpoint advocated by Coleman (2002)². Importantly, there is a lack of consensus among proponents of the exemplar approach with regard to the status of phonological representations in speech perception. In some models, such as Johnson's (1997a, 2005a) XMOD, exemplars have no internal structure, and are conceived as unanalyzed auditory representations associated with whole words. This, however, does not mean that sublexical units such as segments and syllabic constituents cannot have psychological reality. Although it is assumed that such units are not explicitly represented in memory, they can nevertheless be brought to the listener's consciousness as the speech signal is mapped onto the lexicon. These units temporarily emerge as a by-product of lexical activation, as connections between time-aligned, phonetically-similar portions of exemplars are established. In this framework, listeners are assumed to be simultaneously sensitive to units of different sizes in the speech signal, albeit with a natural bias for larger units to prevail over smaller ones (Grossberg, 2003). What may be viewed as a phonological structure, with a certain degree of abstraction, is therefore built up by the listener in the on-line processing of speech, although this structure is said to be but “a fleeting phenomenon – emerging and disappearing as words are recognized” (Johnson, 1997a). Other researchers (Hawkins, 2003, 2007; Luce &
McLennan, 2005; McLennan & Luce, 2005; Pierrehumbert, 2006) have proposed a hybrid approach in which exemplars are encoded in memory in conjunction with permanently-stored abstract phonological representations. In Hawkins' POLYSP model of speech perception and understanding (Hawkins & Smith, 2001; Hawkins, 2003, to appear), for example, FPD is mapped onto abstract prosodic structures as characterized in the Firthian Prosodic Analysis phonological framework.

The whole-word exemplar hypothesis, as it is adopted in some models of speech perception (e.g., Goldinger, 1998; Johnson, 1997b), raises a number of issues that have been highlighted by different authors. First, it is not always clear why words should indeed be postulated as basic units of processing and storage. It seems more likely that fragments of speech of many different sizes would empirically come to the surface in the utterances to which listeners are exposed over the course of their lives. If the logic that governs the exemplar approach is to be fully followed, one should assume that word sequences of high frequency, such as I don't know, should be stored as single units in what then becomes a highly extended mental lexicon (see Bybee, 2001, 2006b). Second, the whole-word exemplar hypothesis in perception is inconsistent with what Pierrehumbert (2006) refers to as the phonological principle, i.e., “that languages have basic building blocks, which are not meaningful in themselves, but which combine in different ways to make meaningful forms”, as shown by the fact that classically defined allophonic rules are found to apply to a large majority of words sharing the same structural description, even if they may not extend to all of these words. Pierrehumbert (2006) also points out that it is difficult to see how whole-word exemplar models can account for the bistable character of speech perception, i.e., the fact that an ambiguous speech sound potentially associated with two categories will be perceived as a member of one and only one of these categories at any one time (since such response patterns seem to rely on a winner-take-all competition between two underlying abstract units). There is a well-known and extensive body of evidence in support of the idea that infralexical phonological representations come into play in spoken word recognition (e.g., Cutler et al., to appear; Lahiri & Marslen-Wilson, 1991; Lahiri & Reetz, 2002; Pallier, 2000). In addition, numerous experimental studies (e.g., Lahiri & Reetz, 2002, among others) have shown that, in some circumstances at least, abstract phonological categories seem to prevail over fine phonetic detail in the mapping of speech sounds onto meaning³. In the following section, we focus on two studies that we
and our colleagues recently carried out and whose results also point to a role for abstract phonological representations in speech perception.
3. Further evidence for the role of abstract phonological representations in speech perception

Dufour et al. (2007) examined the influence that regional differences in the phonemic inventory of French may have on how spoken words are recognized. Whereas the phonemic system of standard French is traditionally characterized as containing three mid vowel pairs, namely /e/-/ɛ/, /ø/-/œ/, and /o/-/ɔ/, as in épée /epe/ “sword” vs. épais /epɛ/ “thick”, and côte /kot/ “hill” vs. cote /kɔt/ “rating”, southern French is viewed as having three mid-high vowel phonemes only, /e/, /ø/ and /o/ (Durand, 1990). [ɛ], [œ] and [ɔ] appear at the phonetic level, but they are in complementary distribution with respect to the corresponding mid-high variants, according to a variant of the so-called loi de position (a mid-vowel phoneme is realized as mid-high in an open syllable and as mid-low in closed syllables and whenever the next syllable contains schwa, see Durand, 1990). Thus, in southern French, épée and épais will both be pronounced [epe], and côte and cote will both be pronounced [kɔtə]. Dufour et al. (2007) asked how words such as épée, épais, côte and cote, as produced by a speaker of standard French, i.e., with a contrast in vowel height in the word-final syllable, were perceived by speakers of both standard and southern French. Using a lexical decision task combined with a long-lag repetition priming paradigm, Dufour et al. found that pairs of words ending in a front mid vowel (e.g., épée-épais) were not processed in the same way by the two groups of subjects. Standard French speakers perceived the two words as being different from each other, as expected, whereas southern French speakers treated one word as a repetition of the other. By contrast, both groups of subjects perceived the two members of /o/-/ɔ/ word pairs as different from each other. Thus, the results showed that there are within-language differences in how isolated words are processed, depending on the listener's regional accent. Note that southern speakers are far from being unfamiliar with standard French. On the contrary, they are widely exposed to it through the media and at school in particular. According to Dufour et al., the observed response patterns for southern speakers may be accounted for by assuming that the /o/-/ɔ/ contrast is better defined than the /e/-/ɛ/ contrast in these speakers' receptive phonological knowledge of standard French. The /o/-/ɔ/
contrast is a well-established and highly recognizable feature of standard French, which is as such well known to southern speakers, even if this contrast is neutralized in these speakers' dialect. By comparison, the distribution of /e/ and /ɛ/ in word-final position in standard French is characterized by greater complexity both across and within speakers, and there is evidence showing that word-final /e/ and /ɛ/ are in the process of merging in Parisian French (although the speaker used in Dufour et al.'s study did make the distinction between the two vowels, as confirmed by the fact that standard French subjects did not process the second carrier word as being a repetition of the first one). Dufour et al. hypothesized that because of the unstable status of the /e/-/ɛ/ contrast, both vowels were perceptually assimilated to the same abstract phonological category by speakers of southern French. Thus, Dufour et al.'s results suggest that abstract phonological representations are brought into play by listeners in spoken word recognition.

Nguyen et al. (2007b) recently undertook a study on the perceptual processing of liaison consonants in French. Liaison in French is a well-known phenomenon of external sandhi that refers to the appearance of a consonant at the juncture of two words, when the second word begins with a vowel, e.g. un [œ̃] + enfant [ɑ̃fɑ̃] → [œ̃nɑ̃fɑ̃] “a child”, petit [pəti] + ami [ami] → [pətitami] “little friend”. In earlier work, Wauquier-Gravelines (1996) showed that listeners found it more difficult to detect a target phoneme (e.g., /n/) in a carrier phrase when that phoneme was a liaison consonant (son avion [sɔ̃navjɔ̃] “her plane”) compared with a word-initial consonant (son navire [sɔ̃naviʁ] “her ship”). The proportion of correct detections proved significantly lower for the liaison than for the word-initial target consonant. According to Wauquier-Gravelines, the listeners' response pattern was attributable to the specific phonological status that liaison consonants have in French. More particularly, in the autosegmental phonology framework (Encrevé, 1988) espoused by Wauquier-Gravelines, liaison consonants are treated as floating segments with respect to both the skeletal and syllabic tiers, as opposed to fixed segments, which are lexically anchored to a skeletal slot, and which include word-initial consonants, but also word-final (e.g., /n/ in la bonne [labɔn] “the maid”) and word-internal (e.g., /n/ in le sénat [ləsena] “the senate”) ones. Using a speeded phoneme detection task, Nguyen et al. (2007b) aimed to confirm that detecting liaison consonants in speech is difficult. They examined to what extent differences in the detection rate of liaison consonants vs. word-initial consonants could, at least in part, stem from the phonetic properties of these consonants, by systematically manipulating these properties. The potentially distinctive
status of liaison consonants compared with fixed consonants in perception was further explored by inserting word-final and word-medial fixed consonants as well as word-initial ones in the material⁴. The results first showed that the percentage of correct detections systematically varied depending on the position of the target consonant in the carrier sentence: listeners tended to miss liaison consonants more often than fixed consonants, whether these were in word-initial, word-final or word-medial position. Second, manipulating the liaison consonants' and word-initial consonants' fine phonetic properties had no measurable influence on how accurately these consonants were detected by listeners. Nguyen et al. (2007b) pointed out that the listeners' response pattern was partly consistent with an exemplar-based theory of French liaison such as the one proposed by Bybee (2001). In this approach, liaison consonants are deeply entrenched in specific grammatical constructions, and the realization of liaison is highly conditioned by the strength of the associations between words within such constructions. Although little is said in Bybee's theory about how liaison consonants may be processed in speech understanding, a prediction that may be derived from this theory is that listeners will process liaison consonants as part and parcel of the constructions in which these consonants appear. As a result, it may be difficult for listeners to identify liaison consonants as context-independent phonemic units, as explicitly required in a phoneme-detection task. This, however, should be true for all the segments a construction may contain. On this account, listeners should not experience less difficulty in detecting a word-initial consonant compared with a liaison consonant, when these consonants appear in word sequences that are highly similar to each other with respect to their morpho-syntactic and phonetic make-up, as was the case in Nguyen et al.'s (2007b) material. The lower detection rates observed for liaison than for word-initial target consonants were consistent with the assumption that liaison consonants have a specific phonological status and, to that extent, provided better evidence for the abstractionist autosegmental account of liaison than for the exemplar-based account. Readers are referred to Nguyen et al. (2007b: 8–21) for further detail about the experiment and its potential theoretical implications.
4. The dynamical view of speech perception: Beyond the exemplars vs. abstractions dichotomy?

The above review of the literature suggests that the dichotomy that is sometimes erected between exemplar-based and abstractionist approaches to speech perception may be to a large extent artificial. Experimental evidence is available that provides support for the role of both fine phonetic detail and abstract phonological categories in speech perception. The recent development of so-called hybrid models (Hawkins, 2003, to appear; Luce & McLennan, 2005; McLennan & Luce, 2005; Pierrehumbert, 2006) is governed by the assumption that FPD and abstract phonological categories combine with each other in the representations associated with words in memory.

Over the last decade, Tuller and colleagues (Case et al., 1995; Tuller et al., 1994; Tuller, 2003, 2004) have developed a model that shares some of the characteristics of the hybrid approach. This model, referred to as the TCDK model hereafter, uses concepts from the theory of nonlinear dynamical systems to account for the mechanisms involved in the categorization of speech sounds. In this model, there are two complementary aspects to speech perception. On the one hand, speech perception is assumed to be a highly context-dependent process, sensitive to the detailed acoustic structure of the speech input. On the other hand, it is viewed as a nonlinear dynamical system characterized by a limited number of stable states, or attractors, which allow the system to perform a discretization of perceptual space and which are associated with abstract perceptual categories. In this section, after a brief and schematic presentation of the TCDK model, we report the results of a study recently conducted on the categorization of speech sounds in French with a view to testing some of the model's predictions. The implications of the model for the exemplar vs. abstraction debate will then be discussed.

The model was first designed to account for listeners' response patterns in a binary-choice speech categorization task. In the experiments reported in Tuller et al. (1994), listeners were presented with stimuli ranging along an acoustic continuum between say and stay, and their task was to identify each stimulus as one of these two words. Listeners' responses were modeled using a nonlinear dynamical system characterized as follows:

V(x) = kx – x²/2 + x⁴/4
In this equation, x represents the perceptual form (in this case, say or stay), k a control parameter, and V(x) a potential function which may have up to two stable perceptual forms, indicated by minima in the potential function, depending on the value of k. The control parameter k itself depends on the acoustic characteristics of the stimulus, on the one hand, and on the combined effects of learning, linguistic experience and attentional factors, on the other hand, in a way described by the following equation:

k(λ) = k₀ + λ + ε/2 + εθ(n – nc)(λ – λf)

where k₀ refers to the system's initial state, λ represents the acoustic parameter that is manipulated in the stimuli (in the present case, the duration of the silent gap between the fricative and the diphthong), ε is a parameter that characterizes the lumped effect of learning, linguistic experience and attention, θ is the discrete form of the Heaviside step function, n is the number of perceived stimulus repetitions in a given run, nc represents a critical number of accumulated repetitions, and λf denotes the value of λ at the other extreme from its initial value. For a given value of k, the system's state evolves in the x perceptual space until it becomes trapped in a local minimum, or attractor, of V(x). Each of the two possible responses in the categorization task corresponds to one attractor in the perceptual space. Figure 1 shows the shape of the potential function for five values of k between -1 and 1. The potential function has one minimum only for extreme values of k, which correspond to stimuli unambiguously associated with either of the two categories, and two minima in the middle range of k, where both categories are possible. As k increases in a monotonic fashion (from left to right in Figure 1), in the vicinity of a critical value kc the system's state, represented by the filled circle in Figure 1, abruptly switches from the basin of attraction in which it was initially located to the second basin that has gradually formed as the first one disappears.
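As a worked step (ours, taking the potential in the form given above), the one- and two-minima regimes can be made explicit. The stationary states of the system satisfy dV/dx = k – x + x³ = 0. This cubic has three real roots, corresponding to two minima flanking a maximum, exactly when its discriminant 4 – 27k² is positive, that is, for |k| < 2/(3√3) ≈ 0.38; outside this range only a single minimum survives, corresponding to an unambiguous percept. The boundary values are thus the critical points ±kc at which one of the two percepts abruptly loses stability.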
Figure 1. Shape of the potential function V(x) for five values of k. Adapted from Tuller et al. (1994).
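To make the mechanics concrete, the following minimal simulation (ours, written in Python; it assumes the potential in the form given above and omits the experience term of k(λ), so the parameter values and the 21-step continuum are purely illustrative) relaxes the perceptual state x by gradient descent on V(x) while k is swept across the continuum in one direction and then the other. Because the state carries over from one stimulus to the next, the reported category switches at different stimuli in the two directions.

import numpy as np

def relax(x, k, dt=0.01, steps=5000):
    # Gradient descent on V(x) = k*x - x**2/2 + x**4/4,
    # i.e., dx/dt = -dV/dx = -(k - x + x**3).
    for _ in range(steps):
        x -= dt * (k - x + x**3)
    return x

def sweep(k_values, x_start):
    # Present stimuli in order; the perceptual state persists across trials.
    x, labels = x_start, []
    for k in k_values:
        x = relax(x, k)
        labels.append("A" if x > 0 else "B")   # arbitrary category labels
    return labels

ks = np.linspace(-0.6, 0.6, 21)        # stands in for the gap-duration continuum
up = sweep(ks, x_start=1.0)            # 1 -> 21 direction
down = sweep(ks[::-1], x_start=-1.0)   # 21 -> 1 direction

def switch_stimulus(labels):
    # 1-based position of the first response change, in presentation order
    return next(i + 1 for i in range(1, len(labels)) if labels[i] != labels[i - 1])

# The two printed stimulus numbers differ: the signature of hysteresis,
# one of the response patterns discussed in the next paragraph.
print(switch_stimulus(up), 22 - switch_stimulus(down))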
In Tuller et al.'s (1994) experiments, the stimuli on the say-stay continuum were presented to the listener in either a randomized or a sequential order. In the latter case, listeners heard the entire set of stimuli twice, going from one of the two endpoints (e.g., say) to the other (stay), and then back to the first one (say) again. In such sequential presentations, three possible response patterns were to be expected: a) hysteresis, defined as the tendency for the listener's response at one endpoint to persist across the ordered sequence of stimuli towards the other endpoint, b) enhanced contrast, in which the listener quickly switches to the alternate percept and does not hold on to the initial categorization, and c) critical boundary, where the switch between the two percepts is associated with the same stimulus regardless of the direction of presentation across the continuum. The results showed that critical boundary was much less frequent than hysteresis and contrast, which occurred equally often. These data provided strong support for the assumption that speech perception is a highly context-dependent process, characterized by a rich variety of dynamical properties. Readers are referred to Tuller et al. (1994) and Case et al. (1995) for further detail on these experiments.

Nguyen et al. (2005) and Nguyen et al. (2007a) recently undertook to extend Tuller and colleagues' (1994) hypotheses and experimental paradigm to the categorization of speech sounds in French. In Nguyen et al. (2007a), the material was made up of 21 stimuli on an acoustic continuum between cèpe /sɛp/ (a type of mushroom) and steppe /stɛp/ (in physical geography, a plain without trees). Each stimulus contained a silent interval between /s/ and /ɛ/ whose duration increased from 0 ms (Stimulus 1) to 100 ms (Stimulus 21) in 5-ms steps. These stimuli were used in a speeded forced-choice identification task administered to eleven native speakers of French, naive as to the purposes of the experiment and with no known hearing defects. Listeners were presented with the 21 stimuli in both random and sequential (1→21→1 or 21→1→21) order. The experiment comprised 20 randomized presentations alternating with 20 sequential presentations. The inter-stimulus interval was two seconds and the experiment lasted about an hour.

An index referred to as the CH (Contrast-Hysteresis) index was devised to measure the extent to which hysteresis or enhanced contrast contributed to each subject's responses to sequential presentations. This entailed locating the position on the continuum of the stimulus associated with the switch from one response to the other in the first part of the presentation (e.g., 1→21, in a 1→21→1 sequence), on the one hand, and in the second part of
the presentation (e.g., 21→1, in a 1→21→1 sequence), on the other hand. The distance between these two points was then measured, in such a way that positive values corresponded to hysteresis, negative values to enhanced contrast, and 0 to critical boundary. The distribution of the CH index across the 20 sequential presentations for all the subjects is shown in Figure 2. These data indicate that hysteresis prevailed over enhanced contrast and critical boundary. The CH index reached a grand average value of 3.5, which proved significantly higher than 0 in a linear mixed-effects model using the CH values as the predicted variable, the intercept as predictor (fixed effect) and the subjects as a blocking factor (random effect; t(208) = 3.93, p < 0.001).
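As a concrete illustration, here is one plausible formalization of the CH index in Python (ours, not the authors' analysis code; the function names are hypothetical, the response sequences are made up, and the sign convention shown holds for 1→21→1 presentations):

def switch_position(responses):
    # 1-based position, in presentation order, of the first stimulus whose
    # response differs from that of the preceding stimulus
    for i in range(1, len(responses)):
        if responses[i] != responses[i - 1]:
            return i + 1
    return None

def ch_index(first_leg, second_leg):
    # first_leg: responses to stimuli 1..21; second_leg: responses to 21..1.
    # The second leg's switch is converted back to continuum coordinates.
    # Positive values: hysteresis; negative: enhanced contrast; 0: critical
    # boundary.
    s1 = switch_position(first_leg)
    s2 = 22 - switch_position(second_leg)
    return s1 - s2

up = ["A"] * 12 + ["B"] * 9     # switch at stimulus 13 on the way up
down = ["B"] * 15 + ["A"] * 6   # switch at stimulus 6 on the way down
print(ch_index(up, down))       # 13 - 6 = 7 > 0: hysteresis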
Figure 2. Observed values of the CH index in Nguyen et al.’s (2007a) speech categorization experiment. Left panel: distribution across all the presentations; right panel: mean value and standard deviation for each of the 20 sequential presentations.
Figure 2 also shows the mean and standard deviation of the CH index for each of the 20 sequential presentations, in chronological order from the beginning of the experiment. An important prediction of the TCDK model is that the amount of hysteresis should decrease as the subject becomes more experienced with the task and stimuli (e.g., Case et al., 1995; Tuller, 2004; Nguyen et al., 2005). This is because the model accommodates increased experience by using a step function that causes more of a change in the tilt of the potential for each step change in the acoustic stimulus. This prediction was borne out by the data: the CH value decreased over the course of the experiment, an effect that proved statistically significant in a linear mixed-effects model using the CH value as predicted variable, the rank of presentation as predictor and the subjects as blocking factor (t(208) = -2.792, p < 0.01). These results offered further confirmation that the speech perception system can be modelled as a nonlinear dynamical
system whose current state simultaneously depends on the input speech sound, the system's past state, and higher-level cognitive factors that include the listener's previous experience with the sounds she/he has to categorize.

In the following, we concentrate on what the TCDK model may contribute to the general issue of abstractionist and exemplar-based views of phonological representation. In spite of being linked to a specific experimental task (forced-choice categorization), the TCDK model shows a number of general properties that, in our view, open the way towards a novel, hybrid view of speech perception that goes beyond the dichotomy traditionally established between exemplars and abstract phonological representations. Clear differences arise between the nonlinear dynamical approach exemplified by the TCDK model and the abstractionist approach. In the former, to a greater extent and more systematically than in the latter, speech categorization is viewed as being sensitive to the detailed acoustic characteristics of the input signal. For example, it is assumed that small variations in the acoustic structure of an ambiguous speech sound can lead to large changes in the perceptual system's response. This will be the case when these variations cause the system to move across a saddle point in the potential function (see Figure 1). Note, however, that perceptual sensitivity to small acoustic change is not assumed to be the same in all regions of the acoustic space. Variations in stimuli close to a prototypical sound unambiguously associated with a given perceptual category will have little impact on the listener's response. This is viewed in the model as being governed by a one-attractor potential function (see left and right panels of Figure 1). The TCDK model also predicts that the relative stability of a category will vary depending on how frequently that category has been perceived in the preceding sequence of speech sounds. This is attributed to the fact that the tilt of the potential function changes more sharply in response to a variation in the input sound when a given category has been perceived more often. Yet another prediction of the model is that the location of the perceptual switch from one category to another in the acoustic space tends strongly to depend on the trajectory followed by the stimuli in that space, as is the case in both hysteresis and enhanced contrast. Over a longer time scale, increasing experience with the stimuli is expected to affect the dynamics of speech categorization, which will tend to move away from hysteresis towards enhanced contrast. The predicted sensitivity of the speech perception system to gradient acoustic properties, frequency of occurrence of perceived categories, trajectory of speech sounds in the acoustic space, and
training, is consistent with listeners' observed response patterns, and seems difficult to account for by purely abstractionist models of speech perception in the absence of a role for FPD. In the TCDK model, this sensitivity partly derives from the fact that speech sounds are mapped onto a discrete and finite set of perceptual categories by means of a continuous potential function, as opposed to the sharp division between sounds and percepts often posited in abstractionist models.

While the dynamical nature of speech categorization is central to the TCDK model, it is also, to a certain extent, emphasized in the exemplar approach. Exemplars are assumed to accumulate in memory as listeners are exposed to them, and this causes boundaries between categories to be continuously pushed around in the perceptual space. As a result, more frequent categories (represented by a higher number of exemplars) gradually come to prevail over less frequent ones (Pierrehumbert, 2006). Perceptual categories are taken in the exemplar approach to be time-dependent, and to evolve continuously in the course of the conversational interactions in which speakers/listeners engage. This of course has major theoretical implications, as frequency of use is expected to have an impact on the very form of phonological representations in memory. As indicated above, however, dynamics in speech perception is not restricted to the incremental effect of exposure on perceived categories but encompasses a much wider range of phenomena such as hysteresis, contrast, bifurcation, and stability⁵. The TCDK model aims to take advantage of the powerful theory of nonlinear dynamical systems to account for these phenomena in all their variety and along short as well as long time scales. It offers an explanation of the bistability of speech perception, attributed to the coexistence of two mutually exclusive attractor states in the perceptual space. In addition, theoretical and methodological tools (e.g., Erlhagen et al., 2006) are available that may allow the nonlinear dynamical framework to be extended to the study of conversational interaction between two or several speakers, and to model the dynamics of speech processing and its influence on the organization of perceived categories as this interaction unfolds in time.
5. Acknowledgments

This work was partly supported by the ACI Systèmes complexes en SHS Research Program (CNRS & French Ministry of Research). Betty Tuller was supported by grants from the National Science Foundation (0414657
and 0719683) and the Office of Naval Research. A first version of the present paper was presented at the Workshop on Phonological Systems and Complex Adaptive Systems held in Lyon in July 2005. Feedback from Abby Cohn, Adamantios Gafos, and Sharon Peperkamp, among other participants, is gratefully acknowledged. We are also grateful to two anonymous reviewers for their critical comments and suggestions.
Notes

1. However, Lavoie (1996), cited by Cohn (2005), did not find the assumed positive correlation between rate of schwa deletion and lexical frequency.
2. In Hintzman's MINERVA 2 model, each memory trace is assumed to be internally represented as a configuration of so-called primitive properties, some of which may be abstract, and which are not themselves acquired by experience. In Coleman's empiricist view, by contrast, properties shown by word forms in memory are by default considered as deriving from sensory experience, unless empirical evidence is obtained that cannot be accounted for without bringing abstract phonological units into play.
3. To take but one example, Lahiri & Reetz (2002) provide experimental evidence suggesting that listeners are insensitive to surface variations in place of articulation of word-final coronals, in accord with the assumption that coronals are unspecified for place of articulation in the lexicon.
4. The material contained twenty groups of four test sentences. Within each group, the target consonant (e.g., /z/) was located in word-initial position (e.g., […] des zéros […] /dezero/ “zeros”) in the first sentence, in word-final position (e.g., […] seize élèves […] /sɛzelɛv/ “sixteen pupils”) in the second sentence, in word-medial position (e.g., […] du raisin […] /dyrezɛ̃/ “some grapes”) in the third sentence, and in liaison position at the juncture between two words (e.g., […] des écrous […] /dezekru/ “some nuts”) in the fourth sentence. Two different versions of Sentences 1 and 4 were created. In the cross-spliced versions, the target consonant and preceding vowel were exchanged between the two sentences (for example, /ez/ in /dezekru/ was substituted for /ez/ in /dezero/, and vice versa). In the identity-spliced versions, the target consonant and preceding vowel in each sentence originated from another repetition of that sentence. Listeners' performance in the target detection task was expected to be poorer for cross-spliced sentences than for identity-spliced sentences if liaison consonants and word-initial consonants showed perceptually salient differences in their phonetic realization.
5. Because the TCDK model presented here aims to account for the dynamics of speech perception in a binary-choice categorization task, both the number of attractors and the association between attractors and perceptual categories were
set by design. How new attractors can emerge in the perceptual space, as the listener is exposed to non-native speech sounds for example, is a major issue addressed by Case et al. (2003), Tuller (2004), and Tuller et al. (2008), in particular. Although the nature of the perceptual categories associated with attractors remains to be established, work by Tuller & Kelso (1990, 1991) suggests that these categories may combine both articulatory and auditory information.
References

Allen, J. and Miller, J. 2004 Listener sensitivity to individual talker differences in voice-onset-time. Journal of the Acoustical Society of America, 115:3171–3183.
Andruski, J., Blumstein, S., and Burton, M. 1994 The effect of subphonetic differences on lexical access. Cognition, 52:163–187.
Barlow, M. and Kemmer, S. (eds) 2000 Usage-Based Models of Language. Center for the Study of Language and Information, Stanford, CA.
Bybee, J. 2001 Phonology and Language Use. Cambridge University Press, Cambridge.
Bybee, J. 2006a Frequency of Use and the Organization of Language. Oxford University Press, Oxford.
Bybee, J. 2006b From usage to grammar: the mind's response to repetition. Language, 82:529–551.
Bybee, J. and McClelland, J. 2005 Alternatives to the combinatorial paradigm of linguistic theory based on domain general principles of human cognition. The Linguistic Review, 22:381–410.
Case, P., Tuller, B., Ding, M., and Kelso, J. 1995 Evaluation of a dynamical model of speech perception. Perception & Psychophysics, 57:977–988.
Case, P., Tuller, B., and Kelso, J. 2003 The dynamics of learning to hear new speech sounds. Speech Pathology, November 17, 1–8. http://www.speechpathology.com/articles/arc_disp.asp?article_id=50&catid=560.
Cohn, A. 2005 Gradience and categoriality in sound patterns. Paper presented at the Workshop on Phonological Systems and Complex Adaptive Systems, Lyons, France, 4–6 July 2005.
Coleman, J. 2002 Phonetic representations in the mental lexicon. In Durand, J. and Laks, B., editors, Phonetics, Phonology, and Cognition, pp. 96–130. Oxford University Press, Oxford.
Couper-Kuhlen, E. and Ford, C. (eds) 2004 Sound Patterns in Interaction: Cross-linguistic Studies from Conversation. John Benjamins, Amsterdam.
Cutler, A., Eisner, F., McQueen, J., and Norris, D. to appear Coping with speaker-related variation via abstract phonemic categories. In Fougeron, C., D'Imperio, M., Kühnert, B., and Vallée, N., editors, Papers in Laboratory Phonology X. Mouton de Gruyter, Berlin.
de Boer, B. 2000 Self-organization in vowel systems. Journal of Phonetics, 28:441–465.
Dufour, S., Nguyen, N., and Frauenfelder, U. 2007 The perception of phonemic contrasts in a non-native dialect. Journal of the Acoustical Society of America Express Letters, 121:EL131–EL136.
Durand, J. 1990 Generative and Non-Linear Phonology. Longman, London.
Elman, J. 1990 Finding structure in time. Cognitive Science, 14:179–211.
Elman, J. and McClelland, J. 1988 Cognitive penetration of the mechanisms of perception: compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27:143–165.
Encrevé, P. 1988 La liaison avec et sans enchaînement. Seuil, Paris.
Erlhagen, W., Mukovskiy, A., and Bicho, E. 2006 A dynamic model for action understanding and goal-directed imitation. Brain Research, 1083:174–188.
Eulitz, C. and Lahiri, A. 2004 Neurobiological evidence for abstract phonological representations in the mental lexicon during speech recognition. Journal of Cognitive Neuroscience, 16:577–583.
Fitzpatrick, J. and Wheeldon, L. 2000 Phonology and phonetics in psycholinguistic models of speech perception. In Burton-Roberts, N., Carr, P., and Docherty, G., editors, Phonological Knowledge: Conceptual and Empirical Issues, pp. 131–160. Oxford University Press, Oxford.
Foulkes, P. and Docherty, G. 2006 The social life of phonetics and phonology. Journal of Phonetics, 34:409–438.
Gaskell, M. 2003 Modelling regressive and progressive effects of assimilation in speech perception. Journal of Phonetics, 31:447–463.
Giles, H., Coupland, N., and Coupland, J. 1991 Accommodation theory: Communication, context, and consequence. In Giles, H., Coupland, N., and Coupland, J., editors, Contexts of Accommodation: Developments in Applied Sociolinguistics, pp. 1–68. Cambridge University Press, Cambridge.
Goldinger, S. 1996 Words and voices: episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory and Cognition, 22:1166–1183.
Goldinger, S. 1998 Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105:251–279.
Goldstein, L. 2003 Emergence of discrete gestures. In Proceedings of the XVth International Congress of Phonetic Sciences, pp. 85–88, Barcelona, Spain.
Grossberg, S. 2003 Resonant neural dynamics of speech perception. Journal of Phonetics, 31:423–445.
Hawkins, S. 2003 Roles and representations of systematic fine phonetic detail in speech understanding. Journal of Phonetics, 31:373–405.
Hawkins, S. to appear Phonetic variation as communicative system: Perception of the particular and the abstract. In Fougeron, C., D'Imperio, M., Kühnert, B., and Vallée, N., editors, Papers in Laboratory Phonology X. Mouton de Gruyter, Berlin.
Hawkins, S. and Smith, R. 2001 Polysp: A polysystemic, phonetically-rich approach to speech understanding. Rivista di Linguistica, 13:99–188.
Hintzman, D. 1986 “Schema abstraction” in a multiple-trace memory model. Psychological Review, 93:411–428.
Jarick, M. and Jones, J. A. 2008 Observation of static gestures influences speech production. Experimental Brain Research, 189:221–228.
Johnson, K. 1997a The auditory/perceptual basis for speech segmentation. Ohio State University Working Papers in Linguistics, 50:101–113.
Johnson, K. 1997b Speech perception without speaker normalization. In Johnson, K. and Mullenix, J., editors, Talker Variability in Speech Processing, pp. 145–166. Academic Press, San Diego.
Johnson, K. 2005a Decisions and mechanisms in exemplar-based phonology. UC Berkeley Phonology Lab Annual Report, pp. 289–311.
Johnson, K. 2005b Speaker normalization in speech perception. In Pisoni, D. and Remez, R., editors, The Handbook of Speech Perception, pp. 363–389. Blackwell, Malden, MA.
Klatt, D. 1979 Speech perception: a model of acoustic-phonetic analysis and lexical access. Journal of Phonetics, 7:279–312.
Lahiri, A. and Marslen-Wilson, W. 1991 The mental representation of lexical form: a phonological approach to the recognition lexicon. Cognition, 38:245–294.
Lahiri, A. and Reetz, H. 2002 Underspecified recognition. In Gussenhoven, C. and Warner, N., editors, Papers in Laboratory Phonology VII, pp. 637–675. Mouton de Gruyter, Berlin.
Lavoie, L. 1996 Lexical frequency effects on the duration of schwa-resonant sequences in American English. Poster presented at LabPhon 5, Chicago, June 1996.
Luce, P. and McLennan, C. 2005 Spoken word recognition: The challenge of variation. In Pisoni, D. and Remez, R., editors, The Handbook of Speech Perception, pp. 591–609. Blackwell, Malden, MA.
Marslen-Wilson, W. and Warren, P. 1994 Levels of perceptual representation and process in lexical access: words, phonemes, and features. Psychological Review, 101:653–675.
Martin, J. and Bunnell, H. 1981 Perception of anticipatory coarticulation effects. Journal of the Acoustical Society of America, 69:559–567.
Martin, J. and Bunnell, H. 1982 Perception of anticipatory coarticulation effects in vowel-stop consonant-vowel sequences. Journal of Experimental Psychology: Human Perception and Performance, 8:473–488.
McClelland, J. and Elman, J. 1986 The TRACE model of speech perception. Cognitive Psychology, 18:1–86.
McLennan, C. and Luce, P. 2005 Examining the time course of indexical specificity effects in spoken word recognition. Journal of Experimental Psychology: Learning, Memory and Cognition, 31:306–321.
McMurray, B., Tanenhaus, M., and Aslin, R. 2002 Gradient effects of within-category phonetic variation on lexical access. Cognition, 86:B33–B42.
McMurray, B., Tanenhaus, M., Aslin, R., and Spivey, M. 2003 Probabilistic constraint satisfaction at the lexical/phonetic interface: Evidence for gradient effects of within-category VOT on lexical access. Journal of Psycholinguistic Research, 32:77–97.
Meltzoff, A. and Moore, M. 1997 Explaining facial imitation: A theoretical model. Early Development and Parenting, 6:179–192.
Mitterer, H. 2006 Is vowel normalization independent of lexical processing? Phonetica, 63:209–229.
Nguyen, N., Lancia, L., Bergounioux, M., Wauquier-Gravelines, S., and Tuller, B. 2005 Role of training and short-term context effects in the identification of /s/ and /st/ in French. In Hazan, V. and Iverson, P., editors, ISCA Workshop on Plasticity in Speech Perception (PSP2005), pp. A38–39, London.
Nguyen, N., Lancia, L., and Tuller, B. 2007a The dynamics of speech categorization: Evidence from French. In preparation.
Nguyen, N., Wauquier-Gravelines, S., Lancia, L., and Tuller, B. 2007b Detection of liaison consonants in speech processing in French: Experimental data and theoretical implications. In Prieto, P., Mascaro, J., and Solé, M.-J., editors, Segmental and Prosodic Issues in Romance Phonology, pp. 3–23. John Benjamins, Amsterdam.
Pallier, C. 2000 Word recognition: do we need phonological representations? In Cutler, A., McQueen, J., and Zondervan, R., editors, Proceedings of the Workshop on Spoken Word Access Processes (SWAP), pp. 159–162, Nijmegen.
Pallier, C., Colomé, A., and Sebastián-Gallès, N. 2001 The influence of native-language phonology on lexical access: exemplar-based vs. abstract lexical entries. Psychological Science, 12:445–449.
Pardo, J. 2006 On phonetic convergence during conversational interaction. Journal of the Acoustical Society of America, 119:2382–2393.
Pardo, J. S. and Remez, R. E. 2007 The perception of speech. In Traxler, M. and Gernsbacher, M., editors, The Handbook of Psycholinguistics, Second Edition. Elsevier, Cambridge, MA. In press.
Pierrehumbert, J. 2001 Exemplar dynamics: Word frequency, lenition, and contrast. In Bybee, J. and Hopper, P., editors, Frequency Effects and the Emergence of Linguistic Structure, pp. 137–157. John Benjamins, Amsterdam.
Pierrehumbert, J. 2002 Word-specific phonetics. In Gussenhoven, C. and Warner, N., editors, Papers in Laboratory Phonology VII, pp. 101–140. Mouton de Gruyter, Berlin.
Pierrehumbert, J. 2006 The next toolkit. Journal of Phonetics, 34:516–530.
Pitt, M. and Johnson, K. 2003 Using pronunciation data as a starting point in modeling word recognition. In Proceedings of the XVth International Congress of Phonetic Sciences, Barcelona, Spain.
Stevens, K. 2002 Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111:1872–1891.
Stevens, K. 2004 Invariance and variability in speech: Interpreting acoustic evidence. In Slifka, J., Manuel, S., and Matthies, M., editors, Proceedings of From Sound to Sense: 50+ Years of Discovery in Speech Communication, pp. B77–B85. Cambridge, MA: MIT. URL: www.rle.mit.edu/soundtosense/.
Strand, E. 2000 Gender Stereotype Effects in Speech Processing. PhD thesis, Ohio State University.
Studdert-Kennedy, M. 2002 Mirror neurons, vocal imitation and the evolution of particulate speech. In Stamenov, M. and Gallese, V., editors, Mirror Neurons and the Evolution of Brain and Language, pp. 207–227. John Benjamins, Amsterdam.
Tuller, B. 2003 Computational models in speech perception. Journal of Phonetics, 31:503–507.
Tuller, B. 2004 Categorization and learning in speech perception as dynamical processes. In Riley, M. and Van Orden, G., editors, Tutorials in Contemporary Nonlinear Methods for the Behavioral Sciences. National Science Foundation. URL: www.nsf.gov/sbe/bcs/pac/nmbs/nmbs.jsp.
Tuller, B., Case, P., Ding, M., and Kelso, J. 1994 The nonlinear dynamics of speech categorization. Journal of Experimental Psychology: Human Perception and Performance, 20:3–16.
Tuller, B., Jantzen, M., and Jirsa, V. 2008 A dynamical approach to speech categorization: Two routes to learning. New Ideas in Psychology, 26:208–226.
Tuller, B. and Kelso, J. 1990 Phase transitions in speech production and their perceptual consequences. In Jeannerod, M., editor, Attention and Performance XIII: Motor Representation and Control, pp. 429–452. Lawrence Erlbaum, Hillsdale, NJ.
Tuller, B. and Kelso, J. 1991 The production and perception of syllable structure. Journal of Speech and Hearing Research, 34:501–508.
Vihman, M. 2002 The role of mirror neurons in the ontogeny of speech. In Stamenov, M. and Gallese, V., editors, Mirror Neurons and the Evolution of Brain and Language, pp. 305–314. John Benjamins, Amsterdam.
Wauquier-Gravelines, S. 1996 Organisation phonologique et traitement de la parole continue. Unpublished PhD dissertation, Université Paris 7, Paris.
West, P. 1999 Perception of distributed coarticulatory properties in English /l/ and /ɹ/. Journal of Phonetics, 27:405–426.
A dynamical model of change in phonological representations: The case of lenition¹

Adamantios Gafos and Christo Kirov

1. Introduction

This paper presents a model of diachronic changes in phonological representations. The broad context is that of sound change, where word representations evolve at relatively slow time scales as compared to the time scale of assembling a phonological representation in synchronic word production. The specific focus is on capturing certain key properties of an unfolding lenition process. Diachronic changes in phonological representations accumulate gradually during repeated production-perception loops, that is, through the impact of a perceived word on the internal representation and subsequent production of that word or other related words. To formally capture the accumulation of such changes, we capitalize on the continuity of parameter values at the featural level. Our model is illustrated by contrasting it with another recent view, exemplified by an exemplar model of lenition, which also embraces continuity in its representational parameters. A primary concern is providing a formal basis for change in phonological representations using basic concepts from the mathematics of dynamical systems.
2. The case of lenition

The term “lenition” is used to describe a variegated set of sound alternations, such as voicing of obstruents between two vowels, spirantization of stops in a prosodically weak position, and devoicing of obstruents in syllable-final position, which are either attested synchronically or have resulted diachronically in a restructuring of the phonemic inventory of a language. One diachronic example of lenition is Grimm's Law, according to which Proto-Indo-European voiceless stops became Germanic voiceless fricatives (e.g. PIE *[t] > Gmc *[θ]). Other examples of sound changes described as cases of lenition are given below (for recent studies see Gurevich, 2004;
Cser, 2003, and Lavoie, 2001). In each case, a stop turns to a fricative similar in place of articulation to the original stop.

(1) a. Southern Italian dialects: [b d g] → [v ð ɣ] intervocalically.
    b. Greek (Koine): [pʰ tʰ kʰ] → [f θ x] except after obstruents.
    c. Proto-Gaelic: [t k] → [θ x] intervocalically.
    d. Hungarian: [p] → [f] word initially.
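As the discussion below makes explicit, a symbolic treatment renders such alternations as categorical rewrites over discrete representations. A minimal sketch in Python (ours; the segment sets and the simplified orthography are illustrative) shows this all-or-nothing character for the intervocalic pattern in (1a): there is no representation of an intermediate degree of stricture.

# Categorical rewrite for (1a): voiced stops become fricatives at the
# same place of articulation when flanked by vowels.
LENITE = {"b": "v", "d": "ð", "g": "ɣ"}
VOWELS = set("aeiou")

def intervocalic_lenition(segments):
    out = list(segments)
    for i in range(1, len(segments) - 1):
        if (segments[i] in LENITE
                and segments[i - 1] in VOWELS
                and segments[i + 1] in VOWELS):
            out[i] = LENITE[segments[i]]
    return out

print(intervocalic_lenition(list("aba")))   # ['a', 'v', 'a']: all or nothing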
Consider any single transition between two states of a lenition process, say, starting with a stop [b] and resulting in a fricative [v], [b] > [v]. At a broad level, one can describe two kinds of approaches to this kind of transition. The symbolic approach, as exemplified by Kiparsky's classic paper on linguistic universals and sound change (Kiparsky, 1968), studies the internal composition of the individual stages (e.g. feature matrices at each stage) and makes inferences about the nature of the grammar and the representations. The continuity of sound change, that is, how the representation of the lexical item containing a [b] changes in time to one containing a [v], is not studied. This is in part due to the theoretical assumption that representations are discrete. That is, there is no symbol corresponding to an intermediate degree of stricture between that of a stop and a fricative. In the dynamical approach, the transition process between the stages is studied at the same time as the sequence of stages. In what follows, we instantiate a small, yet core part of a dynamical alternative to the symbolic model of sound change.

2.1. An exemplar model of lenition

It is useful to describe the main aspects of our model by contrasting it with another model proposed recently by Pierrehumbert. This is a model of sound change aimed at accounting for certain generalizations about lenition, extrapolated from observations of synchronic variation or sound changes in progress. The model proposed in Pierrehumbert (2001) has two attractive properties. It offers a way to represent the fine phonetic substance of linguistic categories, and it provides a handle on the effect of lexical frequency in the course of an unfolding lenition process. In Pierrehumbert's discussion of lenition, it is assumed that the production side of a lenition process is characterized by the following set of properties.
Table 1. Properties of lenition

1. Each word displays a certain amount of variability in production.
2. The effect of word frequency on lenition rates is gradient.
3. The effect of word frequency on lenition rates should be observable within the speech of individuals; it is not an artifact of averaging data across the different generations which make up a speech community.
4. The effect of word frequency on lenition rates should be observable both synchronically (by comparing the pronunciation of words of different frequency) and diachronically (by examining the evolution of word pronunciations over the years within the speech of individuals).
5. The phonetic variability of a category should decrease over time, a phenomenon known as entrenchment.

The actual impact of entrenchment on lenition is not clear, and Pierrehumbert does not cite any data specific to entrenchment for this particular diachronic effect. In fact, while a sound change is in progress, it seems equally intuitive (in the absence of any data to the contrary) that a wider, rather than narrower, range of pronunciations is available to the speaker. Pierrehumbert uses the example of a child's productions of a category becoming less variable over time, but this may only apply to stable categories, rather than ones undergoing diachronic change. It may also be orthogonal to the child's phonetic representations, and rather be due to an initial lack of biomechanical control. For these reasons, therefore, our own model is not designed to guarantee entrenchment while sound change is taking place, but does show entrenchment effects for diachronically stable categories.

The first property, variability in production, does not apply exclusively to lenition processes; rather, it is a general characteristic of speech production that any lenition model should be able to capture. The frequency-related properties are based on previous work by Bybee, who claims that at least some lenition processes apply variably based on word frequency
frequency (Bybee, 2003). Examples include schwa reduction (e.g. memory tends to be pronounced [mɛmɹi]) and t/d-deletion (e.g. told tends to be pronounced [tol]). Once a lenition process has begun, Bybee’s claim amounts to saying that words with high frequency will weaken more quickly over time than rare words. Consequently, lenition effects can be seen both synchronically and diachronically. Synchronically, a more frequent word will be produced more lenited (with more undershoot) than a less frequent word in the current speech of a single person. Diachronically, all words in a language will weaken across all speakers, albeit at different rates.

What are the minimal prerequisites in accounting for the lenition properties above? First, it is clear that individuals must be capable of storing phonetic detail within each lexical item. We also need a mechanism for gradiently changing the lexical representations over time. To do this, the perceptual system must be capable of making fine phonetic distinctions, so that the information carried by these distinctions can reach the currently spoken item in the lexicon. Pierrehumbert’s exemplar-based model of lenition gives explicit formal content to each of these prerequisites (Pierrehumbert, 2001). The model is built on a few key ideas, which can be described in brief terms.

Specifically, in the exemplar-based model, a given linguistic category is stored in a space whose axes define the parameters of the category. In Pierrehumbert (2001), it is suggested that vowels, for example, might be stored in an F1/F2 formant space. This space is quantized into discrete cells based on perceptual limits. Each cell is considered to be a bin for perceptual experiences, and Pierrehumbert views each bin to be a unique potential exemplar. When the system receives an input, it places it in the appropriate bin. All items in a bin are assumed to be identical as far as the perceptual system is concerned, and the more items in a particular bin, the greater the activation of the bin is. All bins start out empty and are not associated with any exemplars that have actually been produced and/or perceived (memory begins as a tabula rasa). When a bin is filled, this is equivalent to the storage of an exemplar. The new exemplar is given a categorical label based on the labels of other nearby exemplars. This scheme limits the actual memory used by exemplars. There is a limited number of discrete bins, and each bin only stores an activation value proportional to the number of exemplar instances that fall into it. Thus, not all the exemplar instances need to be stored. A decay process decreases the activation of an exemplar bin over time, corresponding to memory decay. Figure 1, taken from
Pierrehumbert (2001), shows the F2 space discretized into categorically labeled bins.
Figure 1. Exemplar bins with varying activations.
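To make the mechanics of this scheme concrete, the sketch below simulates bin-based exemplar dynamics in Python. It is a minimal illustration, not Pierrehumbert’s published implementation: the bin count, decay rate, bias size, and seeding of the category are all assumed values, and category labeling is omitted.

import numpy as np

# Minimal sketch of bin-based exemplar dynamics (all parameters assumed).
rng = np.random.default_rng(0)
n_bins = 100                     # quantized perceptual space, e.g. an F2 axis
activation = np.zeros(n_bins)    # one activation value per bin; memory starts empty
decay = 0.999                    # per-loop memory decay
bias = -1                        # lenition bias, in bins, applied at production

def perceive(bin_index):
    # Store a token: increment the activation of the bin it falls into.
    activation[bin_index] += 1.0

def produce():
    # Choose an exemplar with probability proportional to activation,
    # then shift it by the lenition bias.
    p = activation / activation.sum()
    chosen = rng.choice(n_bins, p=p)
    return max(chosen + bias, 0)

# Seed the category around bin 70, then run production/perception loops.
for _ in range(50):
    perceive(70 + rng.integers(-3, 4))
for _ in range(500):
    token = produce()            # produced more lenited than intended
    perceive(token)              # re-stored, skewing the distribution
    activation *= decay          # older exemplars fade

print(np.argmax(activation))     # the category mode has drifted below bin 70

Each pass through the loop corresponds to one production/perception cycle, so under this scheme a more frequent word simply runs more iterations and drifts further in the same span of time.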
The set of exemplars with a particular category label constitutes an extensional approximation of a probability distribution for that category over the storage space. Given the coordinates in the storage space over which a category is defined, that distribution would provide the likelihood that a token with those coordinates would belong to that category (e.g. how likely is it that the token is an /a/).

During production, a particular exemplar from memory is chosen to be produced, where the likelihood of being chosen depends on how activated the exemplar is. The chosen exemplar is shifted by a bias in the direction of lenition. This bias reflects the synchronic phonetic motivation for lenition. This includes at least the tendency to weaken the degree of oral constriction in contexts favoring segmental reductions, e.g. in non-stressed syllables, syllable codas, or intervocalically. For relevant discussion see Beckman et al. (1992) and Wright (1994). To account for entrenchment (see Table 1(5)), Pierrehumbert extends this production model by averaging over a randomly selected area of exemplars to generate a production candidate. Since the set of exemplars defines a probability distribution (in an extensional sense), weighting the average by each exemplar’s probability results in a production candidate pushed toward the center of the distribution.

The exemplar scheme described in this section derives the five properties of lenition discussed earlier as follows. Variability in production is
directly accounted for since production is modeled as an average of the exemplar neighborhood centered around a randomly selected exemplar from the entire set stored in the system. Each lexical item has its own exemplars, and each production/perception loop causes the addition of a new exemplar to the set. This new exemplar is more lenited than the speaker originally intended due to biases in production, so the distribution of exemplars skews over time. In a given period of time, the number of production/perception loops an item goes through is proportional to its frequency. Thus, the amount of lenition associated with a given item shows gradient variation according to the item’s frequency (Dell, 2000). As all processes directly described by the exemplar model occur within a single individual, lenition is clearly observable within the speech of individuals. Diachronically, lenition will proceed at a faster rate for more frequent items because they go through more production/perception loops in a given time frame. The synchronic consequence of this is that at a point in time, more frequent items will be more lenited in the speech of an individual than less frequent items. Finally, entrenchment is a consequence of averaging over several neighboring exemplars during production, shifting the resulting production towards the mean of the distribution described by all the exemplars.

In sum, the exemplar-based model offers a direct way to encode phonetic details, and captures the assumed effects of frequency on lenition. Pierrehumbert further claims that the exemplar model is the only type of model that can properly handle the above conception of lenition (Pierrehumbert, 2001:147). In what follows, we will propose an alternative dynamical model of lenition. The dynamical model can also account for the lenition properties reviewed above. But it is crucially different from the exemplar-based model in two respects. The dynamical model encodes phonetic details while maintaining unitary category representations as opposed to representations defined extensionally by collections of exemplars. In addition, the dynamical model also admits a temporal dimension, which is currently not part of the exemplar-based model.

2.2. A dynamical model of lenition

2.2.1. Description of the model

Studying language change as a process occurring in time broadly motivates a dynamical approach to modeling. A dynamical model is a formal system
whose internal state changes in a controlled and mathematically explicit way over time. The workings of the proposed model are based on a dynamical formalism called Dynamic Field Theory (Erlhagen & Schöner, 2002). A central component of our model is the spatio-temporal nature of its representations. Take a lexical item containing a tongue tip gesture such as that for /d/. We can think of the specification of the speech movements associated with this gesture as a process of assigning values to a number of behavioral parameters. In well-developed models that include a speech production component, these parameters include constriction location and constriction degree (Guenther, 1995; Saltzman & Munhall, 1989; Browman & Goldstein, 1990). A key idea in our model is that each such parameter is not specified exactly but rather by a distribution depicting the continuity of its phonetic detail.

Although our model does not commit to any specific phonological feature set or any particular model for the control and execution of movement, to illustrate our proposal more explicitly let us assume the representational parameters of Articulatory Phonology (Saltzman & Munhall, 1989; Browman & Goldstein, 1990). Thus, let us assume that lexical items must at some level take the form of gestural scores. A gestural score, for current purposes, is simply a sequence of gestures (we put aside the intergestural temporal relations that also must be specified as part of a full gestural score). For example, the sequence /das/ consists of three oral gestures - a tongue tip gesture for /d/, a tongue dorsum gesture for /a/, and a tongue tip gesture for /s/. Gestures are specified by target descriptors for the vocal tract variables of Constriction Location (CL) and Constriction Degree (CD), parameters defining the target vocal tract state. For example, /d/ and /s/ have the CL target descriptor {alveolar}. The CD descriptor of /d/ is {closure} and for /s/ it is {critical}. These descriptors correspond to actual numerical values. For instance, in the tongue tip gesture of a /d/, {alveolar} corresponds to 56 degrees (where 90 degrees is vertical and would correspond to a midpalatal constriction) and {closure} corresponds to a value of 0 mm.

In our model, each parameter is not specified by a unique numerical value as above, but rather by a continuous activation field over a range of values for the parameter. The field captures among other things a distribution of activation over the space of possible parameter values so that a range of more activated parameter values is more likely to be used in the actual execution of the movement than a range of less activated parameter values. The parameter fields then resemble distributions over the
continuous details of vocal tract variables. A lexical item therefore is a gestural score where the parameters of each gesture are represented by their own fields. Schematic fields corresponding to the (oral) gestures of the consonants in /das/ are given in Figure 2.
Figure 2. Component fields of /d/, /s/, and /a/. The y-axis represents activation. /d/ and /s/ have nearly identical CL fields, as they are both alveolars, but they differ in CD.
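A lexical entry of this kind can be sketched as a small data structure: a sequence of gestures, each of whose parameters is a field rather than a point value. The sketch below is only illustrative; the field widths and the values assumed for /a/ and for the {critical} degree of /s/ are our own placeholders, not values from the model.

import numpy as np

cl_axis = np.linspace(40, 90, 200)      # Constriction Location axis, in degrees
cd_axis = np.linspace(-2, 12, 200)      # Constriction Degree axis, in mm

def field(axis, center, width):
    # A parameter field: a distribution of activation over parameter values.
    return np.exp(-((axis - center) ** 2) / (2 * width ** 2))

# /das/ as a gestural score; each gesture's parameters are fields, not points.
das = [
    {"gesture": "d", "CL": field(cl_axis, 56.0, 3.0),  # {alveolar} ~ 56 degrees
     "CD": field(cd_axis, 0.0, 0.5)},                  # {closure} ~ 0 mm
    {"gesture": "a", "CL": field(cl_axis, 80.0, 4.0),  # dorsal values: placeholders
     "CD": field(cd_axis, 10.0, 1.0)},
    {"gesture": "s", "CL": field(cl_axis, 56.0, 3.0),  # alveolar, like /d/
     "CD": field(cd_axis, 1.5, 0.5)},                  # {critical}: narrow but open (assumed)
]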
Formally, parameters are manipulated using the dynamical law from Dynamic Field Theory (Erlhagen & Schöner, 2002). The basic dynamics governing each field are described by:

τ dp(x,t)/dt = −p(x,t) + h + input(x,t) + noise     (1)
where p is the field in memory (a function of the continuous variables x and t), h is the field’s resting activation, dp(x,t)/dt is the rate of change of activation at x at time t, τ is a constant corresponding to the rate of decay of the field (i.e. the rate of memory decay), and input(x,t) is a field representing time-dependent external input to the system (i.e. a perceived token) in the form of a localized activation spike.

The equation can be broken down into simpler components to better understand how it functions. The core component τ dp(x,t)/dt = −p(x,t) + h is an instance of exponential decay. If we arbitrarily select a value for x, and plot p(x,t) over time, we will see behavior described by the exponential decay equation. In the absence of any input or interaction, the activation at p(x,t)
will simply decay down to its resting level, h, as shown in Figure 3. If p(x,t) starts at resting activation, it will remain there forever. In the terminology of dynamical systems, the starting activation of a point is known as an initial condition, and the activation it converges to, in this case the resting activation, is known as an attractor. If the input term, input(x,t), is nonzero, then the system will move towards a point equivalent to its resting activation plus the input term. The speed of the process is modulated using the τ term.
Figure 3. Top left: In the absence of input, field activation at a particular point converges to the resting level h = 1 (dashed line) (τ = 10). Top right: With added input input(x,t) = 1, activation converges to resting level h = 1 plus input (top dashed line) (τ = 10). Bottom left: In the absence of input, activation converges to resting level h = 1 (τ = 20). Bottom right: With added input input(x,t) = 1, activation converges to resting level h = 1 plus input (τ = 20).
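The behavior in Figure 3 is easy to reproduce by forward-Euler integration of Equation 1 at a single field point. The step size and switching time below are our own illustrative choices, and noise is omitted:

tau, h = 10.0, 1.0
dt = 0.1
p = 5.0                                 # initial activation (the initial condition)
trace = []
for step in range(2000):
    inp = 1.0 if step > 1000 else 0.0   # input switched on halfway through
    dp = (-p + h + inp) / tau           # Equation (1) at one point, noise omitted
    p += dp * dt
    trace.append(p)

# p first decays to the resting level h = 1 (the attractor),
# then settles at h + input = 2 once input is present.
print(round(trace[999], 2), round(trace[-1], 2))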
Fields are spatio-temporal in nature. Thus specifying the value of a gestural parameter is a spatio-temporal process in our model. We describe each of these aspects, spatial and temporal, in turn. The spatial aspect of the gestural specification process corresponds to picking a value to produce from any of the fields in Figure 2, e.g. choosing a value for Constriction Location for /d/ and /s/ from within the range of values corresponding to the [alveolar]
category. This is done by sampling the Constriction Location field, much as we might sample a probability distribution. Since each field encodes variability within the user’s experience, we are likely to select reasonable parameter values for production. A demonstration of this is shown in Figure 4. The noisy character of the specification process allows for variation in the value ultimately specified, but, as the series of simulations in Figure 4 verifies, the selected values cluster reliably around the maximally activated point of the field.
Figure 4. Variability in production. Histogram of selected values over 100 simulations of gestural specification. Histogram overlaid on top of field to show clustering of selected values near the field maximum.
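The sampling behavior in Figure 4 can be sketched as activation-weighted draws from a field; the field shape below (a Gaussian centered on the {alveolar} value of 56 degrees) is an assumed stand-in for a learned field:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(40, 70, 300)             # e.g. Constriction Location in degrees
field = np.exp(-((x - 56.0) ** 2) / 8)   # activation peaked at the {alveolar} value

# Draw production values with probability proportional to activation.
p = field / field.sum()
samples = rng.choice(x, size=100, p=p)
print(samples.mean())                    # clusters reliably near the field maximum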
The specification process presented here is similar but not identical to the sampling of a probability distribution. Fields have unique properties that make them useful for modeling memory. Unlike distributions, fields need not be normalized to an area under the curve of one. The key addition here is the concept of activation. Fields can vary from one another in total activation while keeping within the same limits of parameter values. Because of this added notion of activation, the specification process is more biased towards the maximally activated point in the field (i.e. the mean of the distribution) than a true random sampling would be. This leads to an entrenchment effect for categories not undergoing change. This behavior is shown in Figure 5. In addition, fields have a resting activation level (a lower limit on activation). This level slowly tends to zero over time, but increases every time the field is accessed during production or perception. Thus, lexical items whose fields are accessed more frequently have higher resting activation levels than lexical items whose component fields are accessed less frequently. Finally, much as memory wanes over time, activation along a field decays if not reinforced by input.

The other crucial aspect of the specification process is its time-course. Formalizing gestural parameters with fields adds a time-course dimension
to the gestural specification process. Thus, if a lexical representation contains a /d/, the CD and CL parameters for this /d/ are not statically assigned to their (language- or speaker-specific) canonical values, e.g., CL = [alveolar]. Rather, assigning values to these parameters is a time-dependent process, captured as the evolution of a dynamical system over time. In short, lexical representations are not static units. This allows us to derive predictions about the time-course of choosing or specifying different gestural parameters.
Figure 5. Output of entrenchment simulation. The x-axis represents a phonetic dimension (e.g. constriction degree). The field defining the distribution of this parameter is shown at various points in time. As time progresses, the field becomes narrower.
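The narrowing in Figure 5 can be illustrated with a simplified loop in which a diachronically stable category keeps reinforcing the region around its field maximum. In the model itself the reinforced value is sampled with a bias toward the maximum; the sketch below reinforces the maximum deterministically, and the widths and decay rate are assumed:

import numpy as np

x = np.linspace(0, 10, 400)
field = np.exp(-((x - 5.0) ** 2) / 2.0)       # broad initial field around 5

for _ in range(50):
    off = x[np.argmax(field)]                 # specification biased toward the maximum
    field = 0.95 * field + np.exp(-((x - off) ** 2) / 0.5)   # decay plus reinforcement

# The effective width shrinks: activation concentrates at the stable value.
width = np.sqrt(np.sum((x - 5.0) ** 2 * field) / np.sum(field))
print(width)                                  # well below the initial width of 1.0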
The specification process begins with a temporary increase in the resting activation of the field, i.e. pushing the field up, caused by an intent to produce a particular lexical item (which includes a gesture ultimately specified for a parameter represented by this field). Activation increases steadily but noisily until some part of the field crosses a decision threshold and becomes the parameter value used in production. This scheme ensures that the areas of maximum activation are likely to cross the decision threshold first. After a decision has been made, resting activation returns to its pre-production level. The following equation represents this process mathematically:

τ dh/dt = −h + h0 + δ(d, max(p)) * h + noise     (2)
where h is the temporarily augmented resting activation, τ is a time scaling parameter, h0 is the pre-production resting activation level, δ(d, max(p)) is a nonlinear sigmoid or step function over the distance between the decision threshold d and the maximum activation of field p, and noise is scaled Gaussian noise. While the distance is positive (the decision threshold has not yet been breached), the δ function is also positive and greater than 1, overpowering the −h term and causing a gradual increase in the resting activation h. When the decision threshold is breached, the δ function becomes 0, and remains clamped at 0 regardless of the subsequent field state, allowing the −h term to bring activation back to h0.

The gestural specification process is affected by the pre-production resting activation of the field, in that a field with high resting activation is already “presampled”, and thus automatically closer to the decision threshold. This leads to faster decisions for more activated fields, and by extension for more frequent parameter values. The relevant simulations are described below. Figure 6 shows representative initial fields, and Figure 7 shows the progression of the featural specification process over time. We see that given two fields identical in all respects except for resting activation, the field with the higher resting activation reaches the decision threshold first.
Figure 6. The two fields are identical except for resting activation: h0 = 1 (left), h0 = 2 (right). The x-axis is arbitrary.
Figure 7. Sampling was simulated with a decision threshold d = 5, τ = 10, and noise = 0. The first field (left) reached the decision threshold at t = 25,
and the second field (right) reached the decision threshold at t = 9 (where t is an arbitrary unit of simulation time). The field with higher initial resting activation reached the decision threshold faster. Both fields return to their pre-production resting activation after the decision threshold is reached.
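A compact way to see this race dynamic is to integrate Equation 2 for two fields that differ only in their resting activation. The reported simulation used d = 5, τ = 10 and noise = 0; the sketch below uses its own assumed threshold, step values, and field height, so the absolute times differ, but the ordering is the same:

def time_to_decision(h0, d=6.0, tau=10.0, dt=1.0):
    # Integrate Equation (2) until the field maximum crosses the threshold d.
    h, t = h0, 0.0
    peak = 3.0                            # field maximum sits 3 units above resting level (assumed)
    while h + peak < d:                   # max(p) approximated as h + peak
        delta = 1.5                       # step function: > 1 while threshold not breached
        dh = (-h + h0 + delta * h) / tau  # Equation (2), noise omitted
        h += dh * dt
        t += dt
    return t

print(time_to_decision(h0=1.0))           # slower decision (t = 11 with these settings)
print(time_to_decision(h0=2.0))           # higher resting activation: faster (t = 4)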
We now discuss the ways in which representing gestural parameters by fields relates to other proposals. The field equation used in our model parallels the exemplar model in many ways, but encapsulates much of the functionality of that model in a single dynamical law which does not require the storage of exemplars. Memory wanes over time as the field decays, much as older exemplars are less activated in the exemplar model. Input causes increased activation at a particular area of the field, much as an exemplar’s activation is increased with repeated perception. This activation decays with time, as memory does. Perhaps the most crucial difference between our model and the exemplar model described earlier is the time-course dimension. In the exemplar model discussed, the assignment of a value to a parameter does not have any time-course. The process is instantaneous. The same is true for the relation between our model and those of Saltzman & Munhall (1989) and Browman & Goldstein (1990). Using fields is a generalization of a similar idea put forth in Byrd & Saltzman (2003), where gestural parameters are stored as ranges of possible values. In our model, each range is approximated by an activation field in memory. Finally, representing targets by activation fields is also a generalization of two well-known proposals about the nature of speech targets, Keating’s “windows” (Keating, 1990) and Guenther’s “convex regions” (Guenther, 1995). In Guenther’s model of speech production, speech targets take the form of convex regions over orosensory dimensions. Unlike other properties of targets in Guenther’s model, the convexity property does not fall out from the learning dynamics of the model. Rather, it is an enforced assumption. No such assumption about the nature of the distributions underlying target specification need be made in our model.

2.2.2. Lenition in the dynamical model

When a lexical item is a token of exchange in a communicative context, phonetic details of the item’s produced instance may be picked up by perception. This will have some impact on the stored instance of the lexical item. Over longer time spans, as such effects accumulate, they trace out a
path of a diachronic change. Our model provides a formal basis for capturing change at both the synchronic and the diachronic dimensions. We focus here on how a single field in a lexical entry is affected in a production-perception loop.

The crucial term in the field equation is the input term input(x,t), which represents sensory input. More specifically, input is a peak of activation registered by the speech perception module. This peak is located at some detected x-axis value along the field. This value is assumed to be sub-phonemic in character. For example, we assume that speakers can perceive gradient differences in Voice Onset Time values, constriction location, and constriction degree within the same phonemic categories. In the current model, the input term is formulated as e^(−(x − off)²), where off is the detected value or offset along the x-axis of the field. The spike corresponding to the input term input(x,t) is directly added to the appropriate field, resulting in increased activation at some point along the field’s x-axis. A concrete example is presented in Figure 8. Once input is presented, the system can evolve to a stable attractor state, that is, a localized peak at a value corresponding to the input. The state is stable in the sense that it can persist even after the input has been removed. In effect, the field for the lexical item has retained a memory of the sub-phonemic detail in the recently perceived input. The process of adding Gaussian input spikes to an existing field is analogous to the storage of new exemplars in the exemplar model. The field, however, remains a unitary function. It is an intensional representation of a phonetic distribution. A growing set of exemplars is an extensional representation.
Figure 8. (Left) Field representing a phonetic parameter of a lexical item in memory. (Middle) Input function (output of perception corresponding to input(x,t) in Equation 1). Represents a localized spike in activation along the field, corresponding in location to, for example, the constriction degree of the input. (Right) Field of lexical item in memory after input is added to it. Field shows increased activation around area of input.
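In code, the update shown in Figure 8 is just a pointwise addition of the Gaussian spike to the stored field; the grid, field shape, and decay factor below are assumed for illustration:

import numpy as np

x = np.linspace(0, 10, 200)             # phonetic parameter axis (arbitrary units)
field = np.exp(-((x - 6.0) ** 2))       # stored field, peaked at 6

off = 5.5                               # sub-phonemic detail detected by perception
spike = np.exp(-((x - off) ** 2))       # input(x,t) = e^(-(x - off)^2)

field = 0.95 * field + spike            # slight memory decay, then add the input
print(x[np.argmax(field)])              # the peak has shifted toward the perceived value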
Since activation fades slowly over time, only areas of the field that receive reinforcement are likely to remain activated. Thus, a peak in activation may shift over time depending on which region of the field is reinforced by input. In terms of the lenition model, this means that regions of the field representing a less lenited parameter fade while regions representing a more lenited parameter are kept activated by reinforcement from input. The interaction between localized increase in activation based on input and the slow fading of the field due to memory decay is the basic mechanism for gradual phonetic change.

Given an initial field (a preshape) representing the current memory state of a lexical item, we can simulate lenition using the model described above. Figure 9 shows the results of one set of simulations. Shown are the state of the simulation at the starting state, after 50 samples of a token, and after 100 samples (in the simulations, the number of samples is small but each sample produces a large effect on the field). Each time step of the simulation corresponds to a production/perception loop. Production was performed as described above by picking a value from the field and adding noise and a bias to it. This produced value, encoded by an activation spike of the form e^(−(x − off)²), where off = sample(p) + noise + bias, was fed back into the system as input. As can be seen in Figure 9, at the point when lenition begins, the field represents a narrow distribution of activation and there is little variability when sampling the field during production. As lenition progresses, the distribution of activation shifts to the left. During this time the distribution becomes asymmetrical, with a tail on the right corresponding to residual traces of old values for the parameter. It also grows wider, corresponding to an increase in parameter variation while the change occurs.
Figure 9. Output of lenition simulation. The x-axis represents a phonetic dimension (e.g. constriction degree during t/d production). Each curve represents a distribution of a particular category over the x-axis at a point in time. As time progresses, the distribution shifts to the left (i.e. there is more undershoot/lenition) and becomes broader.
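The full production/perception loop behind Figure 9 can be sketched as follows. This is a minimal reconstruction under assumed parameter values (field width, noise level, bias magnitude, decay rate), not the parameterization of the reported simulations:

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 400)                  # e.g. a constriction degree axis
field = np.exp(-((x - 7.0) ** 2) / 0.1)      # narrow initial field (the preshape)

bias, noise_sd, decay = -0.05, 0.05, 0.98
for _ in range(100):                         # each pass is one production/perception loop
    p = field / field.sum()
    off = rng.choice(x, p=p) + rng.normal(0, noise_sd) + bias   # produced value
    field = decay * field + np.exp(-((x - off) ** 2) / 0.1)     # fed back as input

print(x[np.argmax(field)])                   # the distribution has drifted leftward

Frequency effects follow directly from this loop: a more frequent item simply runs more iterations per unit time.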
With small changes in parameterization, our model can more closely represent the entrenchment behavior seen in Pierrehumbert (2001). As shown in Figure 10, lowering the strength of memory decay, by multiplying the −p(x,t) term in Equation 1 by a constant factor ε < 1, results in less flattening of the parameter field as lenition proceeds. However, the distribution retains a wide tail of residual activation around its base. To keep the field narrow as time proceeds, we can alternate between production/perception cycles with a production bias and without. This resembles production of the category in contexts where the phonetic motivation for the bias is present versus contexts where it is absent (e.g. prosodically weak versus strong positions). In effect, Figure 11 was created by biasing only every other simulated production. This was done in addition to lowering the strength of memory decay as discussed above.
Figure 10. Lowering memory decay results in less flattening of the field as the lenition simulation proceeds.
Like the exemplar model above, the model described in this section can derive the properties of lenition assumed by the exemplar model. Here we enumerate the functional equivalences between the two models with respect to these properties. In the dynamical model, variability in production is accounted for by noise during the gestural specification process. Each lexical item has its own fields and each production/perception loop causes a shift in the appropriate field towards lenition due to biases in production (see Figure 8 for an example of a field starting to skew to the left). In a given period of time, the number of production/perception loops an item goes through is proportional to its frequency. Thus, the amount of lenition associated with a given item shows gradient variation according to the item’s frequency. All the processes described here occur within a single individual, so lenition is clearly observable within the speech of individuals. Diachronically, lenition will proceed at a faster rate for more frequent items, again because they go through more production/perception loops in a given time frame. This same mechanism is evident synchronically as well, since at any single point in time, more frequent items will be more lenited than less frequent items.
Figure 11. Interleaving biased and non-biased productions leads to a consistently narrow field.
In sum, the broad proposal of this section is that diachronic change can be seen as the evolution of lexical representations at slow time scales. The specific focus has been to demonstrate that certain lenition effects, described in a previous exemplar model, can also be captured in our model of evolving activation fields.
3. Conclusion

We have presented a dynamical model of speech planning at the featural or vocal tract variable level. This model allows us to provide an alternative to an exemplar-based account of lenition. The dynamical and exemplar models cover the same ground as far as their broad agreement with the assumed properties of an evolving lenition process is concerned. However, there are fundamental high-level differences between the two. Tables 2 and 3 contrast properties of the exemplar and dynamical models.
Table 2. Properties of the exemplar model

1. Every token of a category (where category could mean any item capable of being recognized - word, phoneme, animal cry, etc.) is explicitly stored as an exemplar in memory. A new experience never alters an old exemplar (Hintzman, 1986).
2. The complete set of exemplars forms an extensional definition of a probability distribution capturing variability of a category.
3. Distributions are altered by storing more exemplars.

Table 3. Properties of the dynamical model
1. Every token of a category is used to dynamically alter a single representation in memory associated with that category, and is then discarded. No exemplars are stored.
2. Variability is directly encoded by the singular representation of a category. The parameters of a category exist as field approximations to probability distributions which are defined intensionally. That is, they are represented by functions, rather than a set of exemplars.
3. Distributions are altered by dynamical rules defining the impact of a token on a distribution, and changes to the distribution related to the passage of time.

Two key differences are highlighted. First, the dynamical model remains consistent with one key aspect of generative theories of representation. Instead of representing categories extensionally as arbitrarily large exemplar sets, linguistic units and their parameters can have singular representations². These are the fields in our specific proposal. It is these unitary representations, rather than a token by token expansion of the exemplar sets, that drift in sound change. In this sense, our model is similar to other non-exemplar-based models of the lexicon such as Lahiri & Reetz’s (2002) model while still admitting phonetic detail in lexical entries (see previous
chapter of this volume by Nguyen, Wauquier & Tuller, for relevant discussion). Second, the dynamical model is inherently temporal. Since both the exemplar and the dynamical model are at least programmatically designed to include production and perception, which unfold in time, this seems to be a key property. In an extension of the present model, we aim to link perceptual to motor representations and to provide an account of the effects of certain lexical factors (such as neighborhood density and frequency) on the time-course of speech production. Such an account would contribute to the larger goal of establishing an explicit link between the substantial literature on the time-course of word planning and linguistic theories of representation.
Acknowledgments We wish to thank Matt Goldrick for his comments on an earlier draft. Many thanks also to the two anonymous reviewers and the editors for detailed and cogent commentary on the manuscript. Research supported by NIH Grant HD-01994 to Haskins Labs. AG also acknowledges support from an Alexander von Humboldt Research Fellowship.
Notes

1. Authors’ names are listed in alphabetical order. Correspondence should be addressed to both authors,
[email protected],
[email protected]
2. It is useful to distinguish the exemplar approach from a version of the dynamical one where multiple different instances of a category are stored, corresponding to different registers, different speakers, etc. For our purposes, each of these subcategories is considered unique and has a singular representation.
References

Beckman, Mary E., de Jong, Ken, Jun, Sun-Ah, & Lee, Sook-hyang 1992 The interaction of coarticulation and prosody in sound change. Language and Speech, 35:45–58.
Browman, Catherine P., & Goldstein, Louis 1990 Gestural specification using dynamically defined articulatory structures. Journal of Phonetics, 18:299–320.
Bybee, Joan 2003 Lexical diffusion in regular sound change. In Restle, D., & Zaefferer, D. (eds), Sounds and Systems: Studies in Structure and Change. Mouton de Gruyter, Berlin, pp. 58–74.
Byrd, Dani & Saltzman, Elliot 2003 The elastic phrase: modeling the dynamics of boundary-adjacent lengthening. Journal of Phonetics, 31(2):149–180.
Cser, András 2003 The Typology and Modelling of Obstruent Lenition and Fortition Processes. Akadémiai Kiadó.
Dell, Gary S. 2000 Lexical representation, counting, and connectionism. In Broe, Michael B., & Pierrehumbert, Janet (eds), Papers in Laboratory Phonology V. Cambridge University Press, Cambridge UK, pp. 335–348.
Erlhagen, Wolfram, & Schöner, Gregor 2002 Dynamic field theory of movement preparation. Psychological Review, 109:545–572.
Guenther, Frank H. 1995 Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychological Review, 102:594–621.
Gurevich, Naomi 2004 Lenition and contrast: functional consequences of certain phonetically conditioned sound changes. Outstanding Dissertations in Linguistics Series. Routledge, NY.
Hintzman, Douglas H. 1986 “Schema” abstraction in a multiple-trace memory model. Psychological Review, 93(4):411–428.
Keating, Patricia A. 1990 The window model of coarticulation: articulatory evidence. In Beckman, M. E., & Kingston, J. (eds), Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech. Cambridge University Press, Cambridge, pp. 451–470.
Kiparsky, Paul 1968 Linguistic universals and linguistic change. In Bach, Emmon, & Harms, Robert (eds), Universals in Linguistic Theory. Holt, Rinehart and Winston, New York, pp. 170–202.
Lahiri, Aditi & Reetz, Henning 2002 Underspecified recognition. In Gussenhoven, C., & Warner, N. (eds), Papers in Laboratory Phonology VII. Mouton de Gruyter, Berlin, pp. 637–675.
Lavoie, Lisa 2001 Consonant strength: Phonological patterns and phonetic manifestations. Outstanding Dissertations in Linguistics Series. Garland, NY.
Pierrehumbert, Janet 2001 Exemplar dynamics: word frequency, lenition, and contrast. In Bybee, J., & Hopper, P. (eds), Frequency Effects and the Emergence of Linguistic Structure. John Benjamins, Amsterdam, pp. 137–157.
Saltzman, Elliot L., & Munhall, Kevin G. 1989 A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1:333–382.
Wright, Richard 1994 Coda lenition in American English consonants: An EPG study. The Journal of the Acoustical Society of America, 95:2819–2836.
Cross-linguistic trends in the perception of place of articulation in stop consonants: A comparison between Hungarian and French

Willy Serniclaes and Christian Geng

1. Introduction

A basic question in the study of speech development during infancy is to understand how the perceptual predispositions of the pre-linguistic child contribute to phonological features in a given language. During a period extending from birth to some six months of age, children perceive many of the phonological contrasts present in the world's languages. Pre-linguistic children react to the acoustic differences between various phonological categories, whereas they do not react to differences within categories (Kuhl, 2004). This indicates that the perceptual processes are somehow prepared from birth for handling phonological contrasts. Being independent of any specific language, these perceptual processes correspond to universal capacities, henceforth "predispositions", and there is a long-standing debate on their nature. For the Motor theory of speech perception (Liberman & Mattingly, 1985), the predispositions are part of a phonetic module specialized for the perception of articulatory features, such as tongue position, interarticulator timing, etc. Alternatively, the predispositions would be psychoacoustic in nature (Jusczyk, 1997; Gerken & Aslin, 2005) and correspond to acoustic features such as the direction of formant transition frequencies, tone onset time, etc. While the issue of the phonetic vs. acoustic nature of the predispositions is certainly of great interest, another fundamental question concerns the adaptation of the universal predispositions to the perception of language-specific categories. We will address this last question in the present paper, leaving aside as far as possible the issue of the specific nature of these predispositions.

Possible answers in current theories of speech development are that phonological categories (1) are acquired through selection of predispositions relevant for perceiving universal features (Pegg & Werker, 1997); (2) emerge by building up prototypes in an acoustic space, without straightforward relationship with the predispositions (Kuhl, 2004). According to selectionist approaches in their strongest form, phonological contrasts should
conform to the perceptual predispositions and adaptation to the language environment would then proceed by mere selection of predispositions (Pegg & Werker, 1997). The orthogonal position argues for a language-specific morphing of the acoustic space and almost free configurability of vowels in this space. A compromise between the strong innatist position and the prototypical approach is that adaptation of the perceptual predispositions to the linguistic paradigm of the ambient language proceeds not only by selection but also by combinations between predispositions (Serniclaes, 2000; Hoonhorst, Colin, Radeau, Deltenre, & Serniclaes, 2006). As there is fairly strong evidence for the existence of predispositions for the perception of virtually all possible phonetic contrasts in the world’s languages (e.g. Vihman, 1996), we cannot avoid contemplating the constraints imposed by such predispositions on the build-up of phonological categories prevailing in a specific language.

A question similar to the one raised by the acquisition of phonological systems by individuals is to understand how universal features contribute to the genesis of phonological systems. These questions are fairly similar as they pertain to the build-up of language-specific phonological features from a universal set of features. Somewhat in analogy with the developmental models, phonetic models of cross-linguistic diversity describe phonological systems as (1) combinations between universal features (Jakobson, Fant, & Halle, 1952); (2) optimization of distances between categories in some acoustic space (Liljencrants & Lindblom, 1972). In the featural approach to phonological systems, categories are the by-product of feature combinations and the latter can either be orthogonal or not. Cross-linguistic differences in consonant place of articulation offer a typical example of nonorthogonal combination between two different binary features, many languages displaying three place-of-articulation categories instead of the 17 potential ones for consonants in the world's languages (Ladefoged & Maddieson, 1996). More specifically for the purpose of the present study on the perception of stop consonants, two different binary features generally give rise to three categories, /b, d, g/, instead of the four potential ones, /b, d, ɟ, g/, in the Jakobsonian framework, which was explicitly defined with reference to acoustics and perception (Jakobson et al., 1952). Nonorthogonal combinations between features in system building are conceptually similar to those occurring during phonological development (see hereafter feature "couplings"): in both cases the features are used in such a way as to construct, or to appropriate, the categories prevailing in a given language. Distance models take quite a different approach. Liljencrants & Lindblom
(1972) were the first to relate phonetic principles of perceptual contrast to the structure of vowel inventories and their sizes. In short, languages prefer vowels which are maximally distinct for the perceiver and are produced with the least effort for the producer. Dispersion approaches to phonological systems match nicely with the prototype approach to speech development: in both cases categories freely emerge in a uniform psycho-acoustic space. The language maximizes the perceptual distances between categories (Liljencrants & Lindblom, 1972) and the perceptual system recovers the categories by creating maximally different prototypes (Guenther & Bohland, 2002). However, neither the prototypical approach to speech development nor distance models of phonological systems take account of the possible role of initial conditions in the build-up of the phonological spaces. In their strongest forms, phonological contrasts should conform to the perceptual predispositions and adaptation to the language environment would then proceed by mere selection of predispositions. However, this seems hardly tenable in view of the diversity of phonological contrasts and their plasticity across phonetic contexts. Here we present some further evidence in support of the hypothesis that adaptation of the universal predispositions to the ambient language proceeds by combinations between predispositions (Serniclaes, 2000).

1.1. Models of speech perception development

In many instances, adult listeners display “categorical perception”: they are much more sensitive to differences between speech sounds belonging to two different feature categories than to a similar acoustic variation between sounds belonging to the same category, i.e. when no boundary is crossed (see for a review: Harnad, 1987). Still, there are also examples of non-categorical perception in adults, illustrated by better discrimination of within vs. between-category differences (e.g. Massaro, 1987). However, the origins of this non-categorical perception are not entirely clear and might at least partly be due to the existence of subordinate categories, i.e. categorical distinctions without explicit labels that are finer-grained than those under scope in a given task (Snowdon, 1987). The existence of subordinate categories along speech continua is congruent with the hypothesis that the perception of the phonological categories evidenced in adults arises
from combinations of predispositions present in the pre-linguistic child (see hereafter "feature couplings").

Predispositions for categorical perception have been evidenced in the pre-linguistic child (below six months of age; see for a review: Vihman, 1996). For instance, infants younger than six months are sensitive to both negative and positive natural VOT boundaries whatever their linguistic background (Spanish, Kikuyu: Lasky, Syrdal-Lasky & Klein, 1975). Similar predispositions were evidenced for place of articulation (Eimas, 1974). The initial ability to discriminate the universal set of phonetic contrasts however appears to decline in the absence of specific language experience. The decline occurs within the first year of life (Werker & Tees, 1984a) and it involves a change in processing strategies rather than a sensorineural loss (Werker & Tees, 1984b). Finally, repeated exposure to the sounds of a given language also gives rise to facilitation effects (Kuhl, Stevens, Hayashi, Deguchi, Kiritani, & Iverson, 2006).

How do phonological features arise from predispositions? One possible answer to this question is that phonological features are acquired through selection of predispositions relevant for perceiving categories in a given language (Pegg & Werker, 1997). Another possibility is that phonological categories emerge by exposure to the sounds present in a given language without relationship to the predispositions (Kuhl, 2004). While it seems evident that adaptation to a specific language does not proceed entirely through selective processes, there is evidence that the emergence of phonological percepts is somehow constrained by predispositions. For instance, in a French speaking environment the discrimination of VOT in young infants below the age of six months is organized around universal boundaries located at approximately -30 ms and +30 ms, whereas infants above six months discriminate the adult VOT boundary, which is located at 0 ms (Hoonhorst et al., 2006). The phonological coupling hypothesis (Serniclaes, 2000) explains the emergence of language-specific boundaries that do not coincide with universal ones by interactions between the universal predispositions. For example, the predispositions for perceiving either negative or positive VOT cannot directly account for the 0 ms VOT boundary, as children below some six months of age are not sensitive to this boundary. This boundary might simply result from the addition of the raw acoustic inputs, which would imply that the predispositions are deactivated. However, experiments with stimuli generated by factorial variation of negative and positive VOT (i.e. stimuli in which the two cues are in conflict) suggest that the 0 ms boundary is obtained by integrating
the cues after interactive processing by the predispositions, i.e. after categorisation of positive VOT has affected categorisation of negative VOT, and conversely (Serniclaes, 2000). Such interactive integration is a special instance of the general concept of “coupling” between perceptual entities (for a review see Masin, 1993).

It is well known that different acoustic cues contribute to the perception of the same phonological feature (Repp, 1982). One might wonder whether couplings between predispositions underlie the integration of all the cues which contribute to the same feature. This would be the case if each acoustic cue was processed by a different predisposition. There are however two arguments against generalized couplings. First, the integration of some acoustic cues results from psychoacoustic interactions, an obvious example being given by the trade-off between burst duration and intensity for the perception of the voicing feature (Repp, 1979). The second argument against generalized couplings is that acoustic cues which are independent on acoustic grounds might nevertheless be part of the same predisposition because they are tied to the same articulatory dimension. However, this latter argument is only valid if we take for granted that predispositions are phonetic in nature.

1.2. Place of articulation perception

The perception of place of articulation in stop consonants offers additional evidence for couplings between features. There are basically two different kinds of acoustic cues involved in place perception: those carried by the burst and those carried by the formant transitions. Although these cues might correspond to different predispositions, as a starting point we will consider here that burst and transitions are part of the same predisposition (be it for psychoacoustic or phonetic reasons). Further, we will start from the transition-based description of the place features afforded by the Distinctive Region Model (DRM) of place production (Carré & Mrayati, 1991). The DRM is organized around the neutral vowel (schwa) as a central reference. In the neutral vowel context, place boundaries tend to correspond to flat F2-F3 transitions, the categories being characterized by rising vs. falling transitions (Figure 1). The four possible combinations between F2 and F3 transition directions generate four Distinctive Regions along the anterior-posterior dimension with the following specifications: F2-F3 both rising (R8), F2 rising-F3 falling (R7), F2-F3 both falling (R6), F2 falling-F3 rising (R5).
Although there are no clearcut correspondences between Distinctive Regions and articulatory descriptions of place categories, the R8, R7, R6, and R5 regions are usually ascribed to labial (/b/), dental (void in French), alveolar (/d/) and velar (/g/) places of articulation, in that order.
[Figure 1a plots the stimuli S1–S14 in the F2 onset × F3 onset space (in Hz; F2 onset from 500 to 2500 Hz, F3 onset from 1500 to 3000 Hz), with the regions R8 (labial), R7 (dental), R6 (alveopalatal) and R5 (velar); Figures 1b–1d show the boundary configurations predicted by the different models.]
Figure 1. Locations of potential place categories in the F2-F3 transition onset space according to the Distinctive Region Model of speech production (Fig. 1a) and of perceptual boundaries according to different models of language acquisition (Figs. 1b, 1c, 1d). According to the DRM, changes in size of four different articulatory regions allow separate control over the direction of F2 and F3 transitions. S1 … S14 indicate the location of the stimuli of the present experiment in the F2-F3 transition onset space (Fig. 1a).
For the purpose of relating the DRM to the four Hungarian stop categories, our working hypothesis in the present study was that palatal stops (/ɟ/) would occupy the R6 region in this language and that the remaining category (/d/) would occupy the R7 region. Locating the Hungarian palatal in the R6 region is congruent with its complex nature (Keating, 1988), with both dorsal (like the velar) and coronal (like the alveolar) checkmarks. Although Hungarian /d/ stops are usually considered alveolars rather than dentals, there are large individual variations in coronal place of articulation (as evidenced by articulatory investigations in English and French: Dart, 1998). Therefore, it is not unreasonable to postulate that the presence of palatals in the R6 region might push the Hungarian /d/ stops inside the dental region (R7).

Congruent with the acoustic differences between place categories, place perception is grounded in changes in transition direction in the neutral context (Serniclaes & Carré, 2002). However, while the perceptual place-of-articulation boundaries correspond to flat transitions in the neutral context, they undergo specific adjustments in other contexts. The place boundary is shifted towards falling transitions before back rounded vowels, rising transitions before front unrounded vowels, and intermediate positions before front rounded vowels. The radial model of place perception states that the contextual adjustments of the transition boundary follow a rotational movement in the F(onset) – F(endpoint) plane around a central point corresponding to the flat transition in the neutral context, the direction of the boundary line depending on the perceived identity of the following vowel (Serniclaes & Carré, 2002).

Though place perception is strongly dependent on phonetic context, the fact that place boundaries correspond to flat F2 and F3 transitions in the neutral vowel context points to a relationship with natural psychoacoustic settings. Both infants below 9 months of age and adults are much less sensitive to a difference between two different falling or rising frequency transitions than to a difference between a falling and a rising transition (Aslin, 1989). This suggests that flat transition boundaries correspond to basic psychoacoustic limitations. Although psychoacoustic in nature, the sensitivity to changes in the direction of frequency transitions might be adapted for perceiving place of articulation during language development. Alternatively, it is also possible that sensitivity to differences in the frequency transition direction was integrated into a speech-specific module during phylogenetic evolution (Liberman, 1998). What is clear is that flat transition
boundaries are not directly usable for perceiving consonant place of articulation in all languages. The flat F2-F3 transitions might be straightforwardly used for perceiving place of articulation contrasts in a four-category language such as Hungarian. Each category would then occupy a single region of the formant transition onset F2-F3 acoustic space (Figure 1b). In a three-category language such as French, the perceptual boundaries afforded by flat F2 and F3 transitions might also be used as such for perceiving place contrasts, but one region would then be perceptually void. This is probably what an entirely selective model of predispositions would predict, although the proponents of selective models did not address the present conjecture. However, as explained above, a more realistic model of speech development with predispositions is that there are also couplings between predispositions. In the present case, coupling between the predispositions for perceiving the F2 and F3 transitions might give rise to a new boundary that would be settled in the middle of the “void” region, while the two other boundaries stick to the natural settings (Figure 1c). While these two models of perceptual development are grounded on predispositions, a distance optimization view would call on the categories present in the environment for positioning the boundaries. Under this view, categories would tend to divide the space into three equal regions and the boundaries would be settled accordingly (Figure 1d), without evident relationship with natural settings.

It should be stressed that while coupling between perceptual predispositions – such as those for perceiving changes in the direction of frequency transitions – might proceed in a variety of ways, the simplest hypothesis is that coupling is linear and simply additive, i.e. with equal weightings. Linear relationships are simpler on mathematical grounds than are nonlinear functions. Further, there are positive reasons for preferring linear functions when dealing with frequency transitions. Linear relationships between the onset and offset points of frequency transitions contribute to the discrimination between /b, d, g/ place of articulation categories (Sussman, Fruchter, Hilbert, & Sirosh, 1998). Finally, linear processing of frequency transitions might be anchored in highly efficient processes evidenced in animals (bats), which might also be present in some phylogenetically derived form in humans (Sussman et al., 1998). Now, if linear relationships prevail in the processing of single frequency transitions, couplings between transitions should naturally follow linear rules. As to the additivity hypothesis, couplings might well rely on weighted combinations in which one transition might be more important than the other. However, if this were
true, the boundaries might take a wide range of different directions in the F2-F3 space. The predictions of the coupling model would then not be different from those of the ‘Free Dispersion’ model.

1.3. The present study

We recently collected perceptual evidence in support of combinations between phonetic features for place of articulation (in French: Serniclaes, Bogliotti, & Carré, 2003; in Hungarian: Geng, Mády, Bogliotti, Messaoud-Galusi, Medina, & Serniclaes, 2005). While phonological boundaries did not always correspond to those included in the universal predispositions, thereby confirming that simple selectionist approaches cannot account for perceptual development on their own, all the phonological boundaries were somehow related to the universal ones. This lends support to the hypothesis that phonological boundaries which are seemingly unrelated to the perceptual predispositions arise from ‘couplings’ (i.e. interactive combinations) between predispositions (Serniclaes, 2000). Also consistent with couplings is that similar place of articulation boundaries were found for the distinctions that are shared by French and Hungarian, despite the fact that Hungarian uses four place categories whereas French only uses three. However, no direct comparisons between the French and Hungarian boundaries were performed in our previous reports.

Here we present some new evidence in support of coupling between predispositions in the perception of phonological features, based on cross-linguistic comparisons. A direct comparison between the labelling responses of French and Hungarian listeners to the same stimuli was used to test the similarities and differences between these two languages. We expected that both languages would display the same perceptual boundaries for the distinctions they share in the F2-F3 transition onset space. Further, we wanted to confirm that these boundaries correspond to natural boundaries or to some coupling between natural boundaries (see Figure 1c).
2. Method

2.1. Participants

Participants for the Hungarian subset were (a) students in an undergraduate linguistics course or (b) volunteers contacted via a mailing list. Apart from their first language, all of them were familiar with at least one of German, French or English. Most of them were participants in undergraduate exchange programs. They were between 18 and 53 years old, with no reading or hearing impairment reported. The French dataset was similar in age structure, with subjects' ages ranging between 17 and 59 years. Likewise, there were no known auditory problems.

2.2. Stimuli

Twenty-three CV stimuli were generated with a parallel formant synthesizer (conceived by R. Carré: http://www.tsi.enst.fr/~carre/). The F1, F2 and F3 transitions ended at 500, 1500 and 2500 Hz respectively after a 27 ms transition. The VOT was set to -95 ms and the stable vocalic portion had a duration of 154 ms. The stimuli differed as to the onsets of the F2 and F3 transitions. Fourteen stimuli were generated by separate modification of the F2 and F3 onsets along a "phonetic" continuum, normal to the locations of the natural boundaries – corresponding to either flat F2 or flat F3 transitions – as shown in Figure 1. Nine other stimuli were generated by joint modification of the F2 and F3 onsets along a "phonological" continuum normal to the expected category boundaries separating the F2-F3 space into three distinct regions (for further details see Serniclaes et al., 2003). Successive stimuli were 1 Bark apart on both continua. The present paper only deals with the data of the “phonetic” continuum. The same set of stimuli was generated with the same basic settings but an additional, constant burst-like signal portion.

2.3. Procedure

Both continua were presented to each of the participants. The continua with and without burst were presented in alternating order, resulting in a between-subject factor (order of presentation) which was used for control purposes. Hungarian participants were told that they would hear one of the
four sounds “b”, “d”, “gy” or “g” and were instructed to report which of the four sounds they had heard. They were told that the sounds were not necessarily presented with equal frequency, and to judge each sound separately. For the French participants, the procedure was the same except that only three response alternatives were available.

2.4. Statistical models

The data were fitted by Nonlinear Regression with a model in which the effect of F2 was nested in the effect of F3, and the latter was nested in the effects of Residual cues (i.e. the acoustic cues for place which were constant in the stimuli). This model is instantiated by Equations 1 and 2. Labelling responses depend on a Logistic Regression (LR) equation including Residual cues, a nested LR equation including F3, itself including a nested LR equation including F2. Each LR equation included different variables representing the effects of Burst and Language.

Equation 1. Logistic function

Φ(γ(cues)) = 1 / (1 + e^(-γ(cues)))

where Φ is the Logistic function and γ is a Linear function. The model used here is composed of three nested Logistic functions, as specified in Equation 2.

Equation 2. Coupling model

Labelling response = Φ(γ(R, Φ(γ(F3, Φ(γ(F2))))))

where R stands for “Residual cues”. Φ and γ have the following parameters:

γ(R, Φ(γ(F3, Φ(γ(F2))))) = A1 + B11*Burst + B12*Lang + B13*Burst*Lang + K1*Φ(γ(F3, Φ(γ(F2))))

γ(F3, Φ(γ(F2))) = A21 + A22*F3 + B21*F3*Burst + B22*F3*Lang + B23*F3*Burst*Lang + K2*Φ(γ(F2))

γ(F2) = A31 + A32*F2 + B31*F2*Burst + B32*F2*Lang + B33*F2*Burst*Lang
Here F2 and F3 are the formant frequencies scaled in Barks. “Burst” and “Lang” are dichotomous variables: “Burst” stands for the presence vs. absence of a noise burst in the stimuli, and “Lang” stands for the Hungarian vs. French group of participants.
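To illustrate, the nested structure of Equation 2 can be written out directly. The following minimal sketch in Python leaves all parameter values as inputs; in the study they were estimated by Nonlinear Regression (e.g., nonlinear least squares such as scipy.optimize.curve_fit would serve), and the values themselves are not reproduced here.

```python
import math

def logistic(x):
    # Equation 1: logistic function of a linear predictor
    return 1.0 / (1.0 + math.exp(-x))

def coupling_model(F2, F3, burst, lang, A, B, K):
    """Hierarchical coupling model (Equation 2). F2 and F3 are
    Bark-scaled onset frequencies; burst and lang are 0/1 codes."""
    # Innermost level: F2
    g_F2 = (A["31"] + A["32"] * F2 + B["31"] * F2 * burst
            + B["32"] * F2 * lang + B["33"] * F2 * burst * lang)
    p_F2 = logistic(g_F2)
    # Middle level: F3, with the F2 output nested via K2
    g_F3 = (A["21"] + A["22"] * F3 + B["21"] * F3 * burst
            + B["22"] * F3 * lang + B["23"] * F3 * burst * lang
            + K["2"] * p_F2)
    p_F3 = logistic(g_F3)
    # Outer level: residual cues, with the F3 output nested via K1
    g_R = (A["1"] + B["11"] * burst + B["12"] * lang
           + B["13"] * burst * lang + K["1"] * p_F3)
    return logistic(g_R)
```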
This is a ‘hierarchical coupling’ model. Coupling means perceptual interdependencies in the processing of different features, and, accordingly, the model includes interdependencies in the perception of the different acoustic cues which convey these features. However, rather than being symmetrical, the couplings in Equation 2 are hierarchical, a working assumption made for the sake of simplicity. A symmetrical model would indeed require feedback loops in the processing of the different cues.
3. Results

[Figure 2 near here: four panels of labelling curves (Hungarian without burst, Hungarian with burst, French without burst, French with burst); x-axis: stimulus number (1-14); y-axis: percentage of labelling responses (0-100%); response categories: labial, alveolar, palatal, velar.]
Figure 2. Labelling curves for the stimuli with or without burst in Hungarian and French.
The labelling curves for the stimuli with or without burst in French and Hungarian are presented in Figure 2. Although there are obvious differences between the labelling curves for the stimuli with vs. without burst, the location of the boundaries (i.e. the stimuli collecting an equal number of responses for two adjacent categories) is only marginally affected. In French, there are three boundaries corresponding to (from left to right in Figure 2) the alveolar/labial, the labial/velar and the velar/alveolar distinctions. Interestingly, there is a secondary peak of velar responses around the alveolar/labial boundary, mainly for the stimuli without burst. In Hungarian, there are four boundaries corresponding to (from right to left in Figure 2) the palatal/alveolar, the alveolar/labial, the labial/velar and the velar/alveolar distinctions.
The distinctions between palatals, alveolars and velars are not very clear-cut (Figure 2). However, the Hungarian palatal and alveolar functions, when taken together, correspond fairly well to the French alveolar function. This is clear from Table 1, which gives the different boundaries, assessed by linear interpolation around the 50% response points on the mean labelling functions (Figure 2). Examination of these boundaries indicates that while the alveolar/palatal boundary in Hungarian is fairly close to the expected flat F2 transition value, the alveolar/labial boundary in this language is far from the expected flat F3 transition value and much closer to the French alveolar/labial boundary.

Table 1. Values of formant transitions at the perceptual boundary for the different place contrasts, for each burst condition (without vs. with) and for each language (Hungarian vs. French). The boundary values are given separately for the F2 and F3 transitions and scaled as the distances in Barks between the onset and offset frequencies. Positive values indicate falling transitions, negative values indicate rising transitions. Relevant values for assessing the hypotheses are presented in bold.

For the labial/velar contrast, the F2 boundary values are fairly close to the expected flat boundary transition, as indicated by the near-zero values, both for the stimuli with and without bursts and in both languages. For the velar/alveolar-palatal contrast, the values reported correspond to the velar/alveolar boundary in French and to the velar/palatal boundary in Hungarian. As expected, the F3 boundary values relative to this contrast are also fairly close to the flat boundary transition, both for the stimuli with and without bursts and in both languages. For the palatal/alveolar contrast, which is only present in Hungarian, the boundary is close to the
expected flat F2 transition for the stimuli with burst but not for the stimuli without burst. Finally, the Hungarian alveolar/labial boundary differs from the expected flat F3 boundary transition both for the stimuli with burst and for those without burst. The Hungarian alveolar/labial boundary is similar to the French alveolar/labial boundary, both corresponding approximately to a stimulus in which a rising F2 transition is compensated by a falling F3 transition, although the trade-off is not perfect.

                      Without burst                    With burst
                   Hungarian      French         Hungarian      French
                   ∂F2   ∂F3    ∂F2   ∂F3      ∂F2   ∂F3    ∂F2   ∂F3
labial/velar       0.3  -1.4    0.1  -1.4      0.1  -1.4   -0.1  -1.4
velar/alv.-pal.    1.7   0.0    1.7   0.2      1.7   0.3    1.7   0.3
palatal/alv.       1.0   1.4     –     –       0.3   1.4     –     –
alveolar/labial   -1.1   1.4   -2.0   1.4     -2.1   1.4   -1.9   1.4
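The boundary values above were obtained by linear interpolation around the 50% response points on the mean labelling functions; a minimal sketch of that computation in Python (the example response proportions are hypothetical, not the study's data):

```python
import numpy as np

def boundary_50(stimulus_values, response_proportions):
    """Linearly interpolate the stimulus value at which the labelling
    function crosses 50% (a perceptual boundary)."""
    x = np.asarray(stimulus_values, dtype=float)
    p = np.asarray(response_proportions, dtype=float)
    for i in range(len(p) - 1):
        lo, hi = p[i], p[i + 1]
        # Crossing detected when 0.5 lies between adjacent proportions
        if (lo - 0.5) * (hi - 0.5) <= 0 and lo != hi:
            return x[i] + (0.5 - lo) * (x[i + 1] - x[i]) / (hi - lo)
    return None  # no 50% crossing found

# Hypothetical labelling proportions along a 1-Bark-spaced continuum:
print(boundary_50([10, 11, 12, 13], [0.9, 0.7, 0.3, 0.1]))  # 11.5
```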
The data were fitted with Nonlinear Regressions (NLR) run on the hierarchical coupling model (see Method, Equations 1 and 2). A separate NLR was run for each contrast, i.e. labial/velar, velar/alveolar-palatal and alveolar-palatal/labial. NLR was used for testing the effect of language on place identification, as well as specific hypotheses on the location of the place boundaries in the F2-F3 onset transition space. As explained in the Introduction, we expected that the place contrasts which are common to both languages would display the same perceptual boundaries. We also wanted to confirm previous analyses, conducted separately on the data collected for each language, which showed that the place boundaries correspond to natural boundaries or to some coupling between natural boundaries. Specifically, we expected that the labial/velar boundary would correspond to a flat F2 transition, that the velar/alveolar-palatal boundary would correspond to a flat F3 transition, and that the alveolar-palatal/labial boundary would correspond to a trade-off between a rising F2 transition and a falling F3 transition (see Figure 1c). The boundary estimations are given in Table 2.
For the labial/velar contrast, the model only included an F2 component nested in a Residual cues component (cf. Eq. 2: Φ(γ(R, Φ(γ(F2)))) submodel). The effect of F3 and its interactions with Burst and Language were not significant. There were 7 significant parameters. Burst and Language biases were not significant. The effects of F2 (bias and slope), the Burst x F2, Language x F2 (all p < .001) and Burst x Language x F2 (p < .05) interactions were significant.
The labial/velar boundary corresponds to an almost flat F2 transition in both languages, both for the stimuli with and without bursts (Table 2).

Table 2. Values of formant transitions at the perceptual boundary for the place contrasts common to both languages, for each burst condition (without vs. with) and for each language (Hungarian vs. French). Each data cell gives the observed values, NLR estimations and 95% CI limits. For the labial/velar contrast, the boundary values are fairly close to the flat F2 boundary transition (11.2 Bark, 1500 Hz F2) in both languages, both for the stimuli with and without bursts. For the velar/alveolar-palatal contrast, the boundary values are close to the flat F3 boundary transition (14.5 Bark, 2500 Hz F3) in both languages, both for the stimuli with and without bursts. For the alveolar-palatal/labial contrast, boundary values are indexed by the ratio of the extent of the F3 transition to the extent of the F2 transition in Bark. The F3/F2 transition extent ratio is fairly close to 1, except for the Hungarian data for stimuli with burst, and never significantly different from 1. This suggests that the alveolar-palatal/labial boundary corresponds to a trade-off between a rising F2 and a falling F3 transition in both languages.

Contrast                            Without burst            With burst
                                    Hungarian   French       Hungarian   French
labial/velar        Observed        11.5        11.3         11.3        11.1
(F2 onset)          NLR             11.6        11.1         11.3        11.0
                    CI limits       11.2-12.2   10.7-11.6    10.9-11.9   10.7-11.5
velar/alv.-pal.     Observed        14.3        14.7         14.8        14.8
(F3 onset)          NLR             14.3        14.7         14.8        14.9
                    CI limits       13.8-14.6   14.6-14.9    14.7-14.9   14.9-15.0
alv.-pal./labial    Observed        1.0         0.7          0.5         0.7
(F3/F2 extent)      NLR             0.9         0.9          0.7         0.7
                    CI limits       0.5-1.3     0.5-1.3      0.4-1.0     0.4-1.0
For the velar/alveolar-palatal contrast, the model included an F3 component nested in a Residual cues component (cf. Eq. 2: Φ(γ(R, Φ(γ(F3)))) submodel). The effect of F2 (bias and slope) and its interactions with Burst and Language were not significant. There were 8 significant parameters. The effects of the Residual cues, Burst bias, Language bias and Burst x Language bias were significant (all p < .001). The effects of F3 (p < .001) and the Burst x F3 interaction (p < .01) were also significant. The velar/alveolar-palatal boundary corresponds to an almost flat F3 transition for
the stimuli without burst and a slightly falling F3 transition for the stimuli with burst (Table 2).
For the alveolar-palatal/labial contrast, the model included an F2 component nested in an F3 component (cf. Eq. 2: Φ(γ(F3, Φ(γ(F2)))) submodel). The effects of the Residual cues as well as the Burst and Language biases were not significant. There were 6 significant parameters. The effects of F2 and F3 (bias and slope), as well as the Burst x F2 and Burst x F3 interactions, were significant (all p < .001). The trade-offs between F2 and F3 transition onset values are presented in Table 2, per language and burst condition. A rising F2 transition is compensated by a falling F3 transition in both languages and both burst conditions, indicating that the alveolar-palatal/labial boundary corresponds to a trade-off between a rising F2 and a falling F3 transition.
[Figure 3 near here: two panels of observed response scores with NLR and LR predictions; x-axis: stimuli S8-S15; y-axis: percentage of responses (0-100%); legend: Observed, NLR Pred., LR Pred.]
Figure 3. Examples of the relative failure of the Logistic Regression (LR) vs. Nonlinear Regression (NLR) for assessing perceptual boundaries (50% response points). Observed and expected response scores for the labial/velar contrast in French (right) and for the alveolar-palatal/velar contrast in Hungarian (left). In both cases the assessment of the boundary is much better with NLR than with LR.
The performances of the NLR models were compared to those of simple Logistic Regressions with the same number of parameters. The percentage of explained variance amounted to 63.4% with NLR vs. 61.8% with
LR for the labial/velar contrast, to 40% with NLR vs. 38% with LR for the velar/alveolar-palatal contrast, and to 64.1% with NLR vs. 60.4% with LR for the alveolar-palatal/labial contrast. The NLR models fitted the data better than the simple Logistic Regressions, although the quantitative differences are fairly small. These differences are nevertheless far from negligible, because the differences between expected and observed boundaries are much larger with the LR than with the NLR models. This is illustrated with two different examples in Figure 3.
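The chapter does not spell out how the percentage of explained variance was computed; a standard formulation, given here as a sketch under that assumption, is:

```python
import numpy as np

def explained_variance_pct(observed, predicted):
    """Percentage of variance in the labelling scores accounted for
    by a model's predictions: 100 * (1 - SS_residual / SS_total)."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 100.0 * (1.0 - ss_res / ss_tot)
```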
[Figure 4 near here: two panels of territorial maps (without burst, with burst); x-axis: F2 onset (Bark, 9-13); y-axis: F3 onset (Bark, 12-16).]
Figure 4. NLR estimations of the territorial maps per burst condition, with French boundaries (solid lines) and Hungarian boundaries (dotted lines). The labial/alveolar-palatal boundaries of the two languages overlap.
Territorial maps of the place categories in the F2-F3 onset frequency space are presented in Figure 4. These maps were obtained by calculating the boundaries between categories from the outputs of the Nonlinear Regressions (Equation 2). For both the stimuli with and without burst, the velar region corresponds to the lower right quadrant, with boundaries corresponding to fairly flat F2 and F3 transitions (see Table 2 for details). The labial/alveolar-palatal boundary corresponds to the trade-off between a rising F2 and a falling F3 transition. There is some tendency for the velar region to be narrower in Hungarian, but the differences between languages are fairly small. As for the differences between the dispersion and coupling theories (Fig. 1c vs. 1d), the confidence interval for the labial/velar boundary includes the 11.5 Bark value predicted by the coupling theory for each Burst and Language condition, whereas the 10.6 Bark value predicted by the dispersion theory
falls outside the confidence interval in all conditions. For the velar/alveolar-palatal boundary, the confidence interval includes the 14.5 Bark value predicted by the coupling theory in one of the four conditions, whereas the 14.8 Bark value predicted by the dispersion theory falls inside the confidence interval in two of the four conditions.
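This comparison for the labial/velar boundary can be checked directly against the CI limits reported in Table 2; a minimal sketch:

```python
def within_ci(predicted, ci):
    lo, hi = ci
    return lo <= predicted <= hi

# 95% CI limits for the labial/velar boundary (Table 2)
cis = {
    "Hungarian, without burst": (11.2, 12.2),
    "French, without burst": (10.7, 11.6),
    "Hungarian, with burst": (10.9, 11.9),
    "French, with burst": (10.7, 11.5),
}
for condition, ci in cis.items():
    print(condition,
          "coupling (11.5 Bark):", within_ci(11.5, ci),   # True everywhere
          "dispersion (10.6 Bark):", within_ci(10.6, ci))  # False everywhere
```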
4. Discussion

4.1. Stability of place boundaries across languages

The present results show that transitional features are used in much the same way in Hungarian and French. Strikingly, the contrasts which are common to both languages use almost the same perceptual boundaries, especially for the stimuli with burst. Further, these common boundaries are not selected at random but correspond to qualitative changes in the direction of frequency transitions.

4.2. On the anchoring of place perception in natural boundaries

The place boundaries evidenced in the present study are clearly related to natural settings. The labial/velar distinction is based on the rising vs. falling direction of the F2 transition, which obviously corresponds to a natural boundary. Similarly, the alveolar-palatal/velar distinction is based on the direction of the F3 transition. These are two clear examples of direct implementation of natural boundaries in the phonological framework. Both infants and adults display a natural sensitivity for perceiving differences between rising and falling transitions in non-speech sounds (Aslin, 1989). This suggests that the flat transition boundaries evidenced in the present study correspond to basic psychoacoustic limitations in the processing of frequency transitions. The relevance of these boundaries for speech perception in two languages with quite different place of articulation settings, i.e. French and Hungarian, clearly demonstrates the role of predispositions in speech development.
However, it is also clear that predispositions for speech perception are not always directly suited for phonological purposes. Evidently, the distinction between labial and alveolar-palatal stops does not directly depend on either the F2 or the F3 transition alone. Rather, the bilabial/alveolar-palatal contrast calls on a trade-off between the two
transitions, a rising F2 being compensated by a falling F3 to yield a globally flat F2-F3 compound. This trade-off does not depend on the specificities of place production in the two languages. Rather, it corresponds to yet another qualitative difference between speech sounds: the difference between globally rising vs. globally falling F2 and F3 transitions. Though the difference in the global direction of the transitions is more complex than the differences between the individual directions of the F2 and F3 transitions, they all correspond to qualitative changes in transition direction. Such qualitative changes are highly specific, and the present results suggest that they impose strong constraints on the location of place boundaries.
Finally, the present data do not allow us to decide between the dispersion and coupling models. Dispersion Theory (Liljencrants & Lindblom, 1972; for a tentative expansion of this theory to consonants, see Abry, 2003) claims that languages tend to divide the acoustic space optimally between phonological categories. This theory would predict an equal sharing of the acoustic space between perceptual categories, a prediction which is fairly well supported by the location of the alveolar-palatal/velar boundary in the present results. However, the location of the labial/velar boundary conforms more to the predictions of the coupling theory. The present data are probably not sensitive enough to distinguish between these theories, as they make fairly equivalent predictions on the locations of the boundaries. Still another possibility is that perceptual development proceeds in two stages, couplings between predispositions being followed by an optimal partition of the acoustic space. This is an interesting perspective in view of the fairly slow development of speech perception after a rapid adaptation of predispositions during the first year of life (Burnham, Tyler, & Horlyck, 2002). While the present data were not designed to test the hypothesis of a two-stage “coupling then optimal dispersion” development, it should be addressed in future research comparing children and adults.

4.3. Place of articulation categories and distinctive regions

The results show that the perception of the fourfold place of articulation contrasts in Hungarian is partially based on the direction of the F2 and F3 formant transitions. However, these features are clearly not sufficient for supporting the alveolar/palatal contrast. Our working hypothesis in the present study was that the Hungarian /d/ stops should occupy the Distinctive Region with rising F2-falling F3 transitions (dental region, R7 in the
DRM) and that palatal stops should occupy the Distinctive Region with rising F2-rising F3 transitions (R6 in the DRM). The present results seem to support this working hypothesis in two different ways. First, the Hungarian /d/ responses peaked inside the R7 region (Figure 2). Second, the alveolar peak was fairly weak. This last finding is in agreement with the fact that the R7 region can only produce fairly unstable transitional F2-F3 patterns (Carré, Serniclaes & Marsico, 2003): if, as we assumed, the Hungarian /d/ stops are produced in this region, these percepts should only be weakly represented in the F2-F3 space, and this is indeed what we found.
Finally, as our data indicate that the French /d/ category covers the same regions as the Hungarian alveolar and palatal categories, it would seem that the French /d/ is perceived as alveo-palatal rather than alveo-dental. This is not directly compatible with the articulatory descriptions of /d/ as alveo-dental (see Ladefoged & Maddieson, 1996, p. 23). While the present findings need confirmation with natural speech tokens, it is not excluded that perceptual representations of place categories do not readily fit the articulatory representations. Perceptual representations do not have to be veridical representations of the vocal tract categories to operate reliable distinctions between sounds. The intrinsic value of the /d/ category does not matter: all that matters is that /d/ is distinguished from /b/ and from /g/ on acoustic grounds. However, such acoustically driven perceptual representations have to be transformed in specific ways to be related to motor representations.

4.4. Transitions vs. burst as vectors of place perception

The poor representation of the alveolar/palatal distinction between Hungarian consonants in the F2-F3 transition space suggests that broader couplings are necessary for stabilizing the alveolar percepts. Other features, among which those provided by the burst spectrum, might be necessary for the addition of the palatal to the three principal places of articulation. To reveal the contribution of the burst-related features, one has to use stimuli generated by factorial variation of burst and transitions.
There have been several attempts in the past at separating the contributions of burst and transitions to place perception in stop consonants, and most of these studies point to the functional equivalence of the two cues across phonetic contexts (e.g., Dorman, Studdert-Kennedy & Raphaël, 1977). However, these results were collected in languages with only three
place categories, which makes it rather difficult to evidence autonomous contributions of the two kinds of cues because both contribute to all three possible contrasts. Things might turn out differently in a four-category language like Hungarian, in which the different place contrasts might rely on different cues: as shown in the present results, transitions are sufficient as long as there is no alveolar/palatal contrast present. It is then possible that the perception of this contrast relies on burst properties which are independent of the onset frequencies of the formant transitions. Future experiments with stimuli generated by factorial variation of burst and transitions should help clarify this point.
Here one has to be aware that the phonemic status of the Hungarian palatal has been the topic of a long-standing debate on whether it should be classified as a true stop or an affricate (for a summary of the phonological treatment of this matter see Siptár & Törkenczy, 2000). Our own recordings of the Hungarian palatal (Geng & Mooshammer, 2004) suggested a high degree of acoustic variability in the realization of this sound, comprising clear stop, affricate, and even fricative-like realizations with signal portions we interpreted as a residual burst. More thorough and detailed spectral analyses of these data would be required before a conclusive categorization of the observed patterns is possible.

4.5. Some implications for phonological systems

From a systemic point of view, the present study lends further support to the idea that languages do not use all the possible combinations between two (or several) universal features. The data suggest that in languages with four place categories, such as Hungarian, F2 and F3 transitions are not sufficient for separating these categories and a third feature is necessary. The examination of laryngeal timing contrasts leads to similar conclusions. Although there are two different predispositions for perceiving negative and positive VOT, there are only three rather than four reliable VOT categories in the world's languages, and a third phonetic feature (manner of vocal fold vibration) is used in languages with four homorganic categories (Lisker & Abramson, 1964).
The fact that the category boundaries stay relatively stable between the two languages under consideration, even though one of them adds a place of articulation, seems to suggest that there is an upper limit on the number of place categories implementable on transitional cues alone, which is potentially interesting for the
architecture of phonological systems. We do not want to overemphasize this point, as there are other well-established mechanisms for the same explanandum, i.e. the distribution of voiced and voiceless stops at different places of articulation in the world's languages: Ohala (1983) related the relative frequencies of stop categories in the world's languages to aerodynamic and acoustic constraints, with /g/ and /p/ in particular being rare due either to the difficulty of maintaining voicing (for /g/) or to low acoustic salience (for /p/). Maddieson (2003) referred to these data as “missing /g/” and “missing /p/” phenomena, though he analyzed the latter as a regional phenomenon. In any case, it hardly seems possible to disentangle the effects of the different phonetic mechanisms on frequency data such as those available in the UPSID database. Neither do the results presented here rule out such accounts: the density effect observed in this study and the aerodynamic considerations made popular by Ohala could even act in a synergistic fashion in the constitution of the emergent asymmetries observed in the obstruent inventories of the world's languages.
5. Conclusion

The present findings bear several implications for both speech development and phonological systems. Firstly, the fact that, both in French and in Hungarian, perceptual boundaries align with natural boundaries for transition perception, or with some combination of these natural boundaries, gives further support to the coupling model of speech development. Further, as the alveolar/palatal Hungarian boundary is poorly represented in the transition space, at least one additional phonetic feature seems necessary for perceiving the fourfold place distinctions in Hungarian, thereby giving yet another example of coupling between universal features in the build-up of phoneme categories. Secondly, the fact that the boundaries are generated by lawful combinations of perceptual predispositions shows that the latter impose strong constraints on phonological development. An important question for future research will be to understand how these constraints might converge with processes based on distance between categories.
References

Abry, C. 2003. [b]-[d]-[g] as a universal triangle as acoustically optimal as [i]-[a]-[u]. In Solé, M.-J., Recasens, D. & Romero, J. (eds.), Proceedings of the 15th International Congress of Phonetic Sciences, pp. 727-730.
Aslin, R.N. 1989. Discrimination of frequency transitions by human infants. Journal of the Acoustical Society of America, 86: 582-590.
Burnham, D., Tyler, M., & Horlyck, S. 2002. Periods of speech perception development and their vestiges in adulthood. In Burmeister, Piske & Rohde (eds.), An Integrated View of Language Development (Papers in Honor of Henning Wode). Trier: WVT Wissenschaftlicher Verlag, pp. 281-300.
Carré, R., & Mrayati, M. 1991. Vowel-vowel trajectories and region modeling. Journal of Phonetics, 19: 433-443.
Carré, R., Serniclaes, W., & Marsico, E. 2003. Formant transition duration versus prevoicing duration in voiced stop identification. In Solé, M.-J., Recasens, D. & Romero, J. (eds.), Proceedings of the 15th International Congress of Phonetic Sciences, pp. 415-418.
Dart, S.N. 1998. Comparing French and English coronal consonant articulation. Journal of Phonetics, 26: 71-94.
Dorman, M.F., Studdert-Kennedy, M., & Raphaël, L.S. 1977. Stop consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues. Perception and Psychophysics, 22: 109-122.
Eimas, P.D. 1974. Auditory and linguistic processing of cues for place of articulation by infants. Perception and Psychophysics, 16: 513-521.
Geng, C., Mády, K., Bogliotti, C., Messaoud-Galusi, S., Medina, V., & Serniclaes, W. 2005. Do palatal consonants correspond to the fourth category in the perceptual F2-F3 space? In V. Hazan & P. Iverson (eds.), Proceedings of the ISCA Workshop on Plasticity in Speech Perception, London, June 15-17 2005, pp. 219-222.
Geng, C., & Mooshammer, C. 2004. The Hungarian palatal stop: phonological considerations and phonetic data. ZAS Papers in Linguistics, 37: 221-243.
Gerken, L., & Aslin, R.N. 2005. Thirty years of research on infant speech perception: The legacy of Peter W. Jusczyk. Language Learning and Development, 1: 5-21.
Guenther, F., & Bohland, J. 2002. Learning sound categories: A neural model and supporting experiments. Acoustical Science & Technology, 23: 213-220.
Harnad, S. 1987. Categorical perception: the groundwork of cognition. New York: Cambridge University Press.
Hoonhorst, I., Colin, C., Deltenre, P., Radeau, M., & Serniclaes, W. 2006. Emergence of a language specific boundary in perilinguistic infants. Early Language Development and Disorders, Latsis Colloquium of the University of Geneva, Program and Abstracts, 45.
Jakobson, R., Fant, G., & Halle, M. 1952. Preliminaries to Speech Analysis. Cambridge: MIT Press.
Jusczyk, P.W. 1997. The discovery of spoken language. Cambridge, MA: MIT Press, Bradford Books.
Keating, P. 1988. Palatals as complex segments: X-ray evidence. UCLA Working Papers in Phonetics, 69: 77-91.
Kuhl, P.K. 2004. Early language acquisition: cracking the speech code. Nature Reviews Neuroscience, 5: 831-843.
Kuhl, P.K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., & Iverson, P. 2006. Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9: F13-F21.
Ladefoged, P., & Maddieson, I. 1996. The sounds of the world's languages. Cambridge, MA: Blackwell.
Lasky, R.E., Syrdal-Lasky, A., & Klein, R.E. 1975. VOT discrimination by four to six and a half month-old infants from Spanish environments. Journal of Experimental Child Psychology, 20: 215-225.
Liberman, A.M. 1998. Three questions for a theory of speech. The Journal of the Acoustical Society of America, 103: 3024.
Liljencrants, J., & Lindblom, B. 1972. Numerical simulation of vowel quality systems: the role of perceptual contrast. Language, 48: 839-862.
Lisker, L., & Abramson, A.S. 1964. A cross-language study of voicing in initial stops: acoustical measurements. Word, 20: 384-422.
Maddieson, I. 2003. Phonological typology in geographical perspective. In Solé, M.-J., Recasens, D. & Romero, J. (eds.), Proceedings of the 15th International Congress of Phonetic Sciences, pp. 719-722.
Masin, S.C. 1993. Some philosophical observations on perceptual science. In S. Masin (ed.), Foundations of perceptual theory. Amsterdam: Elsevier, pp. 43-73.
Massaro, D.W. 1987. Categorical partition: A fuzzy logical model of categorization behavior. In S. Harnad (ed.), Categorical perception: the groundwork of cognition. New York: Cambridge University Press, pp. 254-286.
Ohala, J.J. 1983. The origin of sound patterns in vocal-tract constraints. In P.F. MacNeilage (ed.), The Production of Speech. New York: Springer, pp. 189-216.
Pegg, J.E., & Werker, J.F. 1997. Adult and infant perception of two English phones. The Journal of the Acoustical Society of America, 102: 3742-3753.
Repp, B.H. 1979. Relative amplitude of aspiration noise as a voicing cue for syllable-initial stop consonants. Language and Speech, 22: 173-189.
Repp, B.H. 1982. Phonetic trading relations and context effects: New experimental evidence for a speech mode of perception. Psychological Bulletin, 92: 81-110.
Serniclaes, W. 2000. La perception de la parole. In P. Escudier & J.L. Schwartz (eds.), La parole: Des modèles cognitifs aux machines communicantes. Paris: Hermès Science Publications, pp. 159-190.
Serniclaes, W., & Carré, R. 2002. Contextual effects in the perception of place of articulation: a rotational hypothesis. In J.L.H. Hansen & B. Pellom (eds.), Proceedings of the 7th International Conference on Spoken Language Processing, pp. 1673-1676.
Serniclaes, W., Bogliotti, C., & Carré, R. 2003. Perception of consonant place of articulation: phonological categories meet natural boundaries. In Solé, M.-J., Recasens, D. & Romero, J. (eds.), Proceedings of the 15th International Congress of Phonetic Sciences, pp. 391-394.
Siptár, P., & Törkenczy, M. 2000. The phonology of Hungarian. Oxford: Oxford University Press.
Snowdon, C.T. 1987. A naturalistic view of categorical perception. In S. Harnad (ed.), Categorical perception: the groundwork of cognition. New York: Cambridge University Press, pp. 332-354.
Sussman, H.M., Fruchter, D., Hilbert, J., & Sirosh, J. 1998. Linear correlates in the speech signal: The orderly output constraint. Behavioral and Brain Sciences, 21: 241-259.
Vihman, M.M. 1996. Phonological development: The origins of language in the child. Cambridge, MA: Blackwell.
Werker, J.F., & Tees, R.C. 1984a. Cross-language speech perception: evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7: 49-63.
Werker, J.F., & Tees, R.C. 1984b. Phonemic and phonetic factors in adult cross-language speech perception. The Journal of the Acoustical Society of America, 75: 1866-1878.
The complexity of phonetic features' organisation in reading

Nathalie Bedoin and Sonia Krifi

1. Introduction

The goal of the experimental work described here is twofold. First, we investigate the contribution of sub-phonemic knowledge to the phonological code in reading, using feature similarity as a manipulated factor in priming experiments. Second, we intend to show that the sensitivity to sub-phonemic similarity is not directly determined by the number of shared phonetic features, but is more complex and depends on the type of phonetic feature. An implicit hierarchical organisation of phonetic feature types is assumed to affect the time-course of priming effects for sequentially displayed printed stimuli, and to guide similarity judgements in a metalinguistic task.
2. Sub-phonemic effects in lexical processing

Lexical activation in speech decoding has been described as a gradual process, and a phoneme-based evaluation metric appears insufficient to account for the results of various studies on mismatches in lexical access (McQueen, Dahan, and Cutler 2003). Consequently, sub-phonemic information has been argued to modulate lexical activation (McMurray, Tanenhaus, and Aslin 2002; Blumstein 2004), and a few lexical processing models include a level in which phonetic features are mapped onto lexical forms (Marslen-Wilson and Warren 1994). Indeed, semantic facilitative priming effects have been recorded in Dutch for target words preceded by a pseudo-word that differed from an associated prime word by only one phoneme, provided that the difference did not exceed one or two phonetic features (Marslen-Wilson, Moss, and van Halen 1996). Similarly, facilitative priming effects in a cross-modal paradigm have been shown to be restricted to pairs differing by one or two, but not three, phonetic features in English (e.g., tennis was better processed after zervice, but not after gervice) (Connine, Blasko, and Titone 1993).
Additionally, the lexical advantage classically observed for phoneme monitoring decreased if the target was embedded within a pseudo-word differing from a word by one phonetic feature, while it completely disappeared if the pseudo-word differed by additional features (Connine, Titone, Deelman, and Blasko 1997). Therefore, the mere evaluation of phonemic similarity fails to explain such effects, and the number of phonetic features shared between the signal and a lexical unit is critical to lexical access.
Featural similarity effects have been more directly assessed by manipulating prime-target sub-phonemic similarity: in phonetic priming, prime and target share many phonetic features but no entire phoneme, whereas they share at least one of their constituent phonemes in phonological priming. Opposite effects have been observed for phonetic and phonological priming. Auditory perception of words is improved if prime and target share one phoneme (phonological priming, Slowiaczek, Nusbaum, and Pisoni 1987), while inhibitory effects are observed if they are phonetically similar without any shared phoneme (phonetic priming, Goldinger, Luce, and Pisoni 1989; Goldinger, Luce, Pisoni, and Marcario 1992; Luce, Goldinger, Auer, and Vitevitch 2000). To explain this inhibitory phonetic priming effect, the authors assumed separate levels of representation for features, phonemes, and words, as described in interactive-activation models of word recognition (McClelland and Rumelhart 1981). Excitatory activation is supposed to pass between levels, while inhibitory lateral connections may be present among nodes within each level. In priming experiments, lateral inhibition at the phonemic level would suppress competitors of the phonemes identified in the prime (i.e., phonemes that share many phonetic features with the phonemes of the prime). As a consequence, a high phonetic similarity between prime and target would impair the identification of the phonemes in the target, as was observed in Goldinger et al.'s experiments (1989, 1992). Interestingly, speech production latencies also increased if the onsets of visual prime and target shared phonetic features (Rogers and Storkel 1998). Inhibitory effects of phonetic similarity in both speech perception and speech production led us to investigate the existence of analogous effects in reading, since the involvement of phonological knowledge has been evidenced in the processing of printed stimuli.
3. Phonological knowledge in printed word recognition

Although the role of visual and orthographic information is crucial in reading, a great deal of research has also concerned the involvement of phonological knowledge in printed word recognition. Accumulated data argue for the fast activation of phonemic knowledge, which contributes to written word recognition in skilled readers (for reviews, Berent and Perfetti 1995; Frost 1998). Some models propose that lexical entries can be activated only on the basis of phonological representations (Van Orden's Verification Model, 1987). Other models assume that sub-lexical phonological units activate the word meaning in parallel with orthographic units (Ferrand and Grainger 1992, 1994), or participate in word recognition via bidirectional relations in the non-lexical route (i.e., the Grapheme-Phoneme Correspondences system in the Dual Route Cascaded Model, Coltheart, Rastle, Perry, Langdon, and Ziegler 2001). Finally, some models rule out the notion of independent routes of lexical access and describe a coherent visual-phonological dynamic system based on self-consistent relations between three families of interdependent letter-nodes, phoneme-nodes, and semantic-nodes (Bosman and Van Orden 1997; Van Orden and Goldinger 1994; Van Orden, Jansen op de Haar, and Bosman 1997; Van Orden, Johnston, and Hale 1988). However, the strength of the connections between node families is assumed to depend on the consistency of relations, which may be higher between letter-nodes and phoneme-nodes than between semantic-nodes and the two other node families. Indeed, although languages such as English and French have a phonologically "deep" written system (in comparison to German, for instance, which has more consistent grapheme-to-phoneme and phoneme-to-grapheme conversion rules, Ziegler and Jacobs 1995), the relations between letters and phonemes in alphabetic languages still support the strongest bidirectional correlations. Therefore, activation would feed forward from letter-nodes to phoneme-nodes, which would in turn feed activation back to letter-nodes. The resonance emerging between letter-nodes and phoneme-nodes would provide the rapid and efficient selection of a combination of coherent nodes, which would explain why phonology might supply powerful constraints on printed word recognition.
Despite the involvement of phonological knowledge in favour of rapid recognition of printed words, various experimental data show that its inescapable role may sometimes be detrimental to performance in reading tasks. On the contrary, optional and/or controlled involvement of
phonological knowledge would preclude such negative effects. For instance, homophony has been shown to increase error rates in semantic categorisation tasks (Van Orden 1987; Van Orden, Johnston, and Hale 1988; Peter and Turvey 1994), in semantic relatedness decision (Luo, Johnson, and Gallo 1998), in proofreading (Bosman and de Groot 1996; Sparrow and Miellet 2002), and in semantic relatedness judgement (Lesch and Pollatsek 1998). Phonological processing also leads readers to misleading responses in letter detection tasks. More false alarms are made in detecting the letter "i" in a target where it is absent (brane) but whose homophone (brain) contains "i" (Ziegler and Jacobs 1995). Additionally, readers fail to detect "i" in a visual stimulus if this stimulus has a homophone which does not contain "i" (Ziegler, Van Orden, and Jacobs 1997). The difficulty of preventing phonological knowledge from being entailed in reading is also evidenced by orthographic-phonological regularity effects obtained in lexical decision, even though the list included many pseudo-homophone foils to discourage the involvement of phonology (Gibbs and Van Orden 1998).
Further lines of evidence supporting the strong impact of phonological constraints in reading come from phonologically mediated semantic priming experiments. Written word recognition and naming improve if the target is preceded by the homophone of a semantically related word. Facilitation in naming has been observed with a 100 msec SOA in English (Lesch and Pollatsek 1993; Lukatela, Lukatela, and Turvey 1993; Lukatela and Turvey 1991), and in lexical decision in French regardless of whether the prime is a word or a pseudo-word (Bedoin 1995). Additionally, phonological priming effects have also provided convincing data. Target recognition is improved by homophonic or phonologically similar word or pseudo-word primes in Serbo-Croatian, Chinese, Dutch, English, and French (Berent 1997; Brysbaert 2001; Ferrand and Grainger 1992, 1994; Grainger and Ferrand 1996; Lukatela and Turvey 1994; Perfetti and Bell 1991; Perfetti and Zhang 1991; Rayner, Sereno, Lesch and Pollatsek 1995).
Prime-target phonological similarity effects are generally investigated using pairs of stimuli sharing the complete phonological form, or at least the onset, the rime or several phonemes, based on the implicit idea that the early phonological code in printed word recognition is coarse-grained. However, the data presented in the next part indicate that it is fine enough to involve sub-phonemic information.
4. Sub-phonemic effects in printed word recognition

We have already conducted a series of priming experiments with French skilled adult readers to assess their sensitivity to voicing similarity in reading (Bedoin 1998). Prime and target differed only by their initial consonant, and the two onsets either shared voicing or not. Adult skilled readers were invited to read silently and to make a lexical decision on each target by pressing one of two keys. Response latencies were significantly longer when there was an additional voicing similarity between prime and target, whatever the stimulus onset asynchrony (SOA = 33, 66, 100 msec) and the frequency and lexical status of the prime (word or pseudo-word). The detrimental effect of additional voicing similarity in printed target words has been confirmed with a 33 msec SOA in another lexical-decision experiment (Bedoin & Chavand 2000). Taken together, these data argue for the contribution of sub-phonemic knowledge to the phonological code in reading and for the rapid extraction of voicing.
We have not yet systematically investigated manner and place similarity effects in reading, but preliminary results suggest that, contrary to voicing, the influence of place and manner similarity does not produce very consistent effects. Indeed, we sometimes observed that place or manner similarity produced a facilitative (and not negative) priming effect, which varied with the SOA. Facilitative phonetic priming effects in reading have also been observed by Lukatela, Eaton and Turvey (2001) with a 57 msec SOA. In their high-similarity condition, prime and target differed only by voicing, while they additionally differed by place or manner in the low-similarity condition. The authors thus assessed place or manner similarity effects between printed primes and targets, and they observed improved performance in case of place or manner prime-target similarity.
To account for facilitative and negative effects of sub-phonemic similarity, we proposed that two mechanisms are involved in phonetic priming (Bedoin 2003). On the one hand, potential phoneme candidates may be activated before the identification of a letter is completed, using phonetic feature detectors. If a visual mask interrupts the stimulus processing at this moment, the activated phonetic feature detectors are assumed to reinforce consistent phoneme detectors via bottom-up excitatory relations to identify the corresponding letter. If the SOA is very short, this bottom-up excitatory mechanism may still be in progress when the subsequent stimulus appears, providing facilitative effects in case of high phonetic similarity between
prime and target. Although such an effect has not been observed with voicing similarity manipulation in adults, it could be expected with feature similarity based on other phonetic feature types, especially in case of very short SOAs. This bottom-up mechanism could account for the facilitative priming effect observed by Lukatela et al. (2001) in case of place or manner similarity.
On the other hand, phoneme detectors are assumed to be linked to each other via lateral inhibitory relations whose weight is proportional to the phonetic similarity between phonemes (Bedoin 2003). Therefore, within-level competition would occur between overlapping phoneme candidates, which would impair the identification of phonemes if the target is preceded by a phonetically similar prime. This mechanism would be triggered slightly later than the inter-level phonetic features-to-phonemes relations. This view is consistent with models including within-level competition due to inhibitory links between overlapping candidates, such as the connectionist Interactive Activation Model (McClelland and Rumelhart 1981), the Neighborhood Activation Model (Luce, Pisoni, and Goldinger 1990), and the TRACE Model (McClelland and Elman 1986). Such a mechanism could account for the negative effect of phonetic similarity observed in our previous priming experiments (Bedoin 1998; Bedoin and Chavand 2000). Similarly, the inhibitory effect of a prime upon a phonetically similar target has been recorded in speech perception, and it has been accounted for by positing inhibitory between-phoneme connections (Goldinger et al. 1992). Our results argue for the involvement of such connections in reading. For example, if the voiceless feature value is activated from the initial consonant of the prime (e.g., /p/ in passe), feature-to-phoneme connections may preactivate voiceless phonemes. Once the level of activation for /p/ exceeds a certain threshold, lateral inhibitory relations between /p/ and other voiceless phonemes may inhibit the competitors of /p/, such as /t/, which would favour the emergence of /p/ as the better candidate. As a consequence, /t/ would be more difficult to identify as a subsequent target than /d/, for instance.
Results obtained in backward-masking experiments provide support for this interpretation (Bedoin and Chavand 2000). In such experiments, participants were invited to recall a briefly presented visual target that had been immediately replaced (masked) by another printed stimulus. This task is difficult because the target processing is disrupted by the mask. However, if the mask processing is impaired, it is expected to produce less masking effect on the target. Consequently, because of lateral inhibitory
relations, a mask preceded by a phonetically similar target may impair the target recall to a lesser extent than a mask that shares fewer phonetic features with the preceding target. This difference has indeed been observed in an experiment displaying the target first (e.g., TYPE), immediately replaced with a mask that differed or not by voicing (e.g., zyve versus syfe). In sum, identity in voicing between sequentially presented written stimuli reduces performance on the second item (e.g., prime-target paradigm), while it improves performance on the first one (e.g., target-mask paradigm).
This opposition has been confirmed in a letter identification task, where phonetic priming and masking effects were assessed within a single printed stimulus. Subjects were briefly presented with a C1VC2V pseudo-word for 50 msec (adult readers) or 85 msec (average-reading and dyslexic children). Then the target letter appeared, printed in another case, and subjects had to decide whether it was present or not in the stimulus. In these experiments, decisions were more rapid for C1 than for C2, suggesting that letter identification was achieved at different rates according to position in the printed stimulus. As expected, voicing similarity between consonants was detrimental to C2 identification, but it increased performance for C1 in adults and in third-graders (Krifi, Bedoin, and Mérigot 2003). However, a reversed pattern of results was observed in second-graders and in dyslexic children, with voicing similarity improving C2 identification and decreasing C1 identification, which is consistent with the involvement of excitatory phonetic feature-phoneme connections in reading, but a lack of inhibitory inter-phonemic relations in such readers (Krifi, Bedoin, and Herbillon 2003).
Taken together, these results argue for the involvement of a sub-lexical phonological level of knowledge, composed of a complex (but organised) set of phoneme detectors in turn relying on sub-phonemic detectors. We propose that the weight of the inhibitory relations between phoneme detectors depends on the phonetic similarity between phonemes, defined in terms of phonetic features. However, we assume that the lateral inhibition strength between phonemes does not depend directly on the number of shared phonetic features. Different phonetic features could have distinct weights, or participate in lateral inhibition at different rates. To address this issue, we conducted lexical decision experiments to assess the impact of manner and place of articulation similarity. A toy illustration of the assumed inhibitory mechanism is sketched below.
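The following minimal sketch in Python renders the /p/-primes-/t/ example as a toy computation. The feature codings and the inhibition weight are illustrative assumptions, not the authors' model parameters; the only claim it instantiates is that lateral inhibition grows with the number of shared phonetic features.

```python
# Toy feature codings: (voicing, manner, place), each as a small integer.
PHONEMES = {
    "p": (0, 0, 0),  # voiceless, plosive, bilabial
    "b": (1, 0, 0),  # voiced, plosive, bilabial
    "t": (0, 0, 1),  # voiceless, plosive, dental
    "d": (1, 0, 1),  # voiced, plosive, dental
}

def shared_features(a, b):
    return sum(x == y for x, y in zip(PHONEMES[a], PHONEMES[b]))

def target_activation(prime, target, inhibition=0.2):
    """Residual activation of the target phoneme after a prime:
    lateral inhibition is proportional to the shared features."""
    baseline = 1.0
    return baseline - inhibition * shared_features(prime, target)

# /t/ after /p/ (shared voicing and manner) is harder to identify
# than /d/ after /p/ (shared manner only):
print(target_activation("p", "t"))  # 0.6 - more inhibited
print(target_activation("p", "d"))  # 0.8 - less inhibited
```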
5. Types of phonetic features in reading: voicing, manner and place of articulation

Many current phonological theories consider that segments are organised in terms of phonetic features, which may be characterized by an internal structure (Clements 1985). Feature values that are mutually interdependent may form a feature type, and each type may be represented on a separate tier (e.g., laryngeal feature, nasal feature, manner feature, place of articulation feature). Evidence from aphasia is generally consistent with this view, with phoneme substitution errors reflecting changes within a single tier rather than across tiers (Blumstein 1990). Indeed, speech production errors may involve only voicing, only manner, or only place, and are less likely to involve both place and voicing or manner and voicing. Similarly, difficulties experienced by aphasics in discriminating phonemes have been shown to be restricted to phonemes differing in place of articulation and not in voicing (Miceli, Caltagirone, Gainotti, and Payer-Rigo 1978; Oscar-Berman, Zurif, and Blumstein 1975). Conversely, a selective disturbance of voicing contrast discrimination has been described in another patient (Caplan and Aydelott Utman 1994). This double dissociation and the stability of the selective impairments over time and between tests (Gow and Caplan 1996) confirm the relative independence of place and voicing as phonetic types.
Additionally, a clear distinction is assumed between articulator-free features (manner and sonorance) and articulator-bound features (place and voicing): in models of phoneme identification in connected speech, it has been assumed that the identification of articulator-free features provides the basis for the subsequent discrimination of articulator-bound features, since they establish regions in the signal where acoustic evidence for the articulator-bound features can be found (Stevens 2002). This is in accordance with an advantage for the discrimination of articulator-free over articulator-bound features observed in aphasic patients (Gow and Caplan 1996).
Finally, neurophysiological and neuropsychological data converge in showing differences in the hemispheric functional asymmetry associated with place and voicing representation and processing. While the left hemisphere (LH) is sensitive to the place of articulation feature, voicing appears to be more strongly represented and processed in the right hemisphere (RH) than in the LH (Cohen and Segalowitz 1990a; Rebattel and Bedoin 2001). An RH advantage for the acquisition of a non-native voicing distinction in adults (Cohen and Segalowitz 1990b), auditory evoked potentials recorded over the RH and LH during voicing contrast
discrimination (Molfese 1978; Molfese and Molfese 1988; Segalowitz and Cohen 1989), and performances of neurologically impaired patients (Miceli et al. 1978; Oscar-Berman et al. 1975) converge in supporting the view that the RH plays a special role in the processing of voicing (Simos, Molfese, and Brenden 1997).
Therefore, there is evidence from various domains that phonetic features pattern in natural classes, but their potential hierarchical organisation is still in question. Manner of articulation has been proposed to be at the prominent level, as it defines the representation of a segment within a syllable (Van der Hulst 2005). Estimates of psychological distance between consonants have been derived from similarity ratings performed by listeners on spoken consonants (Peters 1963). Manner of articulation proved to be the most important auditory dimension, followed by voicing, and subsequently place of articulation. Moreover, articulator-free features, such as manner, have been considered to provide the basis for the subsequent discrimination of articulator-bound features (voicing and place) (Stevens 2002). The advantage for the discrimination of articulator-free over articulator-bound features in aphasic patients provides support for this claim (Gow and Caplan 1996). Additionally, an early sensitivity to sound similarities in sequentially presented syllables has been evidenced in nine-month-old children, who listened longer to lists that embodied some sub-phonemic regularity (Jusczyk, Goodman, and Baumann 1999). Infants exhibited sensitivity to shared manner features occurring at the beginning of syllables, while they were insensitive to place similarity. In addition, Rogers and Storkel (1998) pointed out the negative effect of phonetic similarity between pairs of words on speech production latencies, and showed that manner similarity was the most detrimental factor.
However, the relative importance of manner of articulation has been challenged by data obtained in speech perception showing better preservation of voicing and nasality under noisy listening conditions (Miller and Nicely 1955). Voicing information is still transmitted at signal-to-noise levels 18 dB below those needed for place of articulation. Additionally, voicing and nasality are much less affected by a random masking noise than are the other features. Therefore, voicing has been claimed to be one of the more salient and robust features of English consonants. Nevertheless, the debate remains open, since Wang and Bilger (1973) showed better preservation of manner features under noisy listening conditions, perhaps because of their robust acoustic correlates. The relative importance of voicing
versus place is also equivocal. There is evidence from discrimination tasks that listeners are more sensitive to place (Miceli et al. 1978), whereas metalinguistic tasks requiring listeners to rate the similarity of pairs of consonants found that voicing contributed either equally (Perecman and Kellar 1981) or more to judgement than did place (Peters 1963). Additionally, the hierarchy of phonetic categories may evolve during childhood. Indeed, in a study of the "slip-of-the-tongue" phenomenon, Jaeger (1992) showed that place substitutions were the most frequent errors in both adults and children aged 1;7 to 6;0, but children made fewer voicing errors than adults, which suggests a more important role of voicing as an organisation criterion for them. Taken together, evidence for a prominent status of the manner and voicing phonetic features has been provided, but the comparative importance of these features is not clear and may depend on the task (phoneme identification, discrimination, production, and metalinguistic tasks). Specifically, the structure of the representations used to consciously estimate phonemic similarities in metalinguistic tasks could differ from the organisation of phonetic features and phoneme detectors involved in the first stages of phoneme and letter identification. In our previous experiments, phonetic similarity effects in reading were assessed using voicing similarity; Experiments 1-2 below manipulate phonetic similarity with other phonetic feature types. The time-course of the rapid involvement of phonetic features in directing the lateral inhibitory relations between phoneme detectors could vary with the feature types, and some kind of hierarchical organisation of the feature types could emerge. The first series of new experiments presented in this paper further assessed the role of sub-phonemic units during printed word recognition, and investigated the time course of manner and place of articulation involvement in priming. In Experiments 1a, 1b, and 1c, French readers performed lexical decision on targets that were briefly primed by a printed pseudo-word, with 33 msec-, 66 msec-, and 100 msec-SOA, respectively. Prime and target initial consonants shared either voicing, or voicing and manner, or voicing and place. Since voicing similarity effects had previously been assessed in experiments using prime-target pairs that otherwise differed by one or more other phonetic features, place and manner similarity effects were also tested with prime-target pairs that otherwise differed by another feature (i.e., voicing) in Experiments 2a, 2b and 2c. We addressed the question of whether place or manner prime-target similarity produced negative effects on target processing (like voicing similarity in previous
experiments), which would reflect the involvement of intra-level inhibitory relations. Conversely, better performance in case of place or manner prime-target similarity would be explained by inter-level excitatory relations. It may be the case that phoneme detectors are organised on different tiers, allowing an asynchronous involvement of different phonetic feature types in the process of neighbourhood inhibition. As a consequence, it is hypothesized that additional prime-target phonetic similarity determined by a phonetic feature that does not have temporal priority would not be able to trigger inhibitory relations within the intra-level phonemic organisation in case of a very short SOA, leaving the way for rapid inter-level activations to exert facilitative effects in our experiments. Hence, different patterns of priming effects (facilitative versus negative) observed for different phonetic feature types with short SOAs may provide information about the time-course of phonetic processing in reading and the hierarchical organisation of phonetic features. In Experiment 1, participants performed lexical decision on a printed target that was briefly preceded by a printed prime. In experimental trials, primes consisted of pseudo-words that differed from the target only by the initial consonant. Across three lists, three kinds of primes were paired with each target, so that each condition was tested on the same targets. Each participant processed only one list, to preclude any target repetition.
• VM condition: prime and target (9 words and 9 pseudo-words) differed by place of the initial consonant, but shared Voicing and Manner (e.g., BAME-dame; /bam/-/dam/);
• VP condition: prime and target (9 words and 9 pseudo-words) differed by manner of the initial consonant, but shared Voicing and Place (e.g., ZAME-dame; /zam/-/dam/);
• V condition: prime and target (9 words and 9 pseudo-words) differed by place and manner, but shared Voicing (e.g., VAME-dame; /vam/-/dam/).
In our experiments, we used consonants that were either voiced or voiceless, and either plosive or fricative in terms of manner of articulation. Three values of place of articulation were considered for plosive consonants (bilabial /p, b/; dental /t, d/; velar /k, g/), and three for fricative ones (labiodental /f, v/; dental /s, z/; postalveolar /ʃ, ʒ/). To simplify the coding of place of articulation, the consonants were distributed into three categories: Category 1 was composed of /p, b, f, v/, Category 2 contained /t, d, s, z/, and Category 3 contained /k, g, ʃ, ʒ/.
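To make the condition definitions concrete, here is a minimal sketch in Python (our own illustration, not the authors' materials): each consonant is coded for voicing, manner, and the three-way place category defined above, and a hypothetical helper returns the feature types shared by a prime-target pair of initial consonants.

# Feature coding for the French obstruents used in Experiments 1-2
# (illustrative coding, not the authors' stimulus files).
FEATURES = {
    'p': ('voiceless', 'plosive', 1),   'b': ('voiced', 'plosive', 1),
    't': ('voiceless', 'plosive', 2),   'd': ('voiced', 'plosive', 2),
    'k': ('voiceless', 'plosive', 3),   'g': ('voiced', 'plosive', 3),
    'f': ('voiceless', 'fricative', 1), 'v': ('voiced', 'fricative', 1),
    's': ('voiceless', 'fricative', 2), 'z': ('voiced', 'fricative', 2),
    'ʃ': ('voiceless', 'fricative', 3), 'ʒ': ('voiced', 'fricative', 3),
}

def shared_types(c1, c2):
    """Return the feature types shared by two consonants."""
    labels = ('voicing', 'manner', 'place')
    return [lab for lab, a, b in zip(labels, FEATURES[c1], FEATURES[c2])
            if a == b]

print(shared_types('b', 'd'))  # VM condition: ['voicing', 'manner']
print(shared_types('z', 'd'))  # VP condition: ['voicing', 'place']
print(shared_types('v', 'd'))  # V condition:  ['voicing']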
Target words contained 1 or 2 syllables (mean = 1.78), 4-6 letters (mean = 5.26), and 3-5 phonemes (mean = 4.44). Pseudo-words used as targets or primes were phonotactically legal sequences. Primes had no lexical neighbour more frequent than the target. In addition, 216 filler prime-target pairs without overlapping phonemes were included. In Experiment 1a (N = 27), the SOA lasted 33 msec, while a 66 msec-SOA and a 100 msec-SOA were used in Experiment 1b (N = 27) and Experiment 1c (N = 27), respectively. The 81 participants were native French speakers, and each took part in only one version of the experiment. They were university students and had normal or corrected-to-normal vision, with no known speech or hearing disorder. The design of Experiment 2 was quite similar, and three versions were run to test phonetic similarity effects with a 33 msec-SOA (Experiment 2a, N = 27), a 66 msec-SOA (Experiment 2b, N = 27), and a 100 msec-SOA (Experiment 2c, N = 27). No participant performed both Experiments 1 and 2. Three priming conditions were used:
• PM condition: prime and target differed by voicing of the initial consonant, but shared Place and Manner (e.g., FAGUE-vague; /fag/-/vag/);
• M condition: prime and target differed by voicing and place, but shared Manner (e.g., SAGUE-vague; /sag/-/vag/);
• P condition: prime and target differed by voicing and manner, but shared Place (e.g., PAGUE-vague; /pag/-/vag/).
Contrary to voicing similarity, which decreased lexical decision performance in previous priming experiments regardless of the SOA, place and manner similarity effects, when statistically significant, were facilitative with a 33 msec-SOA, while they were detrimental to performance with a 66 msec- or a 100 msec-SOA. More precisely, when a 33 msec-SOA was used in Experiment 1a, lexical decision latencies were longer if the initial consonants of prime and target shared only voicing rather than both voicing and manner, F(1, 48) = 5.42, p = .024, or voicing and place, F(1, 48) = 4.93, p = .031 (see Figure 1, left). Data recorded in Experiment 2a were consistent with the facilitative priming effect of additional manner similarity, since responses tended to be shorter in PM condition than in P condition for word targets, F(1, 48) = 4.00, p = .051, and for pseudo-word targets, F(1, 48) = 5.04, p = .029. Additionally, responses were more rapid in case of manner similarity (M condition) than in case of place similarity (P condition) for word targets, F(1, 48) = 5.05, p = .028 (Figure 1, right). Error rates did not show any significant effect in Experiments 1a and 2a (Table 1).
Table 1. Percentages of errors in the six priming experiments.

                             VM       VP       V
Experiment 1a  Word          4.93%    4.52%    5.34%
               Pseudo-Word   5.34%    4.93%    3.41%
Experiment 1b  Word          7.41%    7.00%    3.70%
               Pseudo-Word   4.52%    3.70%    0.82%
Experiment 1c  Word          4.93%    4.52%    5.34%
               Pseudo-Word   5.34%    8.34%    0.41%

                             PM       M        P
Experiment 2a  Word          4.93%    1.64%    3.29%
               Pseudo-Word   2.47%    1.23%    4.52%
Experiment 2b  Word          3.29%    3.70%    5.18%
               Pseudo-Word   3.29%    1.23%    2.06%
Experiment 2c  Word          3.70%    2.88%    2.88%
               Pseudo-Word   3.70%    4.11%    2.88%
Taken together, the results obtained with a short SOA in Experiments 1-2 suggest that sub-phonemic similarity can produce facilitative priming effects in reading, since manner similarity, and to a lesser extent place similarity, resulted in shorter response latencies. Inter-level excitatory connections can account for such facilitative priming effects produced by manner or place similarity. Therefore, during the first 33 msec of printed stimulus processing, phonetic feature knowledge is involved, but at different rates according to the feature types. At this processing stage, lateral inhibitory intra-level relations can already account for voicing similarity effects (Bedoin 1998, 2003), while only rapid inter-level excitatory feature-to-phoneme links can account for manner and place similarity effects. Therefore, it seems that during the first 33 msec of print processing, the time-course of lateral inhibitory relations differs with the phonetic feature types, and voicing may have some temporal priority with respect to this aspect of organisation. In addition, manner similarity provides a greater (or more systematic) facilitative effect than place similarity at this step of printed stimulus processing.
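The two mechanisms invoked here can be made concrete with a toy sketch in Python (the update rule and parameter values are our own assumptions, not an implemented model from this chapter): phonetic feature units send fast inter-level excitation to compatible phoneme detectors, while phoneme detectors sharing features exchange slower intra-level lateral inhibition, whose onset time depends on the feature type.

# Toy sketch of the two mechanisms (illustrative values only).
EXC = 0.10   # inter-level feature-to-phoneme excitation per shared feature
INH = 0.15   # intra-level lateral inhibition per shared feature

def step(prime_act, target_act, n_shared, t_msec, inhib_onset):
    """One update of the prime-activated and target phoneme detectors.

    `inhib_onset` encodes the claim that lateral inhibition starts
    earlier for voicing (around 33 msec) than for manner or place
    (around 66 msec) of print processing.
    """
    target_act += EXC * n_shared             # pre-activation: facilitative
    if t_msec >= inhib_onset:                # lateral competition: negative
        prime_act, target_act = (prime_act - INH * n_shared * target_act,
                                 target_act - INH * n_shared * prime_act)
    return prime_act, target_act

# Manner/place similarity: facilitation only before inhibition kicks in.
p_act, t_act = 0.6, 0.0
for ms in (0, 33, 66, 99):
    p_act, t_act = step(p_act, t_act, n_shared=1, t_msec=ms, inhib_onset=66)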
Figure 1. Mean response latencies and standard errors recorded with a 33 msec-SOA in Experiment 1a (left panel) and Experiment 2a (right panel).
Figure 2. Mean response latencies and standard errors recorded with a 66 msec-SOA in Experiment 1b (left panel) and Experiment 2b (right panel).
With longer SOAs, the pattern of results was more consistent with that obtained for voicing similarity in previous experiments. Indeed, when feature similarity effects were significant, performance always decreased, regardless of whether the additional phonetic similarity involved place or manner. With the 66 msec-SOA (Experiment 1b), more errors were made in VM condition than in V condition, F(1, 48) = 7.31, p = .010, and in VP than in V condition, F(1, 48) = 5.08, p = .029, reflecting the negative effect of manner and place similarity. The manner similarity negative priming effect was also significant for latencies, which were longer in VM condition than in V condition, F(1, 48) = 6.80, p = .012 (Figure 2, left). In Experiment 2b, increased response times were observed in case of place similarity (PM - M difference), F(1, 48) = 4.57, p = .038 (Figure 2, right), and no effect reached significance for error rates.
Figure 3. Mean response latencies and standard errors recorded with a 100 msec-SOA in Experiment 1c (left panel) and Experiment 2c (right panel).
Finally, feature similarity priming was weaker with a 100 msec-SOA. Consistent with the data observed in Experiment 1b, error rates were lower in V condition than in VM and VP conditions in Experiment 1c, but the effects did not reach significance. The only significant effect on latencies was obtained in Experiment 2c (Figure 3): place similarity decreased performance for words, since response times were longer in PM condition than in M condition, F(1, 48) = 4.82, p = .030. Together with data from previous experiments assessing voicing similarity effects, these results suggest that phonetic priming in reading is not a phenomenon based merely on the number of shared phonetic features. Complexity in phonetic priming may be due to differences in the status of phonetic feature types and in the rate at which each type of feature participates in lateral inhibitory relations between phonemes. Our data showed that every feature prime-target similarity effect reaching significance with a 66 msec-SOA decreased performance, in line with the lateral inhibitory relations assumed to organize phoneme detectors. It can be noticed that a negative place similarity effect remained with a 100 msec-SOA, whereas manner similarity no longer influenced lexical decision. Phonetic similarity effects recorded with a 33 msec-SOA also differed according to the phonetic feature type, which is of major importance regarding the time-course of phonetic processing in reading and provides some information about the hierarchical organisation of phonetic features. With this very short SOA, voicing similarity already decreased performance, which can be accounted for by lateral inhibitory relations between phoneme detectors (Bedoin 1998). On the contrary, the present data show that manner similarity, and to a lesser extent place similarity, facilitated target processing, which can be interpreted as the result of pre-activation based on bottom-up excitatory relations. Therefore, it seems that (1) place and manner engage a mechanism based on lateral inhibitory relations between phoneme detectors at around 66 msec of print processing, whereas voicing triggers this mechanism as early as 33 msec after stimulus onset (Bedoin 1998, 2003); (2) the role of place similarity in this mechanism is still observed with a 100 msec-SOA, which is not the case for manner similarity; and (3) manner is of major importance in the inter-level excitatory mechanism during the first 33 msec of print processing (see Table 2).
Table 2. Summary of sub-phonemic similarity effects in Experiments 1a, 2a, 1b, 2b, 1c and 2c (+ and - respectively for facilitative and negative effects).

Phonetic feature similarity          SOA
                          33 msec    66 msec    100 msec
Manner                    ++         --
Place                     +          --         -
The facilitative priming effect observed in case of manner or place similarity with a 33 msec-SOA is in line with the facilitative sub-phonemic priming effects observed by Lukatela, Eaton and Turvey (2001) with a short SOA (57 msec). These authors investigated phonetic prime-target similarity effects using conditions that differed in place and/or manner prime-target similarity. Thus, the facilitative priming effect that they recorded with a short SOA is consistent with our data in Experiments 1a and 2a. Therefore, phonetic similarity effects in reading are rather complex, and it appears that the feature types govern the time-course of their involvement within the internal mental organisation of phoneme detectors. According to the present results, the voicing feature enjoys a kind of priority over place and manner within the hierarchical organisation of phonetic feature types, since it seems to be the quickest to direct lateral inhibition among phoneme detectors. The prolonged negative effect of place similarity, in comparison to the manner similarity effect, may be considered alongside data reported by Vallée, Rossato and Rousset (this volume), who show that differing places of articulation for two consonants are favoured within CVC syllables as well as across CV.CV consecutive syllables. It can be noticed that the preference for patterns where the first consonant is a labial and the second one a coronal (i.e., the Labial-Coronal effect, MacNeilage and Davis 2000), and the place similarity avoidance principle (Frisch, Pierrehumbert, and Broe 2004), are also in line with this idea. Negative place similarity priming effects may contribute to the avoidance of place similarity between consonants of consecutive syllables.
6. Syllable matching: evidence for a hierarchical organisation of phonetic feature types from a metalinguistic task

Experiments 1-2 suggested that the time-course of some phonetic processing in reading depends on the type of phonetic features, and that the inhibitory effect of voicing similarity occurs first. Does this mean that voicing is the feature type most readily available to readers in the conceptualization of phonological information? Since some data emphasize the prominent role of manner as a phonetic feature type (Gow and Caplan 1996; Jusczyk, Goodman, and Baumann 1999; Peters 1963; Rogers and Storkel 1998; Stevens 2002; Van der Hulst 2005; Wang and Bilger 1973) but others emphasize the role of voicing (Jaeger 1992; Miller and Nicely 1955; Peters 1963), we assumed that adults rely strongly on voicing but also on manner to estimate the similarity between consonants. Therefore, our second goal is to provide evidence for an implicit hierarchical representational structure of phonetic feature types from a metalinguistic task. Classification data were recorded in forced-choice syllable matching experiments in adults and children to investigate how the mental organisation of phonemes is progressively established in normally developing readers (Krifi, Bedoin, and Herbillon 2005). In Experiments 3a-b, 4a-b and 5a-b, we assessed the relative weight implicitly granted to voicing, manner and place of articulation in guiding responses in a forced-choice syllable matching task. Systematic biases in decisions would argue for an implicit hierarchy of phonetic feature types, since the proposed competitors did not differ by the number of shared feature values, but by the type of shared features. Twenty-four students participated in Experiments 3a, 4a, and 5a (visual versions), and 17 others were tested in the audio-visual versions (Experiments 3b, 4b, and 5b). All were native French speakers, had normal or corrected-to-normal vision, and had no known speech or hearing disorder. Each trial was composed of one printed CV target syllable with two other CVs simultaneously displayed above it. The three syllables remained on the screen until the subject pressed one of two keys to indicate which of the two response CVs could be paired with the target according to intuitively estimated acoustic similarity. In the audio-visual version, the three printed syllables were additionally heard through headphones. There were 6 voiced consonants (/d/, /b/, /g/, /v/, /z/, /ʒ/) and 6 voiceless ones (/t/, /p/, /k/, /f/, /s/, /ʃ/), and vowels were always /a/ (see Table 3 for examples). As in the previous experiments, plosive and fricative consonants were distributed into three categories regarding place of articulation: Category 1 (/p, b, f, v/), Category 2 (/t, d, s, z/), and Category 3 (/k, g, ʃ, ʒ/).

Table 3. Examples of trials in Experiments 3, 4, and 5. Target syllables are in bold.

Experiment 3         Experiment 4           Experiment 5
Manner vs. Place     Manner vs. Voicing     Voicing vs. Place
ba za ta             da sa pa               ga pa va
Two feature types were pitted against each other in each experiment: manner and place (Experiment 3), manner and voicing (Experiment 4), and place and voicing (Experiment 5). Additionally, we evaluated whether one feature type had priority regardless of its value, which would reflect the consistency of the feature type's hierarchical status. For instance, we tested whether manner similarity was preferred regardless of whether it was represented by plosive or fricative pairs. In Experiment 3, subjects' responses differed from chance with a preference for manner similarity over place similarity in the visual version (67%), t(23) = 2.92, p = .0038, as in the audio-visual one (76%), t(16) = 9.64, p < .0001. Matching was consistently guided by manner similarity, regardless of whether the paired syllables were plosives, t(23) = 2.82, p = .0048 (version a), t(16) = 6.61, p < .0001 (version b), or fricatives, t(23) = 2.73, p = .0060 (version a), t(16) = 10.06, p < .0001 (version b). Data recorded in Experiment 4 confirmed the prominent status of manner as a criterion for evaluating similarity, since choices differed from chance to the advantage of manner (rather than voicing) in the visual version (77%), t(23) = 8.04, p < .0001, as in the audio-visual one (80%), t(16) = 6.08, p = .0001, regardless of whether the phonetic similarity concerned the plosive value, t(23) = 6.43, p < .0001, t(16) = 5.51, p = .0003, or the fricative one, t(23) = 6.24, p < .0001, t(16) = 5.76, p = .0002 (versions a and b, respectively). However, the percentage of manner choices was lower when competition came from two voiced consonants than from two voiceless ones, in version a (difference = 8%, t(23) = 2.80, p = .0079) and in version b (difference = 10%, t(16) = 5.1, p = .0353). In Experiment 5, the pattern of results differed between the visual and the audio-visual versions. Place similarity was preferred to voicing similarity above chance (59%) in the visual version, t(23) = 2.89, p = .0042, which
was confirmed for pairs of consonants whose place of articulation value was of Category 1, t(23) = 3.36, p = .0013, or Category 2, t(23) = 2.00, p = .0284, but not Category 3. Additionally, place similarity was not preferred to voicing similarity when it competed with a pair of voiced consonants. In the audio-visual version, responses did not differ globally from chance, except that voicing similarity was preferred over place similarity for pairs of voiced consonants (62%), t(16) = 2.57, p = .0152. To investigate the progressive organisation of a hierarchy across phonetic feature types, Experiments 3-5 were presented to normally reading children (10 second graders and 10 third graders in the visual version, and 11 third graders in the audio-visual version). Choices never differed from chance in second graders; third graders exhibited a growing sensitivity to manner similarity (57%) over place similarity, since their responses differed from chance in Experiment 3, t(9) = 2.72, p = .0117 (version a), t(10) = 4.00, p = .0013 (version b). However, this effect was restricted to fricative consonants, t(9) = 2.39, p = .0203 (version a), t(10) = 2.86, p = .0085 (version b). When manner similarity was pitted against voicing similarity, third graders matched consonants that shared manner more often than would be expected by chance (56% in version a, t(9) = 2.64, p = .0135; 55% in version b, t(10) = 2.60, p = .0133). However, the prominent status of manner was less pervasive than in adults, since children preferred manner similarity only for plosive consonants in the visual version, t(9) = 2.58, p = .0148, and only for fricative consonants in the audio-visual one, t(10) = 2.10, p = .0310. Finally, when place and voicing similarity were set in competition with one another, third graders' responses did not differ from chance. Taken together, the data recorded in Experiments 3-5 suggest that syllables can be processed in terms of phonetic features not only in speech perception but also in reading. The phonetic feature types shared by two printed CVs are critical in biasing syllable sorting, and the data allow us to draw the main outlines of a hierarchical organisation of these feature types.
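The chance-level analyses reported in this section are one-sample comparisons of per-participant choice proportions against the 50% level. A minimal sketch in Python (the data values are invented for illustration; note also that a directional test would halve the two-tailed p returned by scipy):

import numpy as np
from scipy import stats

# Hypothetical per-participant proportions of manner-based choices
# (24 participants, visual version); values invented for illustration.
manner_choice = np.array([.70, .65, .72, .60, .55, .75, .68, .62, .58, .80,
                          .66, .71, .64, .59, .73, .67, .61, .69, .74, .63,
                          .57, .76, .65, .70])

t_stat, p_two = stats.ttest_1samp(manner_choice, popmean=0.5)
print(f"mean = {manner_choice.mean():.0%}, "
      f"t({manner_choice.size - 1}) = {t_stat:.2f}, p = {p_two:.4f}")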
7. Hierarchies of phonetic feature types in the first steps of printed word recognition and in a metalinguistic task

First of all, this research has confirmed the existence of sub-phonemic priming effects for printed stimuli. The results obtained in Experiments 1-2 with various SOAs are partly in agreement both with the few available data obtained
with phonetic priming experiments in reading (Bedoin 1998, 2003; Krifi et al. 2003, 2005; Lukatela et al. 2001) and with sub-phonemic priming effects in speech processing experiments (Goldinger et al. 1989, 1992; Luce et al. 2000). Additionally, they provide arguments for major differences in status between the phonetic feature types. However, these categories do not seem to be organized in a single hierarchical structure. Indeed, depending on the task requirements, the investigated feature hierarchy can refer either to the weight of feature types in intuitively guided but consciously accessed representations in metalinguistic tasks (Experiments 3-5), or to the time-course of their involvement in rapid, automatic and transient processes taking part in printed word recognition (Experiments 1-2). Taken together, the data presented in this chapter are consistent with the existence of reliable hierarchies, which can nevertheless differ slightly according to the investigated mechanisms. In the first series of experiments, manner and place similarity between prime and target produced either facilitative or negative effects, depending on the processing step. Indeed, manner or place similarity provided facilitative priming effects only with a very brief SOA (33 msec), while the pattern of results was quite different when the SOA lasted 66 msec. This confirmed the facilitative priming effect observed by Lukatela and colleagues (2001) with an SOA (57 msec) that was also shorter than 66 msec. These authors compared conditions in which prime and target differed only by voicing, by voicing and manner, or by voicing and place. The facilitative priming effect obtained with a brief SOA in case of manner and, to a lesser extent, place similarity is in accordance with the prediction of the previously proposed model (Bedoin 2003), according to which the earliest influence of phonetic features extracted from a printed stimulus is based on inter-level feature-to-phoneme-detector links and mainly provides pre-activation of phonemes. In an experiment conducted with second graders who had a normal reading level and with older dyslexic children, we showed that such a facilitative priming effect also occurred for these groups in case of voicing similarity, which could be accounted for by a mere inter-level mechanism (Krifi et al. 2003). However, in previous experiments we showed that voicing similarity produced a negative priming effect in adult good readers and in third graders with a normal reading level, even though the SOA was very brief. Additionally, with a longer SOA, this negative priming effect remained for voicing similarity (Bedoin 1998), and the present paper showed that it appeared for manner and place similarity. Accordingly, another mechanism that also
involves phonetic knowledge may operate in print processing. Indeed, this negative priming effect can be accounted for by an intra-level transient mechanism of lateral inhibition, intended to reduce neighborhood effects between phoneme detectors (Bedoin 2003), consistent with connectionist models (McClelland and Rumelhart 1981). According to the present results, it seems that such a mechanism is dependent on the type of phonetic features: voicing may be able to provide such lateral inhibition after only 33 msec of print processing, whereas manner and place features require more time (at least 66 msec) to play this role. The involvement of each phonetic feature type in reading actually seems to be a question of time: manner and place similarity exert the same negative priming effect as voicing similarity, provided that the SOA is longer. In light of the present results, the phoneme detectors seem to be organized by shared phonetic features, and the operation of this structure may itself depend on time constraints applied to the phonetic feature types. In the second series of experiments, the participants were not constrained by presentation time, and the required evaluation was made consciously, even though intuitive criteria were probably used. The forced-choice syllable matching task may provide information about the internal organisation of feature types; the possible priority of one type was evaluated not only on the basis of the expressed preferences, but also took into account the consistency of these preferences. For instance, the prominent status of manner in this task is shown by the higher frequency of syllable matching according to manner than to place similarity (Experiment 3) or voicing similarity (Experiment 4), but it is also supported by the maintenance of this preference regardless of the manner value. Indeed, it was as apparent for a pair of plosive consonants as for a pair of fricative ones. This bias in favour of manner similarity was observed both in the visual and the audio-visual versions, and slightly increased in the latter. The prominent impact of manner over place and voicing similarity is in accordance with models where this phonetic category is assumed to hold the top level, as it defines the representation of a segment within a syllable (Van der Hulst 2005), and as it may provide the basis for the subsequent discrimination of articulator-bound features such as place and voicing (Stevens 2002). Our results are also consistent with other experiments, where manner was shown to be more important than voicing and place in rating the similarity of syllables (Peters 1963). Additionally, an early sensitivity to sub-phonemic similarity based on manner of articulation has been evidenced in infants (Jusczyk, Goodman, and Baumann 1999), and our results show that
the first type of shared phonetic features to emerge as salient for sorting printed syllables by third graders is also manner. However, the major importance of manner for syllable matching is at odds with the very rapid involvement of voicing in the lateral inhibitory relations, which is assumed to underlie the emergence of negative priming effects at a shorter SOA in case of voicing similarity than in case of manner or place similarity. Voicing seems to come first in the time-course of this automatic and transient phonetic mechanism of lateral inhibition in reading. By contrast, manner plays the most important role in metalinguistic tasks designed to investigate how categories are represented. This difference underlines the importance of the paradigms used to investigate the relative weight of feature types, since they may address different kinds of hierarchies: functional ones and representational ones. Although manner has a prominent status and shows good consistency as a category of phonetic features in the metalinguistic task, the effects of manner similarity and place similarity appear to be reduced when they are in competition with similarity based on the shared voiced (but not voiceless) feature value. This result shows that voicing can modulate the evaluation of syllable similarity, which is consistent with data showing its robustness for speech perception in noise (Miller and Nicely 1955), its important role in similarity rating (Peters 1963), and its role in children's productions (Jaeger 1992). Additionally, some data have shown that voicing may be less adversely affected than place by conditions of dichotic competition (Studdert-Kennedy and Shankweiler 1970). Nevertheless, our result also reveals that voicing is less consistent than manner as a similarity criterion: the voiced value represents this category better than the voiceless one in competition with manner. Previous investigations of the relative impact of shared place or voicing in evaluating syllable similarity have provided mixed results. When invited to sort triads of stop consonants into the pairs judged most similar, normal subjects sorted on the basis of place and voicing equally often (Perecman and Kellar 1981). Similarly, the benefit provided by shared phonetic features in processing dichotic stimuli has been reported to be greater for shared place than for shared voicing, which suggests that voicing is less affected by the negative effect of competition in speech perception (Studdert-Kennedy and Shankweiler 1970), but this difference has not been systematically replicated (Oscar-Berman et al. 1975; Studdert-Kennedy, Shankweiler, and Pisoni 1972). In addition, place substitution errors were the most frequent slip-of-the-tongue errors (Jaeger
1992), while listeners were more responsive to place than to voicing in discrimination tasks (Miceli et al. 1978). These mixed results underline the central importance of task requirements in the assessment of phonetic feature hierarchies. In our metalinguistic experiments, the relative importance of place and voicing similarity also remains unresolved. Specifically, pairings were more frequently based on shared place (at least within Category 1 and Category 2) in the visual version, while shared voicing guided responses more clearly, at least for voiced consonants, in the audio-visual version. Additionally, contrary to manner, place and voicing were not consistently represented by their various values. Indeed, subjects favoured place similarity over voicing similarity in the visual version when matching syllables sharing place values of Category 1 or Category 2, but not syllables sharing the place value of Category 3. Similarly, voicing was better represented by the voiced value than by the voiceless one, since syllable pairing on the basis of manner or place decreased when it was in competition with a pair of voiced consonants but not with a pair of voiceless ones, both in the visual and the audio-visual versions. Therefore, the data presented in this chapter argue for the prominent status of manner as a phonetic feature type in guiding intuitively estimated acoustic similarity between printed consonants, both in the visual and audio-visual versions of the syllable matching task. Voicing seems to play a secondary role, but influences decisions all the same, especially when voicing similarity is represented by a pair of voiced consonants. Finally, in our experiments the hierarchy of phonetic feature types seems to be progressively taken into account during childhood. In second graders, no significant bias to pair syllables according to one or another phonetic category was observed. By contrast, responses provided by third graders reflect the emergence of the prominent status of manner similarity over place and voicing similarity. However, this prominence was still developing and was not as consistent as in adults, since manner was not yet preferred regardless of the feature value (plosive or fricative) shared by the consonants. The testing of fourth and fifth graders is in progress and may allow us to investigate more precisely the gradual involvement of a hierarchical organisation of phonetic feature types in metalinguistic tasks.
References

Bedoin, Nathalie
1995 Articulation de codages phonologiques et sémantiques en lecture silencieuse. Revue de Phonétique Appliquée 115: 101-117.
1998 Phonological feature activation in visual word recognition: The case of voicing. Paper presented at the Xth Conference of the European Society for Cognitive Psychology (ESCOP), September 1998, Jerusalem, Israel.
2003 Sensitivity to voicing similarity in printed stimuli: Effect of a training programme in dyslexic children. Journal of Phonetics 31: 541-546.
Bedoin, Nathalie and Hubert Chavand
2000 Functional hemispheric asymmetry in voicing feature processing in reading. Paper presented at the Tenth Annual Meeting of the Society for Text and Discourse, July 2000, Lyon, France.
Berent, Iris
1997 Phonological priming in the lexical decision task: Regularity effects are not necessary evidence for assembly. Journal of Experimental Psychology: Human Perception and Performance 23: 1727-1742.
Berent, Iris and Charles A. Perfetti
1995 A Rose is a REEZ: The two-cycles model of phonology assembly in reading English. Psychological Review 102: 146-184.
Blumstein, Sheila E.
1990 Phonological deficits in aphasia: Theoretical perspectives. In: Alfonso Caramazza (ed.), Cognitive neuropsychology and neurolinguistics: Advances in models of cognitive function and impairment, 33-53. Hillsdale NJ: Lawrence Erlbaum Associates.
Bosman, Anna M. T. and Annette M. B. de Groot
1996 Phonologic mediation is fundamental to reading: Evidence from beginning readers. The Quarterly Journal of Experimental Psychology 49: 715-744.
Bosman, Anna M. T. and Guy C. Van Orden
1997 Why spelling is more difficult than reading. In: Charles A. Perfetti, Laurence Rieben and Michel Fayol (eds.), Learning to spell, 173-194. Hillsdale NJ: Lawrence Erlbaum Associates.
Brysbaert, Marc
2001 Prelexical phonological coding of visual words in Dutch: Automatic after all. Memory and Cognition 29: 765-773.
Caplan, David and Jennifer Aydelott Utman
1994 Selective acoustic phonetic impairment and lexical access in an aphasic patient. The Journal of the Acoustical Society of America 95: 512-517.
Clements, Nick
1985 The geometry of phonological features. Phonology Yearbook 2: 225-252.
Cohen, Henri and Norman S. Segalowitz
1990a The role of linguistic prosody in the perception of time-compressed speech: A laterality study. Journal of Clinical and Experimental Neuropsychology 12: 39.
1990b Cerebral hemispheric involvement in the acquisition of new phonetic categories. Brain and Language 38: 398-409.
Coltheart, Max, Kathleen Rastle, Conrad Perry, Robyn Langdon and Johannes Ziegler
2001 DRC: A dual route cascaded model of visual word recognition and reading aloud. Psychological Review 108: 204-256.
Connine, Cynthia M., Dawn G. Blasko and Debra Titone
1993 Do the beginnings of spoken words have a special status in auditory word recognition? Journal of Memory and Language 32: 193-210.
Connine, Cynthia M., Debra Titone, Thomas Deelman and Dawn G. Blasko
1997 Similarity mapping in spoken word recognition. Journal of Memory and Language 37: 463-480.
Ferrand, Ludovic and Jonathan Grainger
1992 Phonology and orthography in visual word recognition: Evidence from masked non-word priming. The Quarterly Journal of Experimental Psychology 45: 353-372.
1994 Effects of orthography are independent of phonology in masked form priming. The Quarterly Journal of Experimental Psychology 47(A): 365-382.
Frisch, Stefan A., Janet B. Pierrehumbert and Michael B. Broe
2004 Similarity avoidance and the OCP. Natural Language and Linguistic Theory 22: 179-228.
Frost, Ram
1998 Toward a strong phonological theory of visual word recognition: True issues and false trails. Psychological Bulletin 123: 71-99.
Gibbs, Patrice and Guy C. Van Orden
1998 Pathway selection's utility for control of word recognition. Journal of Experimental Psychology: Human Perception and Performance 24: 1162-1187.
Goldinger, Stephen D., Paul A. Luce and David B. Pisoni
1989 Priming lexical neighbors of spoken words: Effects of competition and inhibition. Journal of Memory and Language 28: 501-518.
Goldinger, Stephen D., Paul A. Luce, David B. Pisoni and J. K. Marcario
1992 Form-based priming in spoken word recognition: The roles of competition and bias. Journal of Experimental Psychology: Learning, Memory and Cognition 18: 1211-1238.
Gow, David W. and David Caplan
1996 An examination of impaired acoustic-phonetic processing in aphasia. Brain and Language 52: 386-407.
Grainger, Jonathan and Ludovic Ferrand
1996 Masked orthographic and phonological priming in visual word recognition and naming: Cross-task comparisons. Journal of Memory and Language 35: 623-647.
Jaeger, Jeri J.
1992 Phonetic features in young children's slips of the tongue. Language and Speech 35: 189-205.
Jusczyk, Peter W., Mara B. Goodman and Angela Baumann
1999 Nine-month-olds' attention to sound similarities in syllables. Journal of Memory and Language 40: 62-82.
Krifi, Sonia, Nathalie Bedoin and Vania Herbillon
2003 Phonetic priming and backward masking in printed stimuli: A better understanding of normal reading and dyslexia. Paper presented at the XIIIth Conference of the European Society for Cognitive Psychology (ESCOP), September 2003, Granada, Spain.
2005 The hierarchy of phonetic features categories in printed syllables matching: Normal reading and developmental dyslexia. Poster presented at the XIVth Conference of the European Society for Cognitive Psychology (ESCOP), September 2005, Leiden, Netherlands.
Krifi, Sonia, Nathalie Bedoin and Anne Mérigot
2003 Effects of voicing similarity between consonants in printed stimuli, in normal and dyslexic children. Current Psychology Letters: Behaviour, Brain, and Cognition 10: 1-7.
Lesch, Mary F. and Alexander Pollatsek
1993 Automatic access of semantic information by phonological codes in visual word recognition. Journal of Experimental Psychology: Learning, Memory and Cognition 19: 285-294.
1998 Evidence for the use of assembly phonology in accessing the meaning of printed words. Journal of Experimental Psychology: Learning, Memory and Cognition 24: 573-592.
Luce, Paul A., Stephen D. Goldinger, Edward T. Auer Jr. and Michael S. Vitevitch
2000 Phonetic priming, neighborhood activation, and PARSYN. Perception and Psychophysics 62: 615-625.
Lukatela, Georgije, Thomas A. Eaton, C. Lee and M. T. Turvey
2001 Does visual word identification involve a sub-phonemic level? Cognition 78: 41-52.
Lukatela, Georgije, Katerina Lukatela and M. T. Turvey
1993 Further evidence for phonological constraints on visual lexical access: TOWED primes FROG. Perception and Psychophysics 53: 461-466.
Lukatela, Georgije and M. T. Turvey
1991 Phonological access of the lexicon: Evidence from associative priming with pseudohomophones. Journal of Experimental Psychology: Human Perception and Performance 17: 951-966.
1994 Visual lexical access is initially phonological: 2. Evidence from phonological priming by homophones and pseudohomophones. Journal of Experimental Psychology: General 123: 331-353.
Luo, Chun R., Reed A. Johnson and David A. Gallo
1998 Automatic activation of phonological information in reading: Evidence from the semantic relatedness decision task. Memory and Cognition 26: 833-843.
MacNeilage, Peter and Barbara L. Davis
2000 On the origin of internal structure of word forms. Science 288: 527-531.
Marslen-Wilson, William D., Helen E. Moss and Stef van Halen
1996 Perceptual distance and competition in lexical access. Journal of Experimental Psychology: Human Perception and Performance 22: 1376-1392.
Marslen-Wilson, William D. and Paul Warren
1994 Levels of perceptual representation and process in lexical access: Words, phonemes, and features. Psychological Review 101: 653-675.
McClelland, James L. and David E. Rumelhart
1981 An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review 88: 375-407.
McMurray, Bob, Michael K. Tanenhaus and Richard N. Aslin
2002 Gradient effects of within-category phonetic variation on lexical access. Cognition 86: 33-42.
McQueen, James M., Delphine Dahan and Anne Cutler
2003 Continuity and gradedness in speech processing. In: Niels Schiller and Antje Meyer (eds.), Phonetics and phonology in language comprehension and production: Differences and similarities, 39-78. Berlin: Mouton de Gruyter.
Miceli, Gabriele, Carlo Caltagirone, Guido Gainotti and Paola Payer-Rigo
1978 Discrimination of voice versus place contrasts in aphasia. Brain and Language 6: 47-51.
Miller, George A. and Patricia E. Nicely
1955 An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America 27: 338-352.
Molfese, Dennis L.
1978 Neural correlates of categorical speech perception in adults. Brain and Language 5: 25-35.
Molfese, Dennis L. and Veronica J. Molfese
1988 Right-hemisphere responses from preschool children to temporal cues to speech and non-speech materials: Electrophysiological correlates. Brain and Language 33: 245-259.
Oscar-Berman, Marlene, Edgar B. Zurif and Sheila Blumstein
1975 Effects of unilateral brain damage on the processing of speech sounds. Brain and Language 2: 345-355.
Perecman, Ellen and Lucia Kellar
1981 The effect of voice and place among aphasic, nonaphasic right-damaged, and normal subjects on a metalinguistic task. Brain and Language 12: 213-223.
Perfetti, Charles A. and Laura Bell
1991 Phonemic activation during the first 40 ms of word identification: Evidence from backward masking and priming. Journal of Memory and Language 30: 473-485.
Perfetti, Charles A. and Sulan Zhang
1991 Phonemic processes in reading Chinese words. Journal of Experimental Psychology: Learning, Memory and Cognition 17: 633-643.
Peter, Mira and M. T. Turvey
1994 Phonological codes are early sources of constraint in visual semantic categorization. Perception and Psychophysics 55: 497-504.
Peters, Robert W.
1963 Dimensions of perception of consonants. The Journal of the Acoustical Society of America 35: 1985-1989.
Rayner, Keith, Sara C. Sereno, Mary F. Lesch and Alexander Pollatsek
1995 Phonological codes are automatically activated during reading: Evidence from an eye movement priming paradigm. Psychological Science 6: 26-32.
Rebattel, Magalie and Nathalie Bedoin
2001 Cerebral hemispheric asymmetry in voicing and manner of articulation processing in reading. Poster presented at the XIIth Conference of the European Society for Cognitive Psychology (ESCOP 12), September 2001, Edinburgh, Scotland.
Rogers, Margaret A. and Holly L. Storkel
1998 Reprogramming phonologically similar utterances: The role of phonetic features in pre-motor encoding. Journal of Speech, Language, and Hearing Research 41: 258-274.
Segalowitz, Norman S. and Henri Cohen
1989 Right hemisphere EEG sensitivity to speech. Brain and Language 37: 220-231.
Simos, Panagiotis G., Dennis L. Molfese and Rebecca A. Brenden
1997 Behavioral and electrophysiological indices of voicing-cue discrimination: Laterality patterns and development. Brain and Language 57: 122-150.
Slowiaczek, Louisa M., Howard C. Nusbaum and David B. Pisoni
1987 Phonological priming in auditory word recognition. Journal of Experimental Psychology: Learning, Memory and Cognition 13: 64-75.
Sparrow, Laurent and Sébastien Miellet
2002 Activation of phonological codes during reading: Evidence from errors detection and eye movements. Brain and Language 81: 509-516.
Stevens, Kenneth N.
2002 Toward a model for lexical access based on acoustic landmarks and distinctive features. The Journal of the Acoustical Society of America 111: 1872-1891.
Studdert-Kennedy, Michael and Donald Shankweiler
1970 Hemispheric specialization for speech perception. The Journal of the Acoustical Society of America 48: 579-594.
Studdert-Kennedy, Michael, Donald Shankweiler and David B. Pisoni
1972 Auditory and phonetic processes in speech perception: Evidence from a dichotic study. Cognitive Psychology 3: 455-466.
Vallée, Nathalie, Solange Rossato and Isabelle Rousset
This volume.
Van der Hulst, Harry
2005 Molecular structure of phonological segments. In: Philip Carr, Jacques Durand and Colin J. Ewen (eds.), Headhood, elements, specification and contrastivity, 193-234. Amsterdam: John Benjamins Publishing Company.
Van Orden, Guy C.
1987 A ROWS is a ROSE: Spelling, sound, and reading. Memory and Cognition 15: 181-198.
Van Orden, Guy C. and Stephen D. Goldinger
1994 Interdependence of form and function in cognitive systems explains perception of printed words. Journal of Experimental Psychology: Human Perception and Performance 20: 1269-1291.
Van Orden, Guy C., Marian A. Jansen op de Haar and Anna M. T. Bosman
1998 Complex dynamic systems also predict dissociations, but they do not reduce to autonomous components. In: Alfonso Caramazza (ed.), Access of phonological and orthographic lexical forms: Evidence from dissociations in reading and spelling, 131-166. Hove: The Psychology Press.
Van Orden, Guy C., James C. Johnston and Benita L. Hale
1988 Word identification in reading proceeds from spelling to sound to meaning. Journal of Experimental Psychology: Learning, Memory, and Cognition 14: 371-386.
Wang, Marilyn D. and Robert C. Bilger
1973 Consonant confusions in noise: A study of perceptual features. The Journal of the Acoustical Society of America 54: 1248-1266.
Ziegler, Johannes C. and Arthur M. Jacobs
1995 Phonological information provides early sources of constraint in the processing of letter strings. Journal of Memory and Language 34: 567-593.
Ziegler, Johannes C., Guy C. Van Orden and Arthur M. Jacobs
1997 Phonology can help or hurt the perception of print. Journal of Experimental Psychology: Human Perception and Performance 23: 845-860.
Part 4: Complexity in the course of language acquisition
Self-organization of syllable structure: a coupled oscillator model

Hosung Nam, Louis Goldstein and Elliot Saltzman

1. Syllable structure*

It has generally been claimed that every language has syllables with onsets (CV structure), while languages may or may not allow coda consonants (VC structure). While recent work on Arrernte (Breen & Pensalfini, 1999) has cast doubt on the absolute universality of onsets, it is clear that there is a significant cross-linguistic preference for CV structure (Clements & Keyser, 1983; Clements, 1990). In addition, evidence from phonological development shows that CV structure is typically acquired before VC structure (e.g., Vihman & Ferguson, 1987; Fikkert, 1994; Demuth & Fee, 1995; Gnanadesikan, 1996; Salidis & Johnson, 1997; Levelt et al., 2000). This preference for CV structure in distribution and acquisition has been claimed to arise from universal grammar (UG; Chomsky, 1965), in which CV is the unmarked, core syllable structure. Yet, the UG hypothesis does not answer the question "Why is CV the most unmarked structure?" This study aims to provide a rationale, grounded in dynamical systems theory, for why CV is favored across languages. The self-organization of syllable structure in phonological development is simulated using a model in which syllable structures are defined by the coupling graph in a system of gestural planning oscillators that control patterns of relative intergestural timing. The simulation shows that, due to the hypothesized stronger coupling inherent in CV graphs compared to VC graphs, CVs emerge earlier than VCs. The model of syllable structure based on coupled planning oscillators (see section 2 below) has been developed to account for a variety of empirical observations about speech production (Nam & Saltzman, 2003; Goldstein et al., 2006; Nam et al., submitted a, b), independently of any consideration of acquisition facts. In addition to the explanatory weakness of the UG hypothesis, two further empirical observations about phonological development are not easily accommodated by it. One is that the delay in the emergence of coda consonants varies across languages as a function of the frequency of coda consonants in adults' word production (Roark & Demuth
2000). Thus, both intrinsic and extrinsic factors interact in the development of syllable structure. The second observation is that, unlike the acquisition patterns of single consonants, it has been shown in several languages that consonant clusters can appear earlier in word- (or syllable-) final position than in word-initial position (Mexican Spanish: Macken, 1977; Telugu: Chervela, 1981; German and Spanish: Lleó & Prinz, 1996; Dutch: Levelt, Schiller, & Levelt, 2000; English: Templin, 1957, Paul & Jennings, 1992, Dodd, 1995, Watson & Scukanec, 1997, McLeod et al., 2001, Kirk & Demuth, 2003), though the opposite pattern can also be observed (for example, Dutch: Levelt et al., 2000). In the self-organization model presented in this paper, both of these facts are readily accounted for. The first follows from hypothesizing a self-organizing process that includes both intrinsic dynamic constraints on planning intergestural timing and the attunement of speaker/listener agents to the behavior of other agents. The second follows from the independently motivated hypothesis (Browman & Goldstein, 2000; Nam & Saltzman, 2003) that the production of consonant clusters can involve competition between the coupling of (all of) the consonants to the vowel, and the coupling of the consonants to one another. As we will see in the rest of the paper, the fact that CV coupling is stronger than VC coupling makes it easier to learn to produce single onset Cs than coda Cs, but the competition provided by the stronger CV coupling in onsets makes it more difficult to learn to coordinate consonant clusters.
2. A coupled oscillator model of syllable structure

Within the framework of articulatory phonology (e.g., Browman & Goldstein, 1992; 1995), word forms are analyzed as molecules built up by combining discrete speech gestures, which function simultaneously as units of speech production (regulating constriction actions) and units of (phonological) information. In these molecules, the relative timing of gestures is also informationally or phonologically significant. For example, the words 'mad' and 'ban' are composed of the identical set of velum, tongue tip, tongue body, and lip gestures. The only difference between these two words is in the velum gesture's timing with respect to the other gestures. Thus, there must be some temporal glue in speech production that keeps the gestures appropriately coordinated. As with the gestures themselves, this glue appears to have both a regulatory and an informational function. A sketch of this representational idea is given below.
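As an illustration only (our own sketch in Python, not the authors' implementation), a word can be represented as a set of gestures with activation intervals; 'mad' and 'ban' then share the same gestural atoms and differ only in when the velum gesture is active.

from dataclasses import dataclass

@dataclass
class Gesture:
    organ: str      # constricting system, e.g. 'lips', 'velum'
    goal: str       # constriction goal, e.g. 'closure', 'wide'
    onset: float    # activation onset (arbitrary time units)
    offset: float   # activation offset

mad = [Gesture('velum', 'wide', 0.0, 0.3),   # nasality overlaps initial /m/
       Gesture('lips', 'closure', 0.0, 0.3),
       Gesture('tongue body', 'vowel constriction', 0.0, 1.0),
       Gesture('tongue tip', 'closure', 0.7, 1.0)]

ban = [Gesture('velum', 'wide', 0.7, 1.0),   # nasality overlaps final /n/
       Gesture('lips', 'closure', 0.0, 0.3),
       Gesture('tongue body', 'vowel constriction', 0.0, 1.0),
       Gesture('tongue tip', 'closure', 0.7, 1.0)]

# Same gestural "atoms": only the velum gesture's timing distinguishes them.
assert {(g.organ, g.goal) for g in mad} == {(g.organ, g.goal) for g in ban}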
2.1. A coupled oscillator model of intergestural timing

We have been developing a model of speech production planning in which dynamic coupling plays the role of temporal glue (Saltzman & Byrd, 2000; Nam & Saltzman, 2003; Goldstein et al., 2006; Nam, in press; Nam et al., submitted a, b). The central idea is that each speech gesture is associated with a nonlinear planning oscillator, or clock, and the activation of that gesture is triggered at a particular phase (typically 0°) of its oscillator. A pair of gestures can be coordinated in time by coupling their corresponding oscillators to one another so that the oscillators settle into a stable pattern of relative phasing during planning. Once this pattern stabilizes, the activation of each gesture is triggered by its respective oscillator, and a stable relative timing of the two gestures is achieved. There are two sources of evidence supporting the hypothesis that the relative timing of gestures is controlled by coupling their planning oscillators. The first comes from experiments on phase-resetting. When subjects repeat a particular word, the gestures composing that word exhibit stable relative phasing patterns. When the ongoing repetition is mechanically perturbed, the characteristic pattern of phasing is quickly re-established (reset) in a way that is consistent with the behavior of coupled oscillators (Saltzman et al., 1998). When a word is produced only once, instead of repeatedly, qualitatively similar phase-resetting is also observed. The second source of evidence comes from experiments on the kinematics of speech errors (Goldstein et al., in press). Subjects repeat phrases like "cop top," in which the tongue dorsum gesture for /k/ and the tongue tip gesture for /t/ alternate. Over time, subjects' productions tend to shift to a new pattern, in which the tongue tip and the tongue dorsum gestures are produced synchronously at the beginning of both words, causing the perception of speech errors (Pouplier & Goldstein, 2005). These errors have been interpreted as resulting from a shift to a more stable mode of frequency-locking among the gestural timing oscillators that compose these words. Specifically, in normal productions, there is a 1:2 relation between the frequency of the tongue tip (or dorsum) oscillators and the oscillators for a vowel or final C. The new (more effortful) pattern exhibits a more stable, 1:1 frequency relation. Such shifts to more stable modes of frequency-locking have been observed in several types of bimanual coordination in humans (Turvey, 1990; Haken et al., 1996). Very similar kinds of changes are observed in tasks that do not involve any overt repetition (Pouplier, in press); in that case, the shifts must occur in the planning process.
To illustrate the planning model, consider Fig. 1. The word "bad" is composed of three gestures: a Lip closure, a wide palatal constriction of the Tongue Body, and a Tongue Tip closure. A typical arrangement of these gestures in time, as can be observed from kinematic data, is shown in the gestural score in (a). The width of each box represents that gesture's activation interval, i.e., the time during which its dynamics control the appropriate constricting system (lips, tongue body, tongue tip). These activation intervals are the output of the coupled oscillator model of planning. The input to the planning model is a coupling graph, shown in (b). The graph specifies how the oscillators controlling the gestures' timing are coupled to one another. It shows that the oscillator for the palatal Tongue Body (/a/) gesture is coupled to both the Lip (labial closure) and Tongue Tip (alveolar closure) oscillators. The solid line connecting the TB gesture to the Lip gesture indicates that the coupling target of those gestures is specified as an in-phase relation between the oscillators, while the dotted line connecting the TB gesture to the Tongue Tip gesture indicates that an anti-phase coupling target is specified.
Figure 1. (a) Gestural score for 'bad.' Time is on the horizontal axis. (b) Coupling graph for 'bad.' Solid line indicates an in-phase coupling target; dotted line indicates an anti-phase coupling target. After Goldstein et al. (2006).
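To make the notion of a coupling graph concrete, here is a hypothetical encoding of the graph in (b). The names and data structures are our own illustrative choices, not the authors' implementation:

    # Hypothetical encoding of the coupling graph for 'bad' (Fig. 1b); the
    # dict/tuple representation is our own illustrative choice.
    gestures = {
        "LIPS": "labial closure (/b/)",
        "TB":   "wide palatal constriction (/a/)",
        "TT":   "alveolar closure (/d/)",
    }

    # Each edge couples two planning oscillators at a target relative phase,
    # in degrees: 0 = in-phase (solid line), 180 = anti-phase (dotted line).
    coupling_edges = [
        ("LIPS", "TB", 0),    # onset consonant in-phase with the vowel
        ("TB",   "TT", 180),  # coda consonant anti-phase with the vowel
    ]

    for g1, g2, phase in coupling_edges:
        print(f"{g1} -- {g2}: target relative phase = {phase} deg")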
At the onset of the planning simulation for an utterance, each (internal) oscillator is set to an arbitrary initial phase and the oscillators are set into motion. Over time, the coupling among the oscillators causes them to settle into a stable pattern of relative phasing. The model that accomplishes this,
the task dynamics model of relative phasing, first developed by Saltzman & Byrd (2000) for single pairs of gestures, has been extended to a network of multiple couplings (Nam & Saltzman, 2003; Nam et al., submitted a, b). The coupling between each pair of gestures is controlled by a (cosine-shaped) potential function defined over their relative phase, with a minimum at the target (intended) relative phase value. When the relative phase of two oscillators differs from its target, forces derived from the potential functions are applied to the individual oscillators; these forces have the effect of bringing their relative phase closer to the target value. The steady-state output of this planning process is a set of oscillations with stabilized relative phases. From this output, the gestural score (for example, the one for 'bad' in (a)) is produced: gestural onset and offset times are specified as a function of the steady-state pattern of inter-oscillator phasing. So, because the TB and Lip oscillators (Fig. 1b) settle into a steady-state, in-phase pattern, their activation intervals (Fig. 1a) are triggered synchronously. The TB and TT oscillators settle into an anti-phase pattern, and their activations show the TT gesture to be triggered substantially later than the TB gesture. Gestural scores can then serve as input to the constriction dynamics model (Saltzman & Munhall, 1989), which generates the resulting time functions of constriction (tract) variables and articulator trajectories. Articulator trajectories can then be used to calculate the acoustic output in our vocal tract model.

2.2. Intrinsic modes of coupling

One theoretical advantage of modeling timing with oscillator coupling graphs is that systems of coupled nonlinear oscillators can display multiple stable modes. These modes have been shown to play a role in the coordination of oscillatory movements of multiple human limbs (fingers, arms, legs; see Turvey, 1990, for a review). Such experiments show that, when asked to oscillate limbs in a regular coordinated way, subjects can do so readily, without any training or learning, as long as the task is to coordinate them in an in-phase (0° relative phase) or anti-phase (180° relative phase) pattern. Other phase relations, e.g. those required in complex drumming, can be acquired through learning, but only after significant training. While these two modes of coupling are (intrinsically) available without training, they are not equally stable. This has been demonstrated in experiments in which the frequency (rate) of subjects' oscillation is manipulated.
When subjects oscillate two limbs in an anti-phase pattern and the frequency is increased, the relative phasing undergoes a spontaneous transition to the in-phase pattern. However, if subjects begin oscillating in the in-phase pattern, an increase of oscillation frequency has no effect on the relative phase. From these results, it has been concluded that the in-phase mode is the more stable one. These experimental results were the basis for Haken, Kelso and Bunz’s (1985) model of coordinated, rhythmic behavior, which can be understood as a self-organized process governed by low-dimensional nonlinear dynamics. They developed a simple potential function (the HKB potential function) that can account quantitatively for the results of these experiments.
V(ψ) = −a cos(ψ) − b cos(2ψ),  where ψ = φ2 − φ1

Figure 2. HKB potential function. a is varied from 1 (left) to 2 (center) to 4 (right).
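As a concrete check on this description (a minimal sketch of our own, not code accompanying Haken et al., 1985), the snippet below evaluates the potential at the two candidate minima for the three b/a ratios of Fig. 2; the in-phase well is always deeper, and the anti-phase attractor vanishes as b/a reaches 1/4:

    import numpy as np

    def hkb_potential(psi, a=1.0, b=1.0):
        # HKB potential V(psi) = -a*cos(psi) - b*cos(2*psi), psi in radians.
        return -a * np.cos(psi) - b * np.cos(2.0 * psi)

    # b is held at 1 while a is varied, so b/a = 1, 1/2, 1/4 as in Fig. 2.
    # The anti-phase minimum at psi = pi is an attractor only while its local
    # curvature V''(pi) = 4b - a remains positive, i.e. while b/a > 1/4.
    for a in (1.0, 2.0, 4.0):
        v_in, v_anti = hkb_potential(0.0, a=a), hkb_potential(np.pi, a=a)
        print(f"b/a = {1.0 / a:.2f}: V(0) = {v_in:+.1f}, V(pi) = {v_anti:+.1f}, "
              f"V''(pi) = {4.0 - a:+.1f}")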
The function, shown in Fig. 2, is the sum of two cosine functions of relative phase, one of which has half the period of the other. a and b are the weighting coefficients of the two cosine functions, and their ratio, b/a, determines the shape of the potential landscape. The left-most example in Fig. 2 shows the shape of the function for b/a = 1. There are two potential minima, at 0 and 180 degrees, and, depending on initial conditions, relative phasing can stabilize at either of the two minima, making them attractors. However, the valley associated with the in-phase minimum is both deeper and broader; it thus has a larger basin of attraction, because a larger range of initial values of ψ will eventually settle into that minimum. The experimental results suggest that frequency of oscillation (rate) is a control parameter for the system: as it is scaled upwards in a continuous fashion, the behavior of the system undergoes a qualitative change at some critical point. If the value of b/a is specified as an inverse function of oscillation frequency, then the HKB model in Fig. 2 predicts an abrupt phase transition from the anti-phase to the in-phase pattern as the rate increases. This can be seen by comparing the function for the different values of b/a shown in Fig. 2. As b/a decreases from 1.0, the basin of the anti-phase mode becomes shallower. Eventually, the attractor disappears when
b/a = 1/4. At this point, the anti-phase pattern becomes unstable and the relative phase will be attracted to the minimum at 0°.

2.3. Syllable structure and coupling modes

For a system like speech, which is acquired early in a child's life and without explicit training, it would make sense for the early coordination of speech gestures to take advantage of these intrinsically available modes. It has been proposed (e.g., Goldstein et al., 2006) that phonological systems make use of the in-phase and anti-phase modes, and that these modes form the basis of syllable structure in phonology. Treating phonology as a fundamentally combinatorial system, consider the problem of coordinating two gestures, a consonant gesture and a vowel gesture, given the predisposition to exploit the intrinsically distinct in-phase and anti-phase modes. Goldstein et al. (2006) have proposed the coupling hypothesis of syllable structure: in-phase coupling of C and V planning oscillators underlies what we observe as CV structures and, more generally, the relation between onset and nucleus gestures; the anti-phase mode of coupling planning oscillators underlies VC structures and the relation between nucleus and coda gestures. Evidence for the in-phase mode in CV structures can be found in the fact that the constriction actions for the C and V gestures in CVs are initiated synchronously. For example, in Fig. 1, the activation of the Lip gesture and the Tongue Body gesture begins at the same time. This synchrony follows from the hypothesis that the oscillators associated with the C and V gestures are coupled so as to settle into an in-phase mode, together with the model assumption that gestural activation is triggered at phase 0 of a gesture's oscillator. The idea that consonant and vowel gestures are triggered synchronously goes back to the pioneering work of Kozhevnikov & Chistovich (1965). Kinematic data for V1pV2 and V1bV2 utterances presented by Löfqvist & Gracco (1999) show that the onset of lip movement for /p/ or /b/ and the onset of tongue body movement for V2 occur within 50 ms of one another, across all four subjects and all six different V1V2 patterns, with only two outlier values. In the case of a coda /p/, the relation to the vowel is obviously not one of synchrony, so the evidence that the oscillators exhibit the anti-phase relation is necessarily indirect. The anti-phase relation implies that the final /p/ will be triggered at 180° of the vowel gesture's oscillator cycle. The point in time that corresponds to 180° will, of course, depend on the frequency of
the vowel oscillator. In Nam et al. (submitted b), we show that simple hypotheses about the frequencies of vowel and consonant oscillators, combined with the hypothesis of in-phase CV and anti-phase VC coupling, can account for a rich set of quantitative phonetic timing data.

The coupling hypothesis can also be used to explain a variety of qualitative properties of CV and VC structures, including the following:

• Universality. CV syllables occur in all human languages, while VC ones do not. Since the in-phase mode is more stable and stronger (the potential well in Fig. 2 is deeper for the in-phase mode) and has a larger basin of attraction than the anti-phase mode, it follows that the in-phase (CV) mode should always be available for coordinating Cs and Vs in a language, while the anti-phase (VC) mode may not be.

• Combinatoriality. Even in languages which allow VC structures, vowels and codas often exhibit restrictions when combined. Onsets and rimes, in contrast, can usually combine freely. Indeed, their relatively free combinatoriality is the major source of phonological generativity and the basis for the traditional decomposition of the syllable into onset and rime. Goldstein et al. (2006) propose that there is a relation between the stability/strength of coupling and combinatoriality. The idea is that it is possible to jointly perform any two actions as long as they are coordinated in-phase, because this coordination is intrinsically the most stable. Even though anti-phase coordinations are more stable than other out-of-phase modes, speakers must still learn not to use the (even more stable) in-phase coordination for forms that have coda Cs. Some combinations may be difficult to learn, leading to combinatorial restrictions.

• Re-syllabification. Single, intervocalic coda consonants may be re-syllabified into onset position in running speech, particularly as speech rate increases (Stetson, 1951; Tuller & Kelso, 1991; de Jong, 2001a, b). This follows automatically from the HKB model (Fig. 2), where CV is defined as in-phase and VC as anti-phase. (Re-syllabification attaches the C to the following syllable, rather than to the preceding one, because the final C and the following V are already roughly synchronous, even if they are not coupled with each other at all; they would therefore fall into the basin of an in-phase attractor.)

Another key hypothesis in the coupled oscillator planning model is that incompatible coupling specifications can compete with one another and
that, during planning, the system of oscillators settles to a set of steady-state relative phases that is the result of the competition. The use of competitive coupling was originally proposed for CC and CV coupling in onset clusters (Browman & Goldstein, 2000). All C gestures in an onset were hypothesized to be coupled in-phase with the vowel (this is what defines an onset consonant). For some combinations of C gestures, such as oral constrictions combined with velic or glottal gestures, synchronizing multiple C gestures results in a recoverable structure; the result is what is usually analyzed as a multiple-gesture segment (nasal, lateral, voiceless stop). In other cases (clusters such as /sp/, for example), synchronous coordination does not produce a recoverable structure, and the two gestures must be at least partially sequential (Goldstein et al., 2006; Nam, in press). Therefore, the oral consonant gestures must also be coupled anti-phase to each other. Browman and Goldstein (2000) proposed that this competitive structure could account for a previously observed generalization about the relative timing of consonant and vowel gestures in forms with onset clusters (the so-called 'c-center' effect; Browman & Goldstein, 1988; Byrd, 1995). As Cs are added to an onset, the timing of all the Cs relative to the vowel is shifted: the C closest to the vowel shifts rightward to overlap the vowel more, while the first C slides leftward, away from the vowel. The temporal center of the sequence, the c-center, maintains relatively invariant timing with respect to the vowel. This effect has since been modeled in coupled oscillator simulations (Nam & Saltzman, 2003; Nam et al., submitted b): in-phase coupling of two onset Cs with the V, combined with anti-phase coupling of the two Cs with each other, results in an output in which the phasing of C1 and C2 to the V is −60° and 60°, respectively, and the C1-C2 phasing is 120°, a pattern consistent with the available data.

However, the available evidence suggests that this kind of competitive structure may or may not be found in coda consonants, depending on the language (Nam, in press), or possibly on the speaker. In English, coda clusters do not exhibit the c-center effect consistently (Honorof & Browman, 1995), though it may be found for some speakers (Byrd, 1995). Browman & Goldstein (2000) hypothesized a non-competitive structure for codas in English: the first coda C is coupled anti-phase with the vowel, and the second coda C is coupled anti-phase with the first. More recent work (Nam & Saltzman, 2003; Nam et al., submitted b) has shown that this hypothesized difference between onset and coda clusters for English accounts for the lack of a coda c-center effect, and also for the fact that gestures in onset clusters exhibit less variability in relative timing than do gestures in coda clusters. When
noise is added to the coupled oscillator simulation, the competitively-coupled onset oscillators exhibit less trial-to-trial variability than the non-competitively-coupled coda oscillators. This hypothesized difference in coupling topology between onsets and codas has also been argued (Nam, in press) to be part of the explanation for a cross-linguistic generalization: coda clusters can add metrical weight to a syllable, but onset clusters rarely do. (One exception is Ratak, a dialect of Marshallese; Bender, 1999.) The idea is that weight is partly determined by the duration of a syllable (Gordon, 2004). While adding a C to a coda increases the duration of the rime (and of the whole syllable) by the duration of that consonant, adding a C to the onset increases the duration of the whole syllable by only about half the duration of the C. Cross-language differences in the presence of competitive vs. non-competitive coupling structure in codas have also been proposed (Nam, in press) in order to account for the differing moraic status of coda Cs across languages. English and other languages in which coda Cs are moraic are modeled with a non-competitive structure in codas: adding Cs to a coda is predicted not to decrease the duration of the vowel, so each added C increases the duration of the entire syllable substantially. Languages in which coda Cs are not moraic (e.g. Malayalam) are modeled with a competitive structure in the coda, which causes vowel shortening as Cs are added, and, as a result, a lack of weight associated with the added C. (Nam (in press) showed how these hypothesized coupling differences can account for acoustic data from these language types; Broselow et al., 1997.)

Thus, there is a strong asymmetry between onsets and codas: onsets always have a competitive structure, but codas may lack one. Regardless of these topological differences, however, onset Cs are characterized by in-phase couplings, while coda Cs are characterized by anti-phase couplings. The asymmetry between onsets and codas led to the hypothesis that, due to the greater intrinsic strength of in-phase coupling, all prevocalic Cs are pulled into the in-phase relation with the V, whereas coda Cs can (and in some languages do) escape the pull of anti-phase coupling with the V (Nam, in press).

In this paper, we show that the difference in coupling modes between onsets and codas can also account for the difference in the time course of acquisition between CV and VC structures. To show this, we performed simulation experiments investigating the self-organization of syllable structure, in which the only relevant pre-linguistic structure attributed to the
child is that (s)he comes equipped with (a) the HKB potential function for the pairwise coordination of multiple actions and (b) the ability to attune his/her behavior to the behaviors of others in his/her environment.
3. Self-organization of syllable structure

We investigated the self-organization of syllable structure in a series of simulations with a computational agent model. Agent models have been employed to investigate several aspects of phonological and phonetic structure, such as the partitioning of physical continua into discrete phonetic categories (Oudeyer, 2002, 2005, 2006; Goldstein, 2003), the structure of vowel systems (de Boer, 2001; Oudeyer, 2006), consonant-vowel differentiation (Oudeyer, 2005), and sequentiality in consonant sequences (Browman & Goldstein, 2000). In these simulations, the agents interact using a very simplified set of local behaviors and constraints. Through these interactions, the agents' internal states evolve, as do the more global properties of the system. Depending on the choice of constraints, the system may evolve to have quite different properties. Thus, for example, the importance of some constraint (k) in the evolution of some property of interest (P) can be evaluated by contrasting the results of simulations with and without that constraint. The models are not meant to be faithful simulations of the detailed process of (phylogenetic or ontogenetic) evolution of some property, but rather a way of testing the natural attractors of a simple system that includes the constraint of interest.

The models employed here involve a child agent with no syllable structure and an adult agent with a developed syllable structure. The child comes to the simulation with two distinct classes of actions (C and V), and it attempts to coordinate them in time. The existence of distinct C and V actions early in the child's development (e.g. during babbling) has been denied in the frame-content model of speech production development (MacNeilage and Davis, 1998). That model's treatment of syllable structure and its emergence therefore contrasts with the one proposed here; the conflict will be addressed further in the Discussion. At this point, however, we note that even if the frame-content view is correct in excluding a C-V distinction during the babbling stage, this distinction could still have evolved by the time the child is producing words with onset and coda consonants, which is the age we are simulating here.
3.1. Emergence of CV vs. VC structures

Learning in this model is accomplished by self-organization under conditions imposed both by intrinsic constraints on coordination and by attunement to the coordination patterns implicit in, and presumably recoverable from, the acoustic environment structured by the ambient language. A Hebbian learning model was employed, in a manner similar to that used by Oudeyer (2006) to model the emergence of discrete phonetic units. The simulation includes a child agent and an adult agent. Both have a probabilistic representation of the distribution of intended relative phasing between a pair of gestures, and this representation evolves over time. The adult representation includes modes corresponding to CV (in-phase) and VC (anti-phase), where the relative strength of these modes can differ from language to language, corresponding to the relative frequency of CV and VC structures in that language. At the onset of learning, the child's representation does not include any CV or VC modes, so the child displays no preference for producing any relative phases over others; the relative phases produced are randomly distributed. As a result of the learning process, modes develop that correspond to the modes found in the adult speaker's representation and to their relative strength. What we predict is that even though the child will develop the same modes as the adult partner, the CV mode will develop faster than the VC mode, regardless of the ultimate relative strength of the modes.
Figure 3. Self-organization learning model for emergence of CV and VC structures.
The learning simulation proceeds as follows (Fig. 3). On a given learning iteration, the child randomly selects a relative phase value, ψSEL, to produce from its evolving distribution of relative phases, and a single-well intended potential function with a minimum at that value is added to the double-well (HKB) intrinsic potential function to create a composite potential function. The agent then plans the production of a pair of gestures by using the composite potential to specify the coupling function between a pair of corresponding planning oscillators. Oscillator motion is initialized with a random pair of initial phases, and the oscillator motions settle into a stabilized relative phase, ψOUT, in accordance with the shape of the composite potential. The child then produces a pair of gestures with relative phase ψOUT; the child also compares ψOUT to the (veridically) perceived relative phase of an utterance token produced by the adult, ψADULT. If the difference between these two relative phases falls within a criterion tolerance, the child tunes its relative phase density distribution to increase the likelihood of producing that phase, ψOUT, again. The details of the model are described below.

3.1.1. Phase representation and selection model

The target relative phase parameter values of the inter-oscillator coupling function between C and V gestures are represented (for both the child and the adult agents) by a set of virtual ("neural") units, ψi, each of which represents some value of the relative phase parameter. At the outset of the
simulation, values of relative phase are assigned to these neural units in one-degree increments from −179° to 360°; thus ψ1 = −179°, ψ2 = −178°, … ψ540 = 360°, for a total of 540 units.¹ On a given learning trial, one of the 540 units is selected and its relative phase value, ψSEL, is used as the agent's intended relative phase. Since, at the beginning of the simulation, the units' relative phase values are uniformly distributed across the relative phase range, the value of ψSEL will be completely random. As learning progresses through attunement (section 3.1.3 below), the values of the neural units will come to be clustered, with most units having values near 0° or 180°. We will represent this clustering at various points in the simulation by plotting a frequency histogram showing the number of units as a function of relative phase. We will refer to this distribution as the density distribution, and to the number of units sharing a value of relative phase as its density. The density distribution will develop peaks at 0° and 180°; therefore, since ψSEL values are chosen by randomly sampling the density distribution, the values of ψSEL will also tend to be either 0° or 180° as the distribution develops over the course of learning.

3.1.2. Planning and production model

Once an intended relative phase, ψSEL, is selected by the child, it is used to construct the system's composite potential function, which will shape the evolution of inter-oscillator relative phase between the V and C gestures. To model the fact that not all relative phase patterns can be learned or produced equally easily, not only does the intended single-well relative phase potential contribute to the shape of this composite potential, but so does the HKB double-well potential function (see section 2.2) that represents the intrinsic modes of coupling two oscillators. The intended potential, Pintended, is modeled by a cosine function whose minimum is at ψSEL and whose peaks have been flattened (see "intended potential" inset in Fig. 3), according to the following expression:
Pintended(ψ) = −δ [1 / cosh((α cos(ψ − ψSEL) − 1)²)] α cos(ψ − ψSEL)    (1)
where α = 1 and δ varies between 0.5 and 1.0 according to the value of the density distribution at ψSEL (see Equation 2 below). The intended potential function is added to the HKB intrinsic potential function to build the composite potential function. The relative contribution of the intended potential
in the composite should depend on how well learned, or well-practiced, the intended pattern is. To implement this, the coupling strength associated with the intended potential, δ (Equation 1), is scaled between 0.5 and 1.0 according to the density (i.e., the number of relative phase neural units) defined for ψSEL in the child's evolving, experience-dependent probability density distribution for relative phase (section 3.1.1). The coupling strength associated with the intrinsic HKB potential is defined as (1 − δ). δ is a standard logistic squashing function that takes the density at ψSEL, D(ψSEL), as its argument, as defined in equation (2):
δ = β / (1 + e^(−λ(D(ψSEL) − D0))) + γ    (2)
In this equation, β = 0.5, γ = 0.5, λ = 0.15, and D0 = 70. This function is shown in Fig. 4.
Figure 4. Logistic squashing function relating strength of intended coupling, δ, to probability density of ψSEL, D(ψSEL).
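The pieces defined so far can be combined in a short numerical sketch. The code below is our own simplification, not the authors' implementation: the paper settles coupled planning oscillators under task dynamics, whereas here we simply do gradient descent on the composite potential, and we orient the logistic of Equation (2) so that δ rises with density, matching the text's description of δ scaling from 0.5 to 1.0 with practice:

    import numpy as np

    def intended_potential(psi, psi_sel, alpha=1.0, delta=1.0):
        # Flat-topped single-well potential of Eq. (1), minimum at psi_sel.
        c = alpha * np.cos(psi - psi_sel)
        return -delta * (1.0 / np.cosh((c - 1.0) ** 2)) * c

    def coupling_strength(density, beta=0.5, gamma=0.5, lam=0.15, d0=70.0):
        # Logistic squashing of Eq. (2), rising from 0.5 toward 1.0 as the
        # density D(psi_SEL) grows (an assumption about the sign convention).
        return beta / (1.0 + np.exp(-lam * (density - d0))) + gamma

    def settle_relative_phase(psi0, psi_sel, density, steps=4000, dt=0.01):
        # Plain gradient descent on the composite potential (our shortcut
        # for the task-dynamic settling of the planning oscillators).
        delta = coupling_strength(density)
        def composite(x):
            hkb = -np.cos(x) - np.cos(2.0 * x)  # intrinsic HKB term, a = b = 1
            return intended_potential(x, psi_sel, delta=delta) + (1.0 - delta) * hkb
        psi, eps = psi0, 1e-4
        for _ in range(steps):
            grad = (composite(psi + eps) - composite(psi - eps)) / (2.0 * eps)
            psi -= dt * grad
        return psi  # steady-state relative phase, psi_OUT

    # A weakly practiced intention near 90 deg is pulled far toward the
    # intrinsic in-phase mode; a well-practiced one is realized almost as is.
    print(np.degrees(settle_relative_phase(np.radians(80.0), np.radians(90.0), 10)))
    print(np.degrees(settle_relative_phase(np.radians(80.0), np.radians(90.0), 200)))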
Once the composite potential is specified for the planning simulation, the initial phases of the planning oscillators are chosen at random; over the course of the simulation, the oscillators settle to a stable relative phase under the control of the composite potential function. This final, steady-state relative phase value, ψOUT, is thus determined both by the landscape of the composite potential and the basin selected by the randomly chosen initial
conditions. As a result, the produced ψOUT may correspond neither to the intended relative phase, ψSEL, nor to the in-phase or anti-phase modes intrinsic to the HKB potential. This final relative phase of the planning oscillators could then be used to trigger the activation of C and V gestures, but in this agent model the simulation simply stops once the oscillators settle into their final, steady-state pattern.

3.1.3. Attunement model

The attunement of the child to the language environment is modeled by comparing the child's produced relative phase (ψOUT) on a given learning cycle to a relative phase randomly sampled from the adult probability distribution, ψADULT. If the child's produced value matches the randomly chosen adult value such that |ψOUT − ψADULT| < 5°, then the intended phase used by the child on that trial (ψSEL) is gated into the tuning (or learning) process. Tuning occurs as the units of phase representation whose values (ψi) are similar to ψSEL respond by increasing their level of activation as a function of the proximity of ψi and ψSEL in phase space, ψi − ψSEL. Specifically, the "receptive field" of each unit i is a Gaussian function of ψi with mean ψSEL and standard deviation σ = 40°, as described in (3):

Gi = [1 / (σ √(2π))] e^(−(1/2)((ψi − ψSEL)/σ)²)    (3)

The values of all the units, ψi, are then attracted toward ψSEL in proportion to their activation levels, according to the parameter-dynamics equation in (4), where ψi′ is the new unit value and r is a learning rate parameter (equal to 1 in this simulation):

ψi′ = ψi − r Gi (ψi − ψSEL)    (4)

The result of this parameter-dynamic tuning process is an evolution of the density distribution of units along the relative phase continuum. The example in Fig. 5 shows the initial uniform state of the child's density distribution and the effect on the units of gating a value of ψSEL = 2° into the tuning process. Tuning ends one cycle of the phase learning process.
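A compact sketch of one tuning cycle follows (again our own code, not the authors'; the boundary padding and phase wrapping described in note 1 are omitted for brevity):

    import numpy as np

    units = np.arange(-179.0, 361.0)   # 540 unit values, one-degree steps

    def tune(units, psi_sel, sigma=40.0, r=1.0):
        # Gaussian receptive field of each unit around psi_sel, Eq. (3) ...
        g = np.exp(-0.5 * ((units - psi_sel) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        # ... attracts every unit value toward psi_sel, Eq. (4).
        return units - r * g * (units - psi_sel)

    # Gating psi_SEL = 2 degrees into the tuning process (cf. Fig. 5) pulls
    # nearby unit values slightly toward 2; over many gated trials, density
    # piles up around the modes sampled from the ambient language.
    units = tune(units, psi_sel=2.0)
    print(units[:5])   # distant values barely move; values near 2 move most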
Figure 5. Visualization of tuning (learning) process in self-organization model
The adult's phase distribution was varied across simulations to model ambient languages with different properties. For example, both English and Spanish exhibit a preference for onsets (CV) over codas (VC) in production, but coda consonants are more frequently produced in English than in Spanish. This kind of asymmetry can be expressed by the difference between the phase distributions of English and Spanish adult speakers, with the probability of units clustered in the anti-phase region being higher for English than for Spanish. In the simulations presented here, three different hypothetical languages with different probabilities of in-phase and anti-phase modes were tested: a) CV>VC (in-phase = .6, anti-phase = .4); b) CV=VC (in-phase = anti-phase = .5); and c) CV<VC (in-phase = .4, anti-phase = .6).

[…] [Figure caption fragment: … top right: CV=VC; bottom left: CV<VC …] […]

Then, in order to make the simulation somewhat dependent on the goodness of the match between the correctly ordered CCs, the learning rate (r in equation 4) was set to be proportional to the inverse of the relative phase mismatch between output and adult:

r = min(a / |ψCCADULT − ψCCOUT|, 3)    (5)

where a = 20.

3.2.2. Results

Results of the CCV and the VCC simulations are shown in Fig. 8. The density of the CC mode grows much more quickly in the VCC simulation than in the CCV simulation. We assume that until stable sequential coupling of CC is acquired, phasing to the vowel will result in multiple Cs being produced synchronously.
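For concreteness, equation (5) can be written as a small function (a hypothetical sketch; the guard against a zero mismatch is our addition):

    def cc_learning_rate(psi_cc_adult, psi_cc_out, a=20.0, cap=3.0):
        # Eq. (5): r = min(a / |psi_CC_ADULT - psi_CC_OUT|, cap), a = 20.
        # max(..., 1e-9) guards the exact-match case (our addition).
        return min(a / max(abs(psi_cc_adult - psi_cc_out), 1e-9), cap)

    print(cc_learning_rate(120.0, 100.0))  # mismatch 20 deg -> r = 1.0
    print(cc_learning_rate(120.0, 118.0))  # mismatch  2 deg -> r capped at 3.0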
Figure 8. Density of CC anti-phase mode, as a function of iteration number for CCV and VCC simulations.
Therefore, they will not be readily perceivable in the child’s output, and the child would be described as not producing clusters of the relevant type. Thus, the model predicts that we should perceive children as producing VCC structures before CCV structures, because CC coupling stabilizes earlier in VCC structures.
4. Discussion

In summary, the results presented here show that it is possible to model the course of acquisition of CV vs. VC and CCV vs. VCC structures as emerging from a self-organized process, given three basic hypotheses that form the boundary conditions for that process: 1) syllable structure can be modeled in terms of modes of coupling in an ensemble of gestural planning oscillators; 2) infants come to the learning process with very generic constraints that predispose them toward producing in-phase and anti-phase coordinations between pairs of gestures; and 3) infants attune their action patterns to those they perceive in the ambient language environment. Our results are striking because the seemingly contradictory acquisition trends in the emergence of onsets and codas with single Cs vs. C clusters follow from the same principle in this model: the relatively greater strength of in-phase than of anti-phase coupling.

There are, of course, many limitations to the type of modeling presented here. One major limitation is that we do not provide an explicit account of how the child agent is able to extract the relevant phase information from the articulatory and acoustic patterns that result from adults' phasing patterns. Behavioral evidence across several domains shows that sensory information must make contact in some common form with motor plans (evidence for a 'common currency' in speech gestures: Goldstein & Fowler, 2003; more generally, a 'common coding' principle in action systems: Prinz, 1997; Galantucci et al., 2006), and the discovery of mirror neurons (Rizzolatti et al., 1988) has made this notion seem more biologically tractable. However, it would certainly strengthen the kind of simulations presented here if we could show how this is accomplished in the case of a relatively abstract property like phase.³

The cluster simulation has a more specific limitation. For reasons discussed in section 3.2, we assumed in both the CCV and VCC learning simulations that both Cs are identically coupled to the V: in-phase in CCV, and anti-phase in VCC. This was appropriate, we argued, because we wanted to
assume complete parallelism between onsets and codas. We also wanted to show that the observed differences could emerge from the differential coupling strength of the in-phase vs. anti-phase modes alone. However, in our model the child acquires the coupling associated with a language in which coda Cs are not moraic, and we are left with the problem of how a child might acquire the pattern exhibited by English and other languages in which coda Cs are moraic. One possible answer is that, in this case, adult CC phasing in VCCs would presumably be 180° (as opposed to the 120° employed here), so perhaps the infant would never, in this case, attempt to use the VC coordination to produce the final coda C. The implications of this would have to be tested in further simulations.

We should also consider how the model would handle cases (like Dutch) where some children are reported to acquire onset clusters before coda clusters. Note that the effect obtained in our model crucially depended on the fact that the mode associated with onsets is stronger than the one associated with codas. However, if the mode associated with codas were stronger at the time clusters begin to be acquired, then the model output would be reversed. A strong mode associated with codas could indeed develop in Dutch, given the possibility of relatively frequent coda clusters in child-directed speech (Levelt et al., 2000).

There are also predictions made by the model that could, in principle, be tested, although the relevant data are not yet available. One prediction is that CV structures should appear earlier than VC even in a language like Arrernte, in which CV structures would presumably be phonologically ill-formed, as Arrernte has been argued to have no onsets (Breen & Pensalfini, 1999). Such developmental data are not presently available. A second is that careful analysis of children's early productions of intended CCVs that are perceived by adult transcribers as CV should reveal cases in which both Cs are being produced, but synchronously. Testing this would require articulatory data from children that are also not presently available. Conversely, there are reported patterns of acquisition (e.g., Rose, this volume) against which our model has not yet been tested. For example, CCV is reported to be acquired earlier than CCVC. Testing such patterns would require developing a more complete (and much slower) model in which the syllable position modes and the cluster sequentiality mode are all allowed to evolve together.

Finally, we should consider alternative accounts of the distributional and developmental regularities of syllable structure. Ohala (1996) attributes the CV preference in languages to the perceptual robustness of initial versus
final Cs: particularly in the case of stops, acoustic information that affords perceptual recovery is more salient in CV position (e.g., the intensity of release bursts). While it is plausible that such differences form part of the explanation for the CV preference, it is not clear how this explanation could be extended to the developmental lag of CCV relative to VCC. For example, a stop-liquid onset cluster would still retain these desirable burst properties, yet such structures can in many languages be acquired later than liquid-stop coda clusters that lack these burst cues. The other major alternative model of the development of syllable structure is the frame-content model (e.g., MacNeilage, 1998; MacNeilage & Davis, 2000). That model hypothesizes that a syllable structure 'frame,' based on mandibular oscillation, develops before the 'content' provided by individual C or V gestures. While the model has some plausibility, arguments have been raised that MacNeilage and Davis' evidence for jaw-only oscillations in children's babbling (a preponderance of certain CV combinations) cannot by itself be used as evidence for a jaw-only strategy (Goldstein et al., 2006; Giulivi et al., 2006). Regardless of how that issue is resolved, it is not clear how the frame-content model would account for the pattern of results predicted by the coupled oscillator model: the earlier acquisition of CV compared to VC, but the earlier acquisition of VCC compared to CCV.
Notes

* We gratefully acknowledge support of the following NIH grants: DC-00403, DC-03663, and DC-03782.

1. Since a cycle of tuning is done by attracting neural units to an experienced stimulus, the density of the units can grow differently at the ends of the range covered by the units. The following two-step procedure was performed in order to prevent spurious mode growth at the boundaries, i.e., to eliminate boundary effects in the space of neural units. First, 179 units were added below in-phase (0°) and 180 units were added above anti-phase (180°), 0° and 180° being the predicted modes in this simulation. Second, prior to their use in the tuning process, phases were wrapped between −90° and 270°, i.e., phases less than −90° or greater than 270° were re-expressed as equivalent values within the interval [−90°, 270°], which is medially positioned within the range of the units.
2. We assumed that, at this stage of the simulations, the intrinsic HKB function is so weak relative to the intended ψSEL potentials that it can be ignored. When we experimented with adding a relatively strong intrinsic potential function to each of the ψSEL intended potential functions, the result was to create multiple attractors, due to the competitive structure of the onset and coda graphs used in the simulations. As a result of this multistability, the simulations could produce relative phase patterns that were linguistically inappropriate. It is interesting to speculate, however, that some of these inappropriate patterns may underlie the gestural misorderings observed developmentally during the acquisition of consonant sequences.

3. It is encouraging to note that some progress has been made along these lines in extracting syllabic phase (a continuously varying, normalized measure of temporal position within syllables) from speech acoustics (Hartley, 2002).
References

Bender, B. 1999. Marshallese grammar (Chapters 1-2). Ms., University of Hawaii.
Breen, G. & R. Pensalfini. 1999. Arrernte: A language with no syllable onsets. Linguistic Inquiry 30: 1-26.
Browman, C. P. & L. Goldstein. 1988. Some notes on syllable structure in articulatory phonology. Phonetica 45: 140-155.
Browman, C. P. & L. Goldstein. 1992. Articulatory phonology: An overview. Phonetica 49: 155-180.
Browman, C. P. & L. Goldstein. 1995. Gestural syllable position effects in American English. In F. Bell-Berti & L. Raphael (eds.), Producing Speech: Contemporary Issues, 19-33. New York: American Institute of Physics.
Browman, C. P. & L. Goldstein. 2000. Competing constraints on intergestural coordination and self-organization of phonological structures. Les Cahiers de l'ICP, Bulletin de la Communication Parlée 5: 25-34.
Byrd, D. 1995. C-Centers revisited. Phonetica 52: 263-282.
Chervela, N. 1981. Medial consonant cluster acquisition by Telugu children. Journal of Child Language 8: 63-73.
Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Clements, G. N. 1990. The role of the sonority cycle in core syllabification. In John Kingston & Mary Beckman (eds.), Papers in Laboratory Phonology I, 283-333. Cambridge: Cambridge University Press.
Clements, G. N. & S. J. Keyser. 1983. CV Phonology. Cambridge, MA: MIT Press.
de Boer, B. 2001. The Origins of Vowel Systems. Oxford: Oxford University Press.
de Jong, K. 2001a. Effects of syllable affiliation and consonant voicing on temporal adjustment in a repetitive speech production task. Journal of Speech, Language, and Hearing Research 44: 826-840.
de Jong, K. 2001b. Rate-induced resyllabification revisited. Language and Speech 44: 197-216.
Demuth, K. & E. J. Fee. 1995. Minimal prosodic words. Ms., Brown University and Dalhousie University.
Dodd, B. 1995. Children's acquisition of phonology. In B. Dodd (ed.), Differential Diagnosis and Treatment of Speech Disordered Children, 21-48. London: Whurr.
Fikkert, P. 1994. On the Acquisition of Prosodic Structure. Doctoral dissertation, University of Leiden, The Netherlands.
Galantucci, B., C. A. Fowler & M. T. Turvey. 2006. The motor theory of speech perception reviewed. Psychonomic Bulletin & Review 13(3): 361-377.
Giulivi, S., D. H. Whalen, L. M. Goldstein & A. G. Levitt. 2006. Consonant-vowel place linkages in the babbling of 6-, 9- and 12-month-old learners of French, English, and Mandarin. Journal of the Acoustical Society of America 119: 3421.
Goldstein, L. 2003. Emergence of discrete gestures. In Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, Spain, August 3-9, 2003. Universitat Autònoma de Barcelona.
Goldstein, L. & C. A. Fowler. 2003. Articulatory phonology: A phonology for public language use. In N. O. Schiller & A. S. Meyer (eds.), Phonetics and Phonology in Language Comprehension and Production, 159-207. Mouton de Gruyter.
Goldstein, L., D. Byrd & E. Saltzman. 2006. The role of vocal tract gestural action units in understanding the evolution of phonology. In Michael Arbib (ed.), From Action to Language: The Mirror Neuron System, 215-249. Cambridge: Cambridge University Press.
Goldstein, L., M. Pouplier, L. Chen, E. Saltzman & D. Byrd. In press. Dynamic action units slip in speech production errors. Cognition.
Gnanadesikan, A. 1996. Markedness and faithfulness constraints in child phonology. Ms., University of Massachusetts, Amherst.
Gordon, M. 2004. Syllable weight. In Bruce Hayes, Robert Kirchner & Donca Steriade (eds.), Phonetic Bases for Phonological Markedness, 277-312. Cambridge: Cambridge University Press.
Haken, H., J. A. S. Kelso & H. Bunz. 1985. A theoretical model of phase transitions in human hand movements. Biological Cybernetics 51: 347-356.
Haken, H., C. E. Peper, P. J. Beek & A. Daffertshofer. 1996. A model for phase transitions in human hand movements during multifrequency tapping. Physica D 90(1-2): 179-196.
Hartley, T. 2002. Syllabic phase: A bottom-up representation of the temporal structure of speech. In J. A. Bullinaria & W. Lowe (eds.), Proceedings of the 7th Neural Computation and Psychology Workshop: Connectionist Models of Cognition and Perception, 277-288. New York: World Scientific Press.
Honorof, D. N. & C. P. Browman. 1995. The center or edge: How are consonant clusters organised with respect to the vowel? In K. Elenius & P. Branderud (eds.), Proceedings of the XIIIth International Congress of Phonetic Sciences, vol. 3, 552-555. Stockholm: Congress Organisers at KTH and Stockholm University.
Kirk, C. & K. Demuth. 2003. Onset/coda asymmetries in the acquisition of clusters. In Barbara Beachley, Amanda Brown & Frances Conlin (eds.), Proceedings of the 27th Annual Boston University Conference on Language Development, 437-448. Somerville, MA: Cascadilla Press.
Kozhevnikov, V. A. & L. A. Chistovich. 1965. Speech: Articulation and Perception. English translation: U.S. Dept. of Commerce, Clearing House for Federal Scientific and Technical Information.
Levelt, C., N. Schiller & W. Levelt. 2000. The acquisition of syllable types. Language Acquisition 8: 237-264.
Lleó, C. & M. Prinz. 1995. Consonant clusters in child phonology and the directionality of syllable structure assignment. Journal of Child Language 23: 31-56.
Löfqvist, A. & V. L. Gracco. 1999. Interarticulator programming in VCV sequences: Lip and tongue movements. Journal of the Acoustical Society of America 105: 1854-1876.
Macken, M. A. 1977. Developmental reorganization of phonology: A hierarchy of basic units of acquisition. Papers and Reports in Child Language Development 4: 1-36.
MacNeilage, P. F. 1998. The frame/content theory of evolution of speech production. Behavioral and Brain Sciences 21: 499-511.
MacNeilage, P. F. & B. L. Davis. 1998. Evolution of speech: The relation between phylogeny and ontogeny. Paper presented at the Second International Conference on the Evolution of Language, London.
MacNeilage, P. F. & B. L. Davis. 2000. On the origin of internal structure of word forms. Science 288: 527-531.
McLeod, S., J. van Doorn & V. A. Reed. 2001. Normal acquisition of consonant clusters. American Journal of Speech-Language Pathology 10(2): 99-110.
Nam, H. In press. A competitive, coupled oscillator model of moraic structure: Split-gesture dynamics focusing on positional asymmetry. Laboratory Phonology 9.
Nam, H., L. Goldstein & E. Saltzman. Submitted a. Intergestural timing in speech production: The role of graph structure.
Nam, H., L. Goldstein & E. Saltzman. Submitted b. A dynamical model of gestural coordination.
Nam, H. & E. Saltzman. 2003. A competitive, coupled oscillator model of syllable structure. In Proceedings of the XVth International Congress of Phonetic Sciences, vol. 3, 2253-2256.
Ohala, J. J. 1996. Speech perception is hearing sounds, not tongues. Journal of the Acoustical Society of America 99: 1718-1725.
Oudeyer, P.-Y. 2002. The origins of syllable systems: An operational model. In J. Moore & K. Stenning (eds.), Proceedings of the 23rd Annual Conference of the Cognitive Science Society, 744-749. London: Lawrence Erlbaum Associates.
Oudeyer, P.-Y. 2005. The self-organization of speech sounds. Journal of Theoretical Biology 233: 435-449.
Oudeyer, P.-Y. 2006. Self-Organization in the Evolution of Speech. (Studies in the Evolution of Language.) Oxford: Oxford University Press.
Paul, R. & P. Jennings. 1992. Phonological behaviour in toddlers with slow expressive language development. Journal of Speech and Hearing Research 35: 99-107.
Pouplier, M. In press. Tongue kinematics during utterances elicited with the SLIP technique. Language and Speech.
Pouplier, M. & L. Goldstein. 2005. Asymmetries in the perception of speech production errors. Journal of Phonetics 33: 47-75.
Prinz, W. 1997. Perception and action planning. European Journal of Cognitive Psychology 9: 129-154.
Rizzolatti, G., R. Camarda, L. Fogassi, M. Gentilucci, G. Luppino & M. Matelli. 1988. Functional organization of inferior area 6 in the macaque monkey: II. Area F5 and the control of distal movements. Experimental Brain Research 71: 491-507.
Roark, B. & K. Demuth. 2000. Prosodic constraints and the learner's environment: A corpus study. In S. Catherine Howell, Sarah A. Fish & Thea Keith-Lucas (eds.), Proceedings of the 24th Annual Boston University Conference on Language Development, 597-608. Somerville, MA: Cascadilla Press.
Salidis, J. & J. S. Johnson. 1997. The production of minimal words: A longitudinal case study of phonological development. Language Acquisition 6: 1-36.
Saltzman, E. & D. Byrd. 2000. Task-dynamics of gestural timing: Phase windows and multifrequency rhythms. Human Movement Science 19: 499-526.
Saltzman, E., A. Löfqvist, B. Kay, J. Kinsella-Shaw & P. Rubin. 1998. Dynamics of intergestural timing: A perturbation study of lip-larynx coordination. Experimental Brain Research 123(4): 412-424.
Saltzman, E. & K. Munhall. 1989. A dynamical approach to gestural patterning in speech production. Ecological Psychology 1: 333-382.
Stetson, R. H. 1951. Motor Phonetics. Amsterdam: North-Holland.
Templin, M. 1957. Certain Language Skills in Children (Monograph Series No. 26). Minneapolis, MN: University of Minnesota, The Institute of Child Welfare.
Tuller, B. & J. A. S. Kelso. 1991. The production and perception of syllable structure. Journal of Speech and Hearing Research 34: 501-508.
Turvey, M. 1990. Coordination. American Psychologist 45: 938-953.
Vihman, M. M. & C. A. Ferguson. 1987. The acquisition of final consonants. In U. Viks (ed.), Proceedings of the Eleventh International Congress of Phonetic Sciences, 381-384. Tallinn, Estonia, USSR.
Watson, M. M. & G. P. Scukanec. 1997. Profiling the phonological abilities of 2-year-olds: A longitudinal investigation. Child Language Teaching and Therapy 13: 3-14.
Internal and external influences on child language productions

Yvan Rose

1. Introduction*

Over the past three decades, statistical approaches have been successfully used to explain how young language learners discriminate the sounds of their mother tongue(s), perceive and acquire linguistic categories (e.g. phonemes), and eventually develop their mental lexicon. In brief, input statistics, i.e. the relative frequency of the linguistic units that children are exposed to (e.g. phones, syllable types), appear to provide excellent predictors in the areas of infant speech perception and processing. This research offers useful insight into both the nature of the linguistic input that infants attend to and how they sort out the evidence from that input (see Gerken 2002 for a recent overview).

Building on this success, a number of linguists have recently proposed statistical explanations for patterns of phonological productions that were traditionally accounted for through typological universals, representational complexity, grammatical constraints and constraint rankings, or lower-level perceptual and articulatory factors. For example, Levelt, Schiller and Levelt (1999/2000) have proposed, based on longitudinal data on the acquisition of Dutch, that the order of acquisition of syllable types (e.g. CV, CVC, CCV) can be predicted from the relative frequency of occurrence of these syllable types in the ambient language. Following a similar approach, Demuth and Johnson (2003) have proposed that a pattern of syllable truncation resulting in CV forms attested in a learner of French was triggered by the high frequency of the CV syllable type in this language.

However, important questions need to be addressed before one can conclude that statistical approaches, or any mono-dimensional approach based on a single source of explanation, truly offer strong predictions for developmental production patterns. For example, one must wonder whether input statistics, which are mediated through the perceptual system and computed at the cognitive level, can have such an impact on production, given that production, itself influenced by the nature of phonological representations,
involves a relatively independent set of cognitive and physiological mechanisms, some of which are presumably independent of statistical processing.

In this paper, I first argue that while statistics of the input may play a role in explaining some phenomena, they do not make particularly strong predictions in general and, furthermore, simply cannot account for many of the patterns observed in early phonological productions. Using this as a stepping-stone, I then argue that the study of phonological development, like that of any complex system, requires a multi-dimensional approach that takes a relatively large number of factors into consideration. Such factors include perception-related representational issues, physiological and motoric aspects of speech articulation, influences coming from phonological or statistical properties of the target language and, finally, the child's grammar itself, which is constantly evolving throughout the acquisition period and, presumably, reacting or adapting itself to some of the limitations inherent to the child's immature speech production system. I conclude from this that any analysis based on a single dimension, be it statistical, perceptual or articulatory, among many others, restricts our ability to explain the emergence of phonological patterning in child language. To illustrate this argument, I discuss a number of patterns that are well attested in the acquisition literature, and argue that explanations of these patterns require the consideration of various factors, some grammatical, some external to the grammar itself.

The paper is organized as follows. In section 2, I discuss the predictions made by statistical approaches to phonological development, using the results from Levelt et al. (1999/2000) for exemplification purposes. In section 3, I confront these predictions with those made by more traditional approaches based on structural complexity and language typology. I introduce the approach favoured in this paper in section 4. In section 5, I discuss a series of examples that support the view that the acquisition of phonology involves a complex system whose sub-components may interact in intricate ways. I conclude with a brief discussion in section 6.
2. Statistical approaches to phonological productions: an example

Statistical approaches, when used to account for production patterns, make three main predictions, listed in (1). All other things being equal, they predict that the most frequent units found in the ambient language should appear first in the child's speech. As opposed to this, the least frequent units
should appear last. Finally, units of equivalent frequency are predicted to emerge during the same acquisition period (itself determined through relative frequency) but to display variation in their relative order of appearance.

(1) Statistical approaches to phonological development: predictions
    a. Frequent units: acquired early
    b. Infrequent units: acquired late
    c. Units with similar frequencies: variable orders of acquisition

A clear illustration of the predictions made by the statistical approach comes from Levelt et al. (1999/2000), who conducted a study of the acquisition of syllable types by twelve monolingual Dutch-learning children. Their main observations are schematized in (2). As we can see, all learners' first utterances were restricted to the four types of syllables that are the least complex (CV, CVC, V, VC). Following this, the learners took one of two different paths, defining groups A (nine children) and B (three children). During this second phase, the groups either acquired pre-vocalic clusters before post-vocalic ones (CCV > VCC) or vice versa (VCC > CCV). Finally, all learners acquired the more complex CCVCC syllable towards the end of the acquisition period.

(2) Acquisition of syllable types in Dutch (Levelt et al. 1999/2000)

                          Group A: CVCC > VCC > CCV > CCVC
                        ↗                                    ↘
      CV > CVC > V > VC                                        CCVCC
                        ↘                                    ↗
                          Group B: CCV > CCVC > CVCC > VCC
We can see in (3) that the four syllable types acquired early in (2) are also the most frequently occurring ones in Dutch. The following four types, which distinguish the two groups of learners in (2), display relatively similar frequencies of occurrence in the language. Finally, the last syllable type acquired by all children (CCVCC) is also the one that occurs with the lowest frequency in the language.

(3) Frequency of syllable types in Dutch (Levelt et al. 1999/2000)
    CV > CVC > VC > V > {CVCC ≈ CCVC ≈ CCV ≈ VCC} > CCVCC
The correlation between the relative frequency of syllable types in Dutch and their order of acquisition thus seems to provide support for Levelt et al.’s suggestion that the emergence of production patterns in child language can be predicted through input statistics. For example, both orders of appearance and the variability that we observe between groups A and B seem to correspond to the statistical facts observed. In the next section, however, I introduce an alternative perspective on these same data.
3. Statistical frequency or representational complexity?

In light of the above illustration, one could be tempted to extend the statistical approach to a larger set of phenomena observed in child language. For example, we could hypothesize that the development of syllable structure in a given language is essentially governed by input statistics. However, important issues remain to be addressed before we can jump to such a conclusion and favour the statistical approach over more traditional ones. Such approaches have indeed been successful at accounting for various phenomena in child language, for example the acquisition of multi-syllabic word shapes (and related truncation patterns), or that of syllable structure (e.g. Ferguson and Farwell 1975, Fikkert 1994, Demuth 1995, Freitas 1997, Pater 1997, Rose 2000). As was noted in the preceding section, the rate of acquisition of a given structure may be correlated with its frequency of occurrence in the target language. In contrast, an approach based on representational complexity predicts that the phonologically simplest units (e.g. singleton onsets) should be acquired before more complex units (e.g. complex onsets). However, in the case at hand (as well as, presumably, in most of the literature on the development of syllable structure), the frequency-based and the complexity-based approaches make essentially identical predictions because, as far as syllable types are concerned, the most frequent ones also tend to be the simplest. This is certainly the case in Dutch, where the four syllable types that were acquired first by all children in (2) are the ones that are the most frequent in (3), and also those that arguably show no complexity in their internal constituents.¹ From this perspective, we are at best witnessing a tie between the two approaches under scrutiny.

However, a further look at the data that enable a distinction in learning paths between groups A and B in (2) actually raises doubts about the
However, a further look at the data that distinguish the learning paths of groups A and B in (2) actually raises doubts about the predictive power of the statistical approach. Indeed, if we consider only the acquisition order of the four syllable types that differentiate the two groups of learners, which are deemed to have equivalent frequency values in the target language, the statistical approach predicts a total of 24 possible orders (4! = 4 × 3 × 2 × 1 = 24). Yet only two of these 24 potential learning paths are attested in the data, despite the fact that twelve children were included in the study. While one may be tempted to blame the relatively small population investigated for this, it is important to note that the two sequences attested correspond exactly to those that an approach based on phonological complexity would predict. Indeed, as mentioned above, the learners from group A acquired post-vocalic consonant clusters before complex onsets ([CVCC > VCC] >> [CCV > CCVC]), while the learners from group B followed the opposite path and acquired complex onsets before post-vocalic clusters ([CCV > CCVC] >> [CVCC > VCC]). However, none of the potential paths intertwining pre-vocalic clusters with post-vocalic ones is attested.

(4) Unattested patterns
    a. *CCV > CVCC > CCVC > VCC
    b. *VCC > CCV > CVCC > CCVC
    c. *…

Under the assumption that the representations of only two units have in fact been acquired (those for pre-vocalic versus post-vocalic clusters), but at different times, these data would suggest that a complexity-based approach enables both an accurate description of the data and an explanation for the non-attested acquisition paths. In contrast, the statistical approach overgenerates; it predicts many more learning paths than the ones attested.
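The combinatorics behind this argument can be checked mechanically. The following sketch (an editorial illustration, not part of the original study) enumerates the 24 orders and singles out those in which pre- and post-vocalic cluster types form uninterrupted blocks, as the complexity-based account requires; the two attested paths are of this kind, while all sixteen intertwined orders are unattested:

    from itertools import permutations

    PRE = {"CCV", "CCVC"}    # types containing a pre-vocalic cluster
    POST = {"CVCC", "VCC"}   # types containing a post-vocalic cluster

    def non_intertwined(path):
        """True if one cluster type is fully acquired before the other."""
        kinds = ["pre" if s in PRE else "post" for s in path]
        return kinds in (["pre", "pre", "post", "post"],
                         ["post", "post", "pre", "pre"])

    paths = list(permutations(PRE | POST))
    print(len(paths))                              # 24 (= 4!)
    print(sum(non_intertwined(p) for p in paths))  # 8 block-wise orders

    # The two paths actually attested are both block-wise:
    attested = [("CVCC", "VCC", "CCV", "CCVC"),    # group A
                ("CCV", "CCVC", "CVCC", "VCC")]    # group B
    assert all(non_intertwined(p) for p in attested)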
As rightly pointed out by an anonymous reviewer, if only two units (representations for pre- and post-vocalic clusters) need to be acquired by the children, then the syllable types containing a single ‘new’ unit (e.g. CVCC and VCC, both of which show a post-vocalic cluster) should be acquired during the same developmental stage (see also Fikkert 1994 and Pan and Snyder 2003 for related discussions). While the data description provided by Levelt et al. (1999/2000) does not enable a complete verification of this prediction, it certainly points in its direction. Three data points are discussed by Levelt et al., namely after the first, third, and sixth recording sessions. I address each of these data points in the following paragraphs.

After the first recording session, while most (eight of the twelve) children systematically failed to produce pre- or post-vocalic clusters, child David had CVCC but not VCC, Catootje had CVCC, VCC and CCV but not CCVC, Enzo had CCV, CCVC and CVCC but not VCC, while Leon had the four syllable types with complex constituents and only lacked CCVCC (the type also missing from all of the other children’s productions). While these results are relatively mixed, the productions (or absence thereof) from the first eight children fully support the current hypothesis, since they display no unsystematic gaps. Also, given that the data were naturalistically recorded, the few apparently unsystematic gaps in the other four children’s productions (e.g. the fact that both David and Enzo displayed CVCC but lacked VCC) may have occurred simply because the children did not attempt a particular syllable type. It is indeed likely that the sample available in the corpus underestimates the children’s true phonological abilities, since the non-occurrence of a given syllable type may simply be an artefact of data sampling, especially for the rarely occurring types in the language. This conjecture is in fact supported by Levelt et al. (1999/2000: 259), who show that VCC displays the second lowest frequency of all syllable types in Dutch, with a frequency value (1.03) only slightly above that of the CCVCC type (0.97). As opposed to this, the CVCC type shows a much higher relative frequency, at 5.51. Given these figures, we can hypothesize that both the VCC and CCVCC syllable types were very seldom attempted by Dutch-learning children. This empirical issue suggests that an approach considering attempted syllables, in addition to the attested ones, should have been favoured (see Pan and Snyder 2003 for further discussion).
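The sampling argument can be quantified under a simple editorial assumption: if each syllable a child attempts is an independent draw whose probability matches the adult relative frequency, the chance of a rare type never being attempted in a modest sample is substantial. The sample size of 200 syllables below is hypothetical; the frequency values are those quoted above from Levelt et al.:

    # Relative frequencies (percent) from Levelt et al. (1999/2000: 259).
    freq = {"CVCC": 5.51, "VCC": 1.03, "CCVCC": 0.97}

    n = 200  # hypothetical number of syllables sampled from one child

    for syll, pct in freq.items():
        p_absent = (1 - pct / 100) ** n   # chance the type never occurs
        print(f"P(no {syll} in {n} syllables) = {p_absent:.3g}")

    # CVCC is effectively guaranteed to appear (P(absent) is about 1e-5),
    # whereas VCC and CCVCC are each absent by chance in roughly 13-14%
    # of such samples, consistent with gaps being sampling artefacts.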
At the second data point, six children still had no complex constituents. One child, Tirza, had post-vocalic but no pre-vocalic clusters. Three children (David, Catootje and Leon) had both pre- and post-vocalic clusters but no CCVCC syllables, while child Eva had CVCC but not VCC. Finally, Enzo lacked VCC syllables yet displayed CVCC and CCVCC. Similar to the first data point, the apparently unsystematic gaps again come from the rarely occurring (and presumably rarely attempted) VCC and CCVCC syllable types. Aside from this issue, the patterns from this second data point reveal generally systematic behaviours, if taken from a representational complexity perspective. This latter observation is further reinforced by the third sample, where nine children (those from group A in (2)) show either post-vocalic or both pre- and post-vocalic clusters. Also, the rarely occurring CCVCC syllable type is only attested in the productions of children who independently displayed both clusters. Finally, of the three children from group B, two display pre-vocalic but no post-vocalic clusters, while the last one has the CCV but not the CCVC syllable type. This gap is the only one left unexplained by the complexity approach, but again without a means to verify whether that syllable type was even attempted by the child.

We can see from the above discussion that the vast majority of the observations lend support to an approach based on representational complexity, especially if one considers the possibility that the absence of a given cluster may be attributed to the fact that it was not attempted. Put in the larger context of linguistic universals, the representational approach advocated here also finds independent motivation in factorial typology. As reported by Blevins (1995), word-initial and word-final consonant clusters pattern in independent ways across languages. We can see in (5) that genetically unrelated languages such as Finnish (Finno-Ugric) and Klamath (Plateau Penutian) allow for post-vocalic but not pre-vocalic consonant clusters. As opposed to these, languages such as Mazateco (Oto-Manguean) and Sedang (North Bahnaric) allow for pre-vocalic clusters but ban post-vocalic ones.

(5) CC clusters across languages (Blevins 1995)
    a. Finnish, Klamath: CVCC but not *CCV
    b. Mazateco, Sedang: CCV but not *CVCC

An analysis of the distribution of these clusters requires a formal distinction between the two cluster types (pre- and post-vocalic), such that complexity can be allowed in one independently of the other. Under the view that children’s grammars are not fundamentally different from those of adults (e.g. Pinker 1984, Goad 2000, Inkelas and Rose 2008), children can acquire these clusters in various orders. Also, as predicted by an approach based on phonological complexity (as opposed to frequency), discontinuous learning paths such as the unattested ones in (4) should generally not occur.2

Finally, when we consider the issue of the predictive power of the statistical approach from a larger perspective, other questions arise as well. Child phonological patterns often have no direct correlates in the target languages being acquired (e.g. Bernhardt and Stemberger 1998).
These emergent patterns include, among many others, consonant harmony (e.g. gâteau ‘cake’ [gato] → [tato]; Smith 1973, Goad 1997, Pater 1997, Rose 2000, dos Santos 2007), velar fronting (e.g. go → [do]; Chiat 1983, Stoel-Gammon 1996, Inkelas and Rose 2008), segmental substitutions (e.g. vinger ‘finger’ [ˈvɪŋər] → [ˈsɪŋə]; Levelt 1994, Dunphy 2006), consonant cluster reductions (e.g. brosse ‘brush’ → [bɔs]; Fikkert 1994, Freitas 1997, Rose 2000), syllable truncations (e.g. banana → [bana]; Ferguson and Farwell 1975, Fikkert 1994, Pater 1997) and syllable reduplication (e.g. encore ‘again’ → [kɔkɔ]; Rose 2000). Because of their emergent nature, these processes cannot be predicted from the kind of statistical tendencies that would enable one to distinguish either languages or language learners from one another. While a certain relationship obviously exists between the manifestation of these processes and the sound patterns that compose the target language, this relationship typically relates to phonological or lower-level articulatory aspects of child language development, not statistics. Furthermore, the occurrence of a given process seems to be randomly distributed among the population of learners (e.g. Smit 1993). Despite some implicational relationships which have been argued for in the acquisition literature (e.g. Gierut and O’Connor 2002), no one can predict, given any population of learners, which children will or will not display a given process. Therefore, no direct relationship seemingly exists between emergent processes and the statistical properties of target languages. Note however that this claim does not rule out the possibility that specific statistics of the target language affect the actual manifestation of a process. As I will discuss further below, it is logical to think that a child may select a given segment or articulator as default because of its high frequency in the language. When taken together, the observations above suggest that while statistics of the input should not be dismissed entirely, they should only be taken as one of several factors influencing phonological productions. In the next section, I discuss a number of additional factors, all of which should also be considered.
4. A more encompassing proposal

In order to provide satisfactory explanations for the patterns observed in child language, I argue that one needs to consider the two general types of factors listed in (6), which may either manifest themselves independently or interact with one another in more or less complex ways in child phonological productions.
(6) Factors influencing child language phonological productions
    a. Grammatical (internal)
    b. Non-grammatical (external)

Approaching child language through (6a) is by no means a novel idea. It has been pervasive in the acquisition literature since the 1970s (see Bernhardt and Stemberger 1998 for a comprehensive survey) and in works on learnability (e.g. Dresher and van der Hulst 1995). However, in contrast to most grammatical analyses proposed in the literature, I propose to bring the study of early productions into a broader perspective, one that extends beyond grammatical considerations and incorporates factors that relate to perception, physiology, articulation and statistics, to name a few (see also Inkelas and Rose 2008, and Fikkert and Levelt 2008). In the next section, I discuss a series of phonological patterns observed in child language, some of which have been discussed extensively in the literature, often because of the analytical challenges they offer. I argue that each of these patterns lends support to the multi-dimensional approach advocated in this paper.
5. The multiple sources of phonological patterning in child language

I begin the discussion with the process of positional velar fronting, in 5.1, which highlights the interaction between grammatical and articulatory factors. In section 5.2, I discuss in turn a number of patterns that have been described as opaque chain shifts in the literature. I argue that these patterns are in fact opaque in appearance only. I propose that they are entirely predictable within a transparent grammatical system once we take into account the possible impacts of non-grammatical factors. Following a similar reasoning, I discuss, in section 5.3, a potential interaction between articulatory and statistically-induced pressures on the emergence of consonant harmony in a Dutch learner’s productions. Finally, in 5.4, I briefly highlight observations that would be difficult to explain through lower-level (e.g. articulatory or perceptual) influences. All of these observations point to strong grammatical influences on child language development. For reasons of space, I cannot offer the more comprehensive accounts that would ideally be required, nor a full consideration of issues such as variation, both within and across language learners.
My aim is thus limited here to suggesting what I consider to be a sensible approach to the data, leaving the fine details of analysis for future work.

5.1. Grammatically-induced systematic mispronunciations

Velar fronting consists of the pronunciation of target velar consonants as coronal (e.g. ‘go’ → [do]). What is peculiar about this process is that when it does not apply to all target velars, it affects velars in prosodically strong positions (e.g. in word-initial position or in word-medial onsets of stressed syllables; see (7a)) without affecting velars in weak positions (e.g. medial onsets of unstressed syllables, codas; see (7b)) (e.g. Chiat 1983, Stoel-Gammon 1996, Inkelas and Rose 2008).

(7) Positional velar fronting (data from Inkelas and Rose 2008)
    a. Prosodically strong onsets
       ‘cup’      1;09.23  [ˈtʰʌp]
       ‘go’       1;10.01  [ˈdoː]
       ‘again’    1;10.25  [əˈdɪn]
       ‘hexagon’  2;02.22  [ˈhɛksəˌdɔn]
    b. Prosodically weak onsets; codas
       ‘monkey’   1;08.10  [ˈmɑŋki]
       ‘bagel’    1;09.23  [ˈbejgu]
       ‘book’     1;07.22  [bʊkʰ]
       ‘padlock’  2;04.09  [ˈpædjɔk]
As discussed by Inkelas and Rose (2008), positional velar fronting is, on the face of it, theoretically unexpected, because positional neutralization in phonology generally occurs in prosodically weak, rather than strong, positions. Taking this issue as their starting point, Inkelas and Rose offer an explanation that incorporates both an articulatory and a grammatical component. The articulatory component of their explanation relates to the fact that young children are equipped with a vocal tract that is different in many respects from that of an adult, as illustrated in (8).
(8) Child vocal tract
Source: http://www.ling.upenn.edu/courses/Fall_2003/ling001/infant.gif
Inkelas and Rose emphasize the facts that (a) the hard palate of children is proportionally shorter than that of adults and (b) the tongue is proportionally larger and its mass is located in a more frontal area of the vocal tract. Adult vocal tract shapes and proportions are attained between six and ten years of age (e.g. Kent and Miolo 1995, Ménard 2002). In addition, young children do not possess the motor control abilities that adult speakers generally take for granted (e.g. Studdert-Kennedy and Goodell 1993). These differences in vocal tract shape and control, Inkelas and Rose argue, are not without consequences for the analysis of early phonological productions. Certain sounds and sound combinations are inherently more difficult to produce for children than for adults. This is particularly evident in the acquisition of phonological contrasts that involve lingual articulations. For example, in languages like English, in which we find a contrast between /s/ and /θ/ (e.g. sick /sɪk/ ~ thick /θɪk/), this contrast is often acquired late (e.g. Smit 1993, Bernhardt and Stemberger 1998). In addition, it is often the case that young children across languages show frontal lisp-like effects (e.g. /s/ → [θ]). The relative size and frontness of the tongue body, compounded by imperfect control of motor abilities, may both be at least partly responsible for the emergence of this phenomenon. Coming back to positional velar fronting, Inkelas and Rose further argue that the positional nature of this phenomenon is not simply the result of articulatory pressures; it also has a significant grammatical component. It is well known that speech articulations are more emphasized in prosodically strong positions such as word-initial position or stressed syllables (e.g. Fougeron and Keating 1996). It is also well known that children’s developing grammars are particularly sensitive to the prosodic properties of their target language (e.g. Gerken 2002).
It follows from this that children should be faithful to the strengthening of speech articulations in prosodically strong positions that they identify in the adult language. Building on these observations, Inkelas and Rose propose that the children who display positional velar fronting are in fact attempting to produce these stronger articulations in prosodically strong contexts. However, because of the articulatory factors listed above, the strengthening of their target velars results in an articulation that extends too far forward, into the coronal area of the hard palate, yielding the fronted velars on the surface. Inkelas and Rose’s argument for the grammatical conditioning of positional velar fronting is further supported through another process observed in the same learner, that of positional lateral gliding, which takes place following the same strong/weak dichotomy of contexts as velar fronting, even though the articulatory underpinnings of gliding are completely independent from those of fronting. In both cases, the child is cosmetically unfaithful to target segments yet abides by strong requirements of the target grammar. This explanation has the advantage over previous analyses of reconciling the positional velar fronting facts with phonological theory, especially given that articulatory strengthening should occur in prosodically strong, not weak, positions.3 In the context of the current argument, it also provides a clear case where non-grammatical, articulatory factors can interact with developing grammatical systems to yield the emergence of systematic patterning in child language. In the next section, I address other patterns which may look suspicious from a grammatical perspective, as they suggest opacity effects in child grammars. I argue that once they are considered in their larger context, these apparently opaque processes can be explained in transparent ways.

5.2. Apparent chain shifts

A number of child phonological patterns that take the shape of so-called chain shifts have been considered cases of grammatical opacity in the literature, thereby posing theoretical and learnability problems (e.g. Smith 1973, Smolensky 1996, Bernhardt and Stemberger 1998, Hale and Reiss 1998, Dinnsen 2008). In line with Hale and Reiss’ (1998) suggestion that (apparent) chain shifts are not a problem for theories that consider both competence and performance, I argue that these patterns can in fact be seen as entirely transparent if one incorporates factors pertaining to speech perception and/or articulation into the analysis.
Consider first the data in (9). As we can see in (9a), the child produces the target consonant /z/ as [d] in words like puzzle. This process of stopping, often observed in child language data (e.g. Bernhardt and Stemberger 1998), may by itself be related to articulatory or motor factors such as the ones listed in the preceding section. However, as we can see in (9b), target /d/ is itself pronounced as [g] in words like puddle.

(9) Chain shift (data from Amahl; Smith 1973)
    a. puzzle /pʌzl/ → [pʌdɫ̩]  (/z/ → [d])
    b. puddle /pʌdl/ → [pʌgɫ̩]  (/d/ → [g]; *[d])

If the child were grammatically able to produce [d] in puzzle, why is it that he could not produce this consonant in puddle? Schematically, if A→B, then why B→C (and not *B→B)? This apparent paradox, previously discussed by Macken (1980), reveals the importance of another non-grammatical factor, that of perception, which may have indirect impacts, through erroneous lexical representations, on the child’s speech productions. As Macken argues, the child, influenced by the velarity of word-final [ɫ], perceived the /d/ preceding it in puddle as a velar consonant (/g/). Because of this faulty perception, he built a lexical representation for puddle with a word-medial /g/. The production in (9b) thus results from a non-grammatical, perceptual artefact which itself contributes to the emergence of a paradoxical production pattern. The paradox is only apparent, however; it is not inherent to the grammar itself.4

Another possibility for chain shifts emerges when both perceptual and articulatory factors conspire to yield phenomena that should be unexpected, at least from a strict grammatical perspective. An example of this, also from Smith (1973), is provided in (10) (see also Smolensky 1996 and Hale and Reiss 1998 for further discussion of this case). As we can see, /θ/ is realized as [f] (in (10a)), even though [θ] is used as a substitute for target /s/ (in (10b)).

(10) Circular chain shift (data from Amahl; Smith 1973)
     a. /θ/ → [f]  (thick /θɪk/ → [fɪk])
     b. /s/ → [θ]  (sick /sɪk/ → [θɪk])

Again here, why can the child not realize target /θ/ as such if [θ] is otherwise possible in output forms (from target /s/)? Consistent with the current approach, I argue that patterns such as the one in (10) should simply not be considered for grammatical analysis, because they arise from a conspiracy of independent factors, namely perception, which affects the building of lexical representations, and articulation, which yields surface artefacts in output forms.
First, the realization of /θ/ as [f] can arise from a perceptual problem caused by the phonetic similarity between these two segments. Indeed, the contrast between these two sounds is often neutralized by both first and second language learners, who tend to realize both consonants as [f] (e.g. Levitt, Jusczyk, Murray and Carden 1987, Brannen 2002). This phenomenon is peculiar because it involves consonants with different places of articulation. However, since /f/ and /θ/ are acoustically extremely similar (e.g. Levitt et al. 1987), the merger is not surprising: if the contrast cannot be perceived by the learner, it cannot be represented at the lexical level and, consequently, cannot be reproduced in production. Coming back to the examples in (10), the child thus perceives /θ/ as [f] and, consequently, lexically encodes a target word such as thick with a word-initial /f/ (/fɪk/). This enables an account of the assimilation observed in (10a). Second, if the same child has not yet mastered the precise articulation required for the production of /s/, which is realized as [θ] for reasons such as the ones mentioned in section 5.1, we obtain the second element of the apparent chain shift in (10b).

The examples discussed thus far highlight ways in which phonetic considerations may affect the child’s analysis of the ambient language, for example by imposing perceptually driven biases on lexical representations or articulatorily induced artefacts on speech production. Building on this argument, Hale and Reiss (1998) would further suggest, quite controversially, that examples such as this one basically discredit the study of child language phonology from a production perspective. I argue that Hale and Reiss are in fact making a move that is tantamount to throwing the baby out with the bath water. Contra Hale and Reiss, and in line with most of the researchers in the field of language development, I support the claim that the child’s developing grammatical system plays a central role in the production patterns observed, with the implication that productions are worthy of investigation in our quest to unveil the grammatical underpinnings of child language development. This position is further substantiated in the next two sections, where I discuss examples of processes that reveal more abstract aspects of phonological (grammatical) processing.
5.3. Interaction between cognitive and articulatory factors

Despite the criticisms formulated against statistical approaches in section 3, I reiterate that the argument of this paper is not about rejecting statistical influences altogether, but rather about incorporating them into the larger picture of what factors can influence grammatical development. This is especially true in cases where a given unit (e.g. a sound or syllable type) can be singled out as statistically prominent in the ambient language and thus selected by the learner’s grammar as representing a default value. As discussed in section 3, if this default option correlates with articulatory simplicity, then there is no easy way to firmly conclude which factor (statistical or articulatory) is the determining one. However, if the default option from a statistical perspective does not correlate with articulatory simplicity, then we should expect children to display variation between the two alternatives.

In this section, I discuss patterns of segmental substitution attested in the productions of Jarmo, a young learner of Dutch. We will see that when confronted with a sound class that he cannot produce, Jarmo opts for various production strategies, which themselves suggest a number of influences on his developing grammar. As Dunphy (2006) reports, Jarmo displays difficulties with the production of labial continuants (e.g. /f, v, ʋ, w/) in onsets. However, instead of producing these consonants as stops, a strategy that would appear to represent the simplest solution, his two most prominent production patterns consist of either substituting coronals for labial continuants or debuccalizing these consonants through the removal of their supralaryngeal articulator. Stopping occurs but is only the third most frequent strategy, as evidenced by the breakdown in (11).

(11) Realization of labial continuants in onsets (Dunphy 2006)
     Attempted forms        229
     Target-like             44   19%
     Coronal substitution    98   43%
     Debuccalization         34   15%
     Stopping                22   10%
     Velar substitution      11    5%
     Other                   19    8%

The two main strategies, coronal substitution and consonant debuccalization, are exemplified in (12a) and (12b), respectively.
(12) Examples of substitution strategies for labial continuants
     a. Coronal substitution            b. Debuccalization
        vis    [ˈvɪs]   → [ˈsɪʃ]           visje [ˈvɪʃə̟] → [ˈʔis͡jə]
        fiets  [ˈfits]  → [ˈtɪt]           willy [ˈʋɪli] → [ˈhili]
        vinger [ˈvɪŋər] → [ˈsɪŋə]          fiets [ˈfits] → [ˈʔiʃ]

In the face of these data, we must ask why the child favoured two strategies affecting the major place of articulation of the target consonants. It is also necessary to determine whether there is a formal relationship between coronals and laryngeals in the child’s grammar, given that both of them act as favoured substitutes for target labial continuants. First, the distribution of coronals in Dutch (as well as in many of the world’s languages; see contributions to Paradis and Prunet 1991) provides support for the hypothesis that the child can analyze them as default (statistically unmarked) consonants in the language. Indeed, coronals account for 55% of all onset consonants and 65% of all coda consonants in spoken Dutch (van de Weijer 1999). In addition, from the perspective of syllable structure, coronals are the only consonants that can occupy appendix positions in Dutch (see, e.g., Fikkert 1994 and Booij 1999 for summaries of the research on syllable structure in Dutch). From both statistical and distributional perspectives, coronals can thus appear to the learner as having a special, privileged status. Second, laryngeals are considered by many phonologists and phoneticians to be the simplest consonants from an articulatory perspective (e.g. Clements 1985). Indeed, these consonants do not involve any articulation in the supralaryngeal region of the vocal tract. Both coronals and laryngeals thus offer the child good alternatives, which manifest themselves in output forms.

5.4. Grammatical influences

Finally, the argument presented above would not be complete without a discussion of influences on the child’s productions that seem to be inherent to the grammatical system itself. Despite perceptual and articulatory effects such as the ones discussed in the preceding sections, several facts documented in the literature on phonological development strongly suggest the presence of general grammatical principles whose effects can be observed independently in language typology, as already discussed in section 3.
For example, while various combinations of perceptual and articulatory factors should yield fairly extensive variation between learners, even within the same target language, it is generally noted that variation is in fact fairly restricted. Also, several works attribute some of the variability observed between learners to differences in individual rates of acquisition rather than to actual discrepancies in grammatical analyses once the target phonological structure is mastered by the learners (e.g. Fikkert 1994, Levelt 1994, Freitas 1997, Goad and Rose 2004). In addition, relationships between various levels of phonological representation, for example the role of prosodic domains such as the stress foot, the syllable, or syllable sub-constituents in segmental patterning, all point towards clear grammatical influences over child language productions (e.g. contributions to Goad and Rose 2003; see also section 5.1 above). Note also that in the vast majority of the cases documented in the literature, the emerging properties of child language are grammatically similar to those of adult languages. There are also strong reasons to believe that apparent counter-examples to this generalization are more cosmetic than reflective of truly unprincipled grammatical patterns (e.g. Inkelas and Rose 2008), in the sense that these counter-examples derive from non-grammatical factors such as those discussed in the above subsections. Indeed, we can generally account for sound patterns in child language using theories elaborated on the basis of adult languages. This in itself implies a strong correspondence between the formal properties of developing grammars and those of end-state (adult) systems. This correspondence in turn reveals a set of grammatical principles that should be considered in analyses of child language productions. In this regard, it is also important to highlight the fact that most of the analyses proposed in the literature on phonological development require a certain degree of abstraction, one that extends beyond perception- or articulation-related issues such as the ones noted in the preceding sub-sections. While more observations should be added to this brief survey, we can reasonably conclude that despite the fact that child language is subject to non-grammatical influences, its careful study reveals a great many systematic properties. In turn, these properties can be used to formally characterize the stages that the child proceeds through while acquiring his/her target grammar(s).
6. Discussion

In this paper, I have discussed phonological patterns that offer strong empirical arguments against any mono-dimensional approach to phonological development, be it based solely on statistical, phonetic or grammatical considerations. I argued that an understanding of many developmental patterns of phonological production requires a multi-dimensional approach incorporating, among others, perceptual factors that can affect the elaboration of lexical representations, articulatory factors that can prevent the realization of certain sounds, as well as the phonological properties of the target language itself (e.g. phonological and phonetic inventories, distributions and statistics; prosodic properties). A consideration of these factors offers many advantages, including both the avoidance of the unnecessary analytical issues that true grammatical opacity would impose and, crucially, the explanatory power of the more transparent analyses proposed. As in all multi-factorial approaches, one of the main challenges lies in determining what factors are involved and how these factors interact to yield the outcomes observed in the data. For example, one important issue that was left open in this paper concerns the fact that while statistics of the input seem to play a central role in infant speech perception, such statistics appear to be only one of the many factors underlying patterns observed in speech production. The relationship between perception and production thus remains one that warrants further research. In order to tackle this issue, we should favour strong empirical, cross-linguistic investigations within which all of the languages involved would be compared on the basis of their distinctive linguistic properties. By combining the results obtained through such investigations with those from research on speech perception and articulation by children, we should be in a better position to improve our understanding of phonological development, from the earliest months of life through the most advanced stages of attainment.
Notes

*  Earlier versions of this work were presented during a colloquium presentation at the Universidade de Lisboa (May 2005), during the Phonological Systems and Complex Adaptive Systems Workshop at the Laboratoire Dynamique du Langage, Université Lumière Lyon 2 (July 2005) and at the 2006 Annual Congress of the Canadian Linguistic Association. I am grateful to all of the participants in these events for enlightening discussions, especially Peter Avery, Abigail Cohn, Christophe Coupé, Elan Dresher, Maria João Freitas, Sónia Frota, Sophie Kern, Alexei Kochetov, Ian Maddieson, Egidio Marsico, Noël Nguyen, François Pellegrino, Christophe dos Santos and Marina Vigário. I would also like to thank one anonymous reviewer for useful comments and suggestions. Of course, all remaining errors or omissions are my own.
1. One could argue that post-vocalic consonants in VC and CVC forms involve complexity at the level of the rhyme constituent. This position is however controversial; several authors have in fact noted asymmetrical behaviours in the development of word-final consonants and argued that these consonants cannot always be analyzed as true codas (rhymal dependents) in early phonologies and should be considered as onsets of empty-headed syllables (e.g. Rose 2000, 2003, Barlow 2003, Goad and Brannen 2003).
2. Of course, one should not rule out the possibility that a regression in the acquisition of consonant clusters yields one of the patterns in (4). The presumption here is that such regressions are unlikely to occur, especially in typically developing children (e.g. Bernhardt and Stemberger 1998).
3. As correctly noted by an anonymous reviewer, it is not clear whether the child analyses the strong and weak velars as allophones or as separate phonemes. This issue is however tangential to the analysis proposed.
4. An anonymous reviewer notes that there may be perceptual or articulatory factors involved in the pronunciation of /z/ as [d]. This point reinforces the argument of this paper about the need to entertain several potential factors in the analysis of child phonological data.
References

Barlow, Jessica
2003 Asymmetries in the Acquisition of Consonant Clusters in Spanish. Canadian Journal of Linguistics 48(3/4): 179-210.
Bernhardt, Barbara and Joseph Stemberger
1998 Handbook of Phonological Development from the Perspective of Constraint-Based Nonlinear Phonology. San Diego: Academic Press.
Blevins, Juliette
1995 The Syllable in Phonological Theory. In The Handbook of Phonological Theory, John A. Goldsmith (ed.). Cambridge, MA: Blackwell. 206-244.
Booij, Geert
1999 The Phonology of Dutch. Oxford: Oxford University Press.
Brannen, Kathleen
2002 The Role of Perception in Differential Substitution. Canadian Journal of Linguistics 47(1/2): 1-46.
Chiat, Shulamuth
1983 Why Mikey’s Right and My Key’s Wrong: The Significance of Stress and Word Boundaries in a Child’s Output System. Cognition 14: 275-300.
Clements, George N.
1985 The Geometry of Phonological Features. Phonology 2: 225-252.
Demuth, Katherine
1995 Markedness and the Development of Prosodic Structure. In Proceedings of the North East Linguistic Society, Jill N. Beckman (ed.). Amherst: Graduate Linguistic Student Association. 13-25.
Demuth, Katherine and Mark Johnson
2003 Truncation to Subminimal Words in Early French. Canadian Journal of Linguistics 48(3/4): 211-241.
Dinnsen, Daniel A.
2008 A Typology of Opacity Effects in Acquisition. In Optimality Theory, Phonological Acquisition and Disorders, Daniel A. Dinnsen and Judith A. Gierut (eds.). London: Equinox Publishing. 121-176.
dos Santos, Christophe
2007 Développement phonologique en français langue maternelle: Une étude de cas. Ph.D. Dissertation. Université Lumière Lyon 2.
Dresher, B. Elan and Harry van der Hulst
1995 Global Determinacy and Learnability in Phonology. In Phonological Acquisition and Phonological Theory, John Archibald (ed.). Hillsdale, NJ: Lawrence Erlbaum. 1-21.
Dunphy, Carla
2006 Another Perspective on Consonant Harmony in Dutch. M.A. Thesis. Memorial University of Newfoundland.
Ferguson, Charles and Carol B. Farwell
1975 Words and Sounds in Early Language Acquisition. Language 51: 419-439.
Fikkert, Paula
1994 On the Acquisition of Prosodic Structure. HIL Dissertations in Linguistics 6. The Hague: Holland Academic Graphics.
Fikkert, Paula and Clara Levelt
2008 How does Place Fall into Place? The Lexicon and Emergent Constraints in Children’s Developing Grammars. In Contrast in Phonology, Peter Avery, B. Elan Dresher and Keren Rice (eds.). Berlin: Mouton de Gruyter. 231-268.
Fougeron, Cécile and Patricia A. Keating
1996 Articulatory Strengthening in Prosodic Domain-initial Position. UCLA Working Papers in Phonetics 92: 61-87.
Freitas, Maria João
1997 Aquisição da Estrutura Silábica do Português Europeu. Ph.D. Thesis. University of Lisbon, Lisbon.
Gerken, LouAnn
2002 Early Sensitivity to Linguistic Form. In Annual Review of Language Acquisition, Volume 2, Lynn Santelmann, Maaike Verrips and Frank Wijnen (eds.). Amsterdam: John Benjamins. 1-36.
Gierut, Judith A. and Kathleen M. O’Connor
2002 Precursors to Onset Clusters in Acquisition. Journal of Child Language 29: 495-517.
Goad, Heather
1997 Consonant Harmony in Child Language: An Optimality-theoretic Account. In Focus on Phonological Acquisition, S. J. Hannahs and Martha Young-Scholten (eds.). Amsterdam: John Benjamins. 113-142.
2000 Phonological Operations in Early Child Phonology. SOAS colloquium talk. University of London.
Goad, Heather and Kathleen Brannen
2003 Phonetic Evidence for Phonological Structure in Syllabification. In The Phonological Spectrum, Vol. 2, Jeroen van de Weijer, Vincent van Heuven and Harry van der Hulst (eds.). Amsterdam: John Benjamins. 3-30.
Goad, Heather and Yvan Rose
2004 Input Elaboration, Head Faithfulness and Evidence for Representation in the Acquisition of Left-edge Clusters in West Germanic. In Constraints in Phonological Acquisition, René Kager, Joe Pater and Wim Zonneveld (eds.). Cambridge: Cambridge University Press. 109-157.
Goad, Heather and Yvan Rose (eds.)
2003 Segmental-prosodic Interaction in Phonological Development: A Comparative Investigation. Special Issue, Canadian Journal of Linguistics 48(3/4): 139-152.
Hale, Mark and Charles Reiss
1998 Formal and Empirical Arguments Concerning Phonological Acquisition. Linguistic Inquiry 29(4): 656-683.
Inkelas, Sharon and Yvan Rose
2008 Positional Neutralization: A Case Study from Child Language. Language 83(4): 707-736.
Kent, Ray D. and Giuliana Miolo
1995 Phonetic Abilities in the First Year of Life. In The Handbook of Child Language, Paul Fletcher and Brian MacWhinney (eds.). Cambridge, MA: Blackwell. 303-334.
Levelt, Clara
1994 On the Acquisition of Place. HIL Dissertations in Linguistics 8. The Hague: Holland Academic Graphics.
Levelt, Clara, Niels Schiller and Willem Levelt
1999/2000 The Acquisition of Syllable Types. Language Acquisition 8: 237-264.
Levitt, Andrea, Peter Jusczyk, Janice Murray and Guy Carden
1987 Context Effects in Two-Month-Old Infants’ Perception of Labiodental/Interdental Fricative Contrasts. Haskins Laboratories Status Report on Speech Research 91: 31-43.
Macken, Marlys
1980 The Child’s Lexical Representation: The ‘Puzzle-Puddle-Pickle’ Evidence. Journal of Linguistics 16: 1-17.
Ménard, Lucie
2002 Production et perception des voyelles au cours de la croissance du conduit vocal: variabilité, invariance et normalisation. Ph.D. Dissertation. Institut de la communication parlée, Grenoble.
Pan, Ning and William Snyder
2003 Setting the Parameters of Syllable Structure in Early Dutch. In Proceedings of the 27th Boston University Conference on Language Development, Barbara Beachley, Amanda Brown and Frances Conlin (eds.). Somerville, MA: Cascadilla Press. 615-625.
Paradis, Carole and Jean-François Prunet (eds.)
1991 The Special Status of Coronals: Internal and External Evidence. San Diego: Academic Press.
Pater, Joe
1997 Minimal Violation and Phonological Development. Language Acquisition 6(3): 201-253.
Pinker, Steven
1984 Language Learnability and Language Development. Cambridge, MA: Harvard University Press.
Rose, Yvan
2000 Headedness and Prosodic Licensing in the L1 Acquisition of Phonology. Ph.D. Dissertation. McGill University.
2003 Place Specification and Segmental Distribution in the Acquisition of Word-final Consonant Syllabification. Canadian Journal of Linguistics 48(3/4): 409-435.
Smit, Ann Bosma
1993 Phonologic Error Distribution in the Iowa-Nebraska Articulation Norms Project: Consonant Singletons. Journal of Speech and Hearing Research 36: 533-547.
Smith, Neilson
1973 The Acquisition of Phonology: A Case Study. Cambridge: Cambridge University Press.
Smolensky, Paul
1996 On the Comprehension/Production Dilemma in Child Language. Linguistic Inquiry 27: 720-731.
Stoel-Gammon, Carol
1996 On the Acquisition of Velars in English. In Proceedings of the UBC International Conference on Phonological Acquisition, Barbara H. Bernhardt, John Gilbert and David Ingram (eds.). Somerville: Cascadilla Press. 201-214.
Studdert-Kennedy, Michael and Elizabeth Goodell
1993 Acoustic Evidence for the Development of Gestural Coordination in the Speech of 2-year-olds: A Longitudinal Study. Journal of Speech and Hearing Research 36(4): 707-727.
van de Weijer, Joost
1999 Language Input for Word Discovery. Ph.D. Dissertation. Max Planck Institute.
Emergent complexity in early vocal acquisition: Cross-linguistic comparisons of canonical babbling

Sophie Kern and Barbara L. Davis

Phonetic complexity, as evidenced in speech production patterns, is based on the congruence of production system, perceptual, and cognitive capacities in adult speakers. Pre-linguistic vocalization patterns in human infants afford the opportunity to consider the first stages in the emergence of this complex system. The production system forms a primary site for considering determinants of early output complexity, as the respiratory, phonatory, and articulatory subsystems of human infants support the types of vocal forms observed in early stages as well as those maintained in the phonological systems of languages. The role of perceptual input from the environment in the earliest stages of infant learning of ambient language phonological regularities is a second locus of emergent complexity. Young infants must both attend to and reproduce regularities to master the full range of phonological forms in their language. Cross-linguistic comparisons of babbling in infants acquiring typologically different languages, including Dutch, Romanian, Turkish, Tunisian Arabic and French, are described to consider production-system-based regularities and early perceptually based learning supporting the emergence of ambient language phonological complexity.
1. Theoretical background

1.1. Common trends in babbling

Canonical babbling marks a seminal step into the production of syllable-like outputs in infants. Canonical babbling is defined as rhythmic alternations between consonant- and vowel-like properties, giving a percept of rhythmic speech that simulates adult output without conveying meaning (Davis & MacNeilage, 1995; Oller, 2000). These rhythmic alternations between consonants and vowels are maintained in adult speakers and form the foundation for complexity in languages (Maddieson, 1984). Longitudinal investigations of the transition from canonical babbling to speech have shown continuity between phonetic forms in infant pre-linguistic vocalizations and earliest speech forms (Oller, 1980; Stark, 1980; Stoel-Gammon & Cooper, 1984; Vihman et al., 1986).
This continuity supports the importance of considering canonical babbling as a crucial first step in the young child’s journey toward mastery of ambient language phonology.

Strong similarities in sound and utterance type preferences in canonical babbling across different language communities have been documented, suggesting a universal basis for babbling (Locke, 1983). For consonants, stop, nasal and glide manners of articulation are most frequently reported (Locke, 1983; Robb & Bleile, 1994; Roug et al., 1989; Stoel-Gammon, 1985; Vihman et al., 1985). Infants tend to produce consonants at the coronal and labial places of articulation (Locke, 1983), and few dorsals are noted (Stoel-Gammon, 1985). Vowels from the lower left quadrant of the vowel space (i.e. mid and low front and central vowels) are most often observed (Bickley, 1983; Buhr, 1980; Davis & MacNeilage, 1990; Kent & Bauer, 1985; Lieberman, 1980; Stoel-Gammon & Harrington, 1990).

The phenomenon of serial ordering is one of the most distinctive properties of speech production in languages (Maddieson, 1984). In a typical utterance, consonants and vowels do not appear in isolation but are produced serially. Within-syllable patterns for contiguous consonants and vowels provide a site for considering the emergence of complexity in utterance structures, as rhythmic consonant-vowel syllables typically emerge at around 8-9 months; in previous stages infant vocalizations do not exhibit rhythmic syllable-like properties (see Oller, 2000, for a review). Three preferred within-syllable co-occurrence patterns have been reported in studies of serial properties: coronal (tongue tip closure) consonants with front vowels (e.g. /di/), dorsal (tongue back closure) consonants with back vowels (e.g. /ku/), and labial (lip closure) consonants with central vowels (e.g. /ba/). These widely observed serial patterns are predicted by the Frame Content hypothesis (MacNeilage & Davis, 1990), which proposes that the tongue does not move independently from the jaw within syllables, but remains in the same position for the consonant closure and the open or vowel portions of rhythmic cycles. Within-syllable consonant-vowel characteristics are thus based on rhythmic close-open jaw cycles, without movement of articulators independent of the jaw. In studies of 6 English-learning infants during babbling (Davis & MacNeilage, 1995) and 10 infants during the single word period (Davis et al., 2002), all three co-occurrences predicted by the Frame Content perspective were found at above-chance levels; other potential co-occurrences did not occur above chance.
Evidence for these serial patterns has also been found in analyses of 5 French, 5 Swedish and 5 Japanese infants from the Stanford Child Language database (Davis & MacNeilage, 2000), 2 Brazilian-Portuguese learning children (Teixeira & Davis, 2002), 7 infants acquiring Quechua (Gildersleeve-Neumann & Davis, 1998) and 7 Korean-learning infants (Lee, 2003).

Some counterexamples to these CV co-occurrence trends have been reported (Boysson-Bardies, 1993; Oller & Steffens, 1993; Tyler & Langsdale, 1996; Vihman, 1992). However, most differences in outcome may result from methodological differences. A labial-central association in initial syllables was shown by Boysson-Bardies (1993) for French, Swedish and Yoruba infants but not for English: the English-learning infants in her study preferred the labial-front association. However, Boysson-Bardies analyzed the first and second syllables of utterances separately, resulting in very small databases for statistical analysis. Oller and Steffens (1993) evaluated their results against the expected frequencies of consonants. They did not include expected frequencies of vowels, complicating comparison of results. The three predicted co-occurrences were observable in Tyler and Langsdale’s (1996) data if the small numbers of observations in the three age groups studied were pooled. An alveolar-front association was not found in 3 English-learning and 2 Swedish-learning subjects by Vihman (1992). However, she counted /æ/ as a central vowel, also complicating the interpretation of her results relative to the predicted CV co-occurrences.

Vocalization patterns across syllables are also important for considering the emergence of vocal complexity. In languages, most words contain varied consonants and vowels across syllables; phonological reduplication, or repetition of the same syllable, is infrequent (Maddieson, 1984). In contrast, two types of canonical babbling in pre-linguistic infants have been described: reduplicated and variegated. Reduplicated or repeated syllables (e.g. /baba/) account for half or more of all vocal patterns in babbling and more than half of early word forms (Davis et al., 2002). In variegated forms, infants change vowels and/or consonants in two successive syllables (e.g. /babi/ or /bada/). Several studies have shown concurrent use of both reduplication and variegation during babbling (Mitchell & Kent, 1990; Smith, Brown-Sweeney & Stoel-Gammon, 1989). In variegated babbling, more manner than place changes for consonants (Davis & MacNeilage, 1995; Davis et al., 2002) and more height than front-back changes for vowels have been shown during babbling and first words (Bickley, 1983; Davis & MacNeilage, 1995; Davis et al., 2002).
The preference for manner changes for consonants and height changes for vowels is consistent with the Frame Content hypothesis (MacNeilage & Davis, 1990). As patterns are based on rhythmic jaw oscillations without independent tongue movement, a predominance of manner and height changes over place and front-back changes is predicted when successive syllables show different levels of jaw closure.

1.2. Early ambient language effects

Infants exhibit abilities to learn rapidly from language input regularities as early as 8-10 months, based on responses in experimental lab settings (e.g. Saffran et al., 1996; Werker & Lalonde, 1988). It has also been proposed that learning from ambient language input may influence and shape vocalization preferences in the late babbling and/or first word periods. The appearance of ambient language influences in production repertoires has been examined for utterance and syllable structures (Boysson-Bardies, 1993; Kopkalli-Yavuz & Topbaş, 2000), vowel and consonant repertoires and distributions (Boysson-Bardies et al., 1989 and 1992), as well as CV co-occurrence preferences (e.g. Lee, 2003).

Some studies of the early appearance of ambient language regularities have focused on adult capacities for perceiving differences between children from different language environments. Thevenin et al. (1985) failed to find support for adults’ ability to discriminate the babbling of 7 to 14 month old English- and Spanish-learning infants. However, their stimuli consisted of short 1 to 3 sec stretches of canonical babbling. Boysson-Bardies et al. (1984) presented naïve adults with sequences of early babbling of French, Arabic and Cantonese infants. Participants were asked to identify the babbling of the French infants. Listeners were correct in judging 70% of the tokens, suggesting that babbling in the pre-linguistic period may exhibit perceptually apparent ambient language characteristics. Adults were able to correctly identify language differences at 6 and 8 months, but not in the babbling of 10 month olds. According to Boysson-Bardies et al. (1984), this result could be explained by stimulus differences: stimuli from 10 month olds showed “less consistency” in intonation contours. Despite discrepancies in results, where adults were less accurate listening to older infants, these perceptual studies suggest a potential role of prosodic cues in adult listeners’ abilities to judge the language background of young infants.
Other studies targeting acoustic and phonetic properties of infants’ babbling output have provided some support for early ambient language learning. Boysson-Bardies et al. (1989) compared vocalizations of French, English, Cantonese and Algerian 10 month olds. Based on the computation of “mean vowels” (i.e. mean F1 and F2), they proposed that the acoustic vowel distribution was significantly different for the 4 language groups. There was also “close similarity” between infant and adult vowels in each of the four linguistic communities. Boysson-Bardies, Hallé, Sagart & Durand (1992) also suggested an early influence of the language environment on consonants in the four languages. They found significant differences in the distribution of place and manner of articulation across the four languages. Stop consonants represented the largest proportion for all infants. From 10 months, French infants produced fewer stops than American and Swedish infants. Levitt & Utman (1992) compared one French- and one English-learning infant. English shows higher frequencies of fricatives, affricates and nasals than French; approximants are more frequent in French than in English. Each infant’s consonant inventory moved toward their own ambient language in composition and frequency; both infants showed the closest match to the ambient frequencies at 5 months. The French child also favored low front vowels and the English child preferred mid central vowels, consistent with frequencies in their ambient languages. The study reported on a very small sample of data for the two children, however, complicating generalization of results on the timing of early ambient language learning. In general, available studies are limited in the size of the databases and number of participants, so conclusions must be considered as needing further confirmation.

Strongly consistent trends in production patterns as well as preliminary indications about the timing of learning from ambient language input are apparent. However, empirical investigations of early ambient language learning do not provide strong evidence, due to methodological issues (e.g. adult perceptual studies vs. infant production patterns, amount of data analyzed, age of observation, number of participants, longitudinal vs. cross-sectional data collection, and use of perception-based phonetic transcription vs. acoustic analysis). To evaluate the emergence of early learning from the ambient language more fully, the issue must be considered in the context of common production patterns seen across languages. Larger cohorts of children in varied language environments illustrating diverse ambient language targets are necessary. Consistent data collection and analysis procedures are also essential to comprehensively evaluate this question.
2. Predictions

In this work, a uniform analysis profile on large corpora for five different languages is imposed, with the goal of understanding the timing of emergence and precise characteristics of ambient language learning in the context of reports on common production trends. Predictions based on common trends will be tested as follows:

There will be a significantly higher proportion of:
• stop, nasal and glide consonant manner of articulation,
• coronal and labial consonant place of articulation,
• mid and low front and central vowels.

Within-syllable consonant vowel co-occurrences will show a significant tendency for:
• labial consonants and central vowels,
• coronal consonants and front vowels,
• dorsal consonants and back vowels.

Across syllables, there will be a significant tendency for:
• co-occurrence of both reduplication and variegation,
• manner over place changes for consonants in variegated syllables,
• height over front-back changes for vowels in variegated syllables.
3. Method

3.1. Participants

Twenty infants (4 infants per language) were observed in their normal daily environment. Infants were described as developing typically according to community standards and reports from parents and physicians regarding developmental milestones. All infants were monolingual learners of Turkish, French, Romanian, Dutch and Tunisian Arabic. These languages represent diverse language families: French and Romanian are Romance languages, Dutch is a West-Germanic language, Turkish is a Ural-Altaic language and Tunisian belongs to the Arabic language family. Table 1 summarizes descriptive data for participants.
3.2. Data collection

One hour of spontaneous vocalization data was audio- and video-recorded every two weeks, from 8 through 25 months, in the infants' homes. Parents were asked to follow their normal activities with their child. No extra materials were introduced into the environment, so that samples reflected the infants' typical vocalizations in familiar surroundings.
3.3. Data analysis

Spontaneous vocalization samples during canonical babbling were analyzed. 'Canonical babbling' was defined as beginning with the onset of rhythmic speech-like syllables, based on parent report. Data were analyzed until each child was chronologically 12 months of age. In total, 165 hours of spontaneous data were phonetically transcribed using the International Phonetic Alphabet with broad phonetic transcription conventions. All singleton consonants and vowels, as well as perceptually rhythmic syllable-like vocalizations, were transcribed. Tokens were grouped into single utterance strings separated by 1 second of silence, noise or adult speech. Transcribed data were entered into Logical International Phonetic Programs (LIPP, Oller & Delgado, 1990) for analysis of patterns.

Table 1. Participants and data analyzed.

Language    Language family    Number of participants    Number of one-hour sessions
French      Romance            4                         32
Romanian    Romance            4                         33
Tunisian    Arabic             4                         27
Turkish     Ural-Altaic        4                         34
Dutch       West-Germanic      4                         39
Total                          20                        165
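The 1-second separation criterion for utterance strings can be made concrete with a small sketch. The token format below (onset and offset times in seconds, plus a label) is a hypothetical representation of the transcripts, not the actual LIPP data structure.

    # Sketch of the utterance segmentation rule: a new utterance string begins
    # whenever at least 1 second of silence, noise or adult speech separates
    # two consecutive infant tokens. Token format is hypothetical.

    def segment_utterances(tokens, gap=1.0):
        """tokens: list of (onset, offset, label) tuples sorted by onset time."""
        utterances, current = [], []
        for token in tokens:
            if current and token[0] - current[-1][1] >= gap:
                utterances.append(current)   # close the current utterance string
                current = []
            current.append(token)
        if current:
            utterances.append(current)
        return utterances

    # Two tokens 0.4 s apart form one utterance; a 1.5 s gap opens a new one.
    demo = [(0.0, 0.3, "ba"), (0.7, 1.0, "ba"), (2.5, 2.9, "da")]
    print(len(segment_utterances(demo)))  # -> 2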
A variety of phonetic characteristics were considered. Consonants were grouped according to 1) manner of articulation: oral and nasal stops, oral and glottal fricatives, glides, and other (i.e. trills, taps and affricates), and 2) place of articulation: labial (bilabial, labiodental, labiopalatal and labiovelar), coronal (dental, alveolar, postalveolar and palatal), dorsal (velar and uvular) and guttural (pharyngeal and glottal). Glides were considered as consonants, as they share the consonantal property of accompanying the mouth-closing phase of babbling. Vowels were grouped according to 1) backness: front, central and back, and 2) height: high, mid and low. An 'other' category included all segments that could not be perceptually identified by transcribers as specific consonants or vowels (i.e. UC, undefined consonant; UV, undefined vowel). For all sounds occurring in perceptually rhythmic syllable contexts, within-syllable consonant-vowel (CV) co-occurrence patterns were analysed. For this analysis, consonants were grouped into three categories according to place of articulation: labial, coronal and dorsal. Vowels were grouped into front, central and back dimensions. For across-syllable patterns, utterance strings were considered reduplicated if all consonant and vowel types were identical. Variegated strings were designated by changes in consonant place or manner, vowel height or front-back dimension, or both.
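These groupings amount to lookup tables from segments to categories. The sketch below, with a deliberately abbreviated and hypothetical segment inventory, shows how within-syllable CV co-occurrences could be tallied into the three-by-three place-by-backness grid used here.

    # Abbreviated, illustrative lookup tables; the study's full inventory
    # covers many more IPA segments than shown here.
    from collections import Counter

    PLACE = {"b": "labial", "m": "labial", "d": "coronal",
             "n": "coronal", "g": "dorsal"}
    BACKNESS = {"i": "front", "e": "front", "a": "central",
                "o": "back", "u": "back"}

    def cv_cooccurrences(syllables):
        """syllables: (consonant, vowel) pairs from rhythmic syllables."""
        counts = Counter()
        for c, v in syllables:
            if c in PLACE and v in BACKNESS:  # ignore segments outside the grid
                counts[(PLACE[c], BACKNESS[v])] += 1
        return counts

    demo = [("b", "a"), ("d", "i"), ("g", "u"), ("b", "a")]
    print(cv_cooccurrences(demo))
    # Counter({('labial', 'central'): 2, ('coronal', 'front'): 1,
    #          ('dorsal', 'back'): 1})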
4. Results

4.1. Utterance structures

Table 2 displays frequency of occurrence for utterances and segments, together with the C/V ratio. The total number of utterances across all languages was 38,719, ranging from 3,409 (Turkish) to 10,623 (Dutch). Overall, the number of segments totalled 168,145. In all languages, the number of vowels exceeded the number of consonants, as illustrated by the C/V ratio. Overall, 57,472 consonants were analysed, ranging from 6,771 (Turkish) to 16,760 (Tunisian) across languages. For each language, percentages and totals of consonants occurring at more than 5% are given in Appendix A. A total of 69,007 vowels were transcribed (see Appendix B for percentages and totals of vowels occurring at more than 5%).

Table 2. Frequency of occurrence of segments and utterances.

Language    Utterances   Consonants   Vowels   C/V ratio   Other   Total segments
French      10,085       9,462        12,196   0.78        320     32,063
Romanian    8,280        9,512        11,807   0.80        19      29,618
Tunisian    6,322        16,760       19,145   0.88        82      42,309
Turkish     3,409        6,771        8,201    0.83        1,595   19,967
Dutch       10,623       14,967       17,658   0.85        940     44,188
Total       38,719       57,472       69,007   0.83        2,956   168,145
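As a quick consistency check, the C/V ratio column of Table 2 follows directly from the consonant and vowel counts; recomputing it (sketch below) recovers the published values up to rounding (Romanian comes out 0.81 when rounded, against 0.80 in the table).

    # Recompute the C/V ratio column of Table 2 from the raw counts.
    counts = {
        "French":   (9462, 12196),
        "Romanian": (9512, 11807),
        "Tunisian": (16760, 19145),
        "Turkish":  (6771, 8201),
        "Dutch":    (14967, 17658),
    }
    for language, (consonants, vowels) in counts.items():
        print(f"{language}: C/V = {consonants / vowels:.2f}")
    # French 0.78, Romanian 0.81, Tunisian 0.88, Turkish 0.83, Dutch 0.85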
4.2. Consonant characteristics

Manner of articulation: Some similarities, as well as some striking differences across languages, were apparent with respect to manner of articulation. Figure 1 displays manner of articulation results for all consonants in the corpus. Oral stops were most frequent overall (43.5%). Four languages out of five exhibited this trend: oral stops accounted for 51.5% of consonants in French, 51% in Romanian, 42.5% in Dutch and 57.5% in Turkish. Tunisian infants, in contrast, produced only 29.5% stops. A high percentage of glottal fricatives was observed in Tunisian (31.5%) and Dutch (25.5%). In Tunisian, the glottal fricative [h] was the single most frequent consonant type (31.5%), almost equal in frequency to the stops (29.5%); it represented 25.5% of occurrences for Dutch infants. Setting glottal fricatives aside, glides (15%) and nasals (12%) were the next most frequent manners of articulation across languages. French infants produced twice the group average for nasals, while Dutch and Tunisian infants produced far fewer nasals. Finally, in all languages children produced more oral stops, nasals and glides than other manners of articulation (Z-test, p ≤ 10⁻⁶). This result confirms our first hypothesis of a significantly higher proportion of stop, nasal and glide consonant manners of articulation.
Figure 1. Consonant manner of articulation (percentage of occurrence, by language).
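The chapter reports Z-tests on proportions without spelling out the exact formulation; a standard two-proportion z statistic with a pooled standard error is one plausible reading. The sketch below uses invented counts for illustration only.

    # Two-proportion z-test sketch (pooled standard error). The counts in the
    # example are invented; they do not come from the corpus.
    from math import erf, sqrt

    def two_proportion_z(k1, n1, k2, n2):
        p1, p2 = k1 / n1, k2 / n2
        pooled = (k1 + k2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_two_sided

    # E.g. 3,500 of 5,000 consonants in the favored manners versus 1,500 of
    # 5,000 in the remaining manners.
    z, p = two_proportion_z(3500, 5000, 1500, 5000)
    print(f"z = {z:.1f}, p = {p:.2g}")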
Place of articulation: Figure 2 displays place of articulation results for all consonants in the corpus. Coronals were the most frequent overall, at 47%. In French, however, labials (47%) were most frequent; in particular, the labial nasal [m] was frequently produced (21.5%). Tunisian infants produced more glottals, reflecting the high frequency of the glottal fricative [h] noted above for manner of articulation; the second most frequent Tunisian place category was coronal. Across all languages, there were more glottals than dorsals, owing to Tunisian and Dutch. Our second hypothesis is confirmed: the proportion of labials and coronals is significantly higher than the proportion of dorsals and glottals in each of the five languages (Z-test, p ≤ 10⁻⁶).
Figure 2. Consonant place of articulation (percentage of occurrence, by language).
4.3. Vowel characteristics

Vowel frequencies ranged from 8,201 (Turkish) to 19,145 (Tunisian). In each language, two or three vowels accounted for 50% of all vowel occurrences. Only the low central vowel [a] occurred with a frequency of more than 5% in all five languages. Vowels in the lower left quadrant of the vowel space (mid and low, front and central) were separated and compared with all other vowel types (Figure 3). Overall, mid and low front and central vowels were most frequent: combining the five languages, the lower left quadrant category accounted for 66% of all vowels. This analysis confirms our third hypothesis that children produce more vowels from the lower left quadrant than other vowel types. In each language, the difference between the two groups is statistically significant, showing a predominance of vowels from the lower left quadrant (Z-test, p ≤ 10⁻⁶).
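The lower left quadrant criterion can likewise be written as a small predicate over the height and backness groupings; the vowel inventory below is abbreviated and illustrative, not the study's transcription set.

    # Flag vowels in the lower left quadrant: mid or low height combined with
    # front or central backness. Inventory abbreviated for illustration.
    HEIGHT = {"i": "high", "e": "mid", "æ": "low", "a": "low",
              "ə": "mid", "o": "mid", "u": "high"}
    BACKNESS = {"i": "front", "e": "front", "æ": "front", "a": "central",
                "ə": "central", "o": "back", "u": "back"}

    def lower_left_quadrant(vowel):
        return (HEIGHT[vowel] in {"mid", "low"}
                and BACKNESS[vowel] in {"front", "central"})

    sample = ["a", "e", "u", "æ", "o"]
    print([v for v in sample if lower_left_quadrant(v)])  # -> ['a', 'e', 'æ']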
In French, the low central vowel [a] and the mid front rounded vowel [œ] together represented approximately 60% of vowels; three other vowels occurred at more than 5%. In Tunisian, the two most frequent vowels were [æ] and [e]; only [a] also occurred at more than 5%. Dutch infants exhibited a high percentage of both central vowels [] and [a]; three others occurred at more than 5%. In Turkish, [] occurred at more than 29%, with other vowels at more than 5%: [], [æ], [a], [], [u], [ɨ] and []. Romanian infants produced [a] most frequently (29.5%); five other vowels occurred at frequencies above 5%.
Figure 3. Lower left quadrant vowels compared with other vowel types (percentage of occurrence).