Emergence and Adaptation: Studies in Speech Communication and Language Development. Dedicated to Björn Lindblom on his 65th birthday
Editors
Randy Diehl, Austin, Tex. · Olle Engstrand, Stockholm · John Kingston, Amherst, Mass. · Klaus Kohler, Kiel
72 figures and 8 tables, 2000
S. Karger, Medical and Scientific Publishers
Basel · Freiburg · Paris · London · New York · New Delhi · Bangkok · Singapore · Tokyo · Sydney
Fax +41 61 306 12 34, E-Mail [email protected], www.karger.com
All rights reserved. No part of this publication may be translated into other languages, reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, microcopying, or by any information storage and retrieval system, without permission in writing from the publisher or, in the case of photocopying, direct payment of a specified fee to the Copyright Clearance Center (see ‘General Information’). © Copyright 2000 by S. Karger AG, P.O. Box, CH–4009 Basel (Switzerland) Printed in Switzerland on acid-free paper by Reinhardt Druck, Basel ISBN 3–8055–7147–X
Vol. 57, No. 2–4, 2000
Contents
083 Foreword

Acoustic Patterning of Speech: Its Linguistic and Physiological Bases
085 Investigating Unscripted Speech: Implications for Phonetics and Phonology. Kohler, K.J. (Kiel)
095 Emotive Transforms. Sundberg, J. (Stockholm)
113 The Source-Filter Frame of Prominence. Fant, G.; Kruckenberg, A.; Liljencrants, J. (Stockholm)
128 The C/D Model and Prosodic Control of Articulatory Behavior. Fujimura, O. (Columbus, Ohio)
139 Diverse Acoustic Cues at Consonantal Landmarks. Stevens, K.N. (Cambridge, Mass.)

Perceptual Processing
152 Modeling and Perception of ‘Gesture Reduction’. Carré, R. (Paris); Divenyi, P.L. (Martinez, Calif.)
170 General Auditory Processes Contribute to Perceptual Accommodation of Coarticulation. Holt, L.L. (Pittsburgh, Pa.); Kluender, K.R. (Madison, Wisc.)
181 Adaptive Dispersion in Vowel Perception. Johnson, K. (Columbus, Ohio)
189 Language Acquisition as Complex Category Formation. Lotto, A.J. (Chicago, Ill.)
Biology of Communication and Motor Processes
197 Singing Birds, Playing Cats, and Babbling Babies: Why Do They Do It? Sjölander, S. (Linköping)
205 The Phonetic Potential of Nonhuman Vocal Tracts: Comparative Cineradiographic Observations of Vocalizing Animals. Fitch, W.T. (Cambridge, Mass.)
219 Dynamic Simulation of Human Movement Using Large-Scale Models of the Body. Pandy, M.G.; Anderson, F.C. (Austin, Tex.)

En Route to Adult Spoken Language: Language Development
229 An Embodiment Perspective on the Acquisition of Speech Perception. Davis, B.L.; MacNeilage, P.F. (Austin, Tex.)
242 Speech to Infants as Hyperspeech: Knowledge-Driven Processes in Early Word Recognition. Fernald, A. (Stanford, Calif.)
255 The Construction of a First Phonology. Vihman, M.M. (Gwynedd); Velleman, S.L. (Amherst, Mass.)

Auditory Constraints on Sound Structures
267 Searching for an Auditory Description of Vowel Categories. Diehl, R.L. (Austin, Tex.)

Commentaries
275 Imitation and the Emergence of Segments. Studdert-Kennedy, M. (New Haven, Conn.)
284 Deriving Speech from Nonspeech: A View from Ontogeny. MacNeilage, P.F.; Davis, B.L. (Austin, Tex.)
297 Developmental Origins of Adult Phonology: The Interplay between Phonetic Emergents and the Evolutionary Adaptations of Sound Patterns. Lindblom, B. (Stockholm)

315 Publications Björn Lindblom
322 Index autorum Vol. 57, 2000
after 322 Contents Vol. 57, 2000
Phonetica 2000;57:83
Foreword
Dear Björn,
On the occasion of your retirement, and your 65th birthday, we wanted to honor you and thank you for everything you have done for us at Stockholm University and for the scientific community around the world. Your dogged defiance of disciplinary boundaries has stimulated fruitful communication between investigators from many fields of research. You have motivated higher standards of intellectual rigor and roused a renewed interest in testing theoretical models. And, perhaps most importantly, you have made serious and successful efforts to make phonetics explanatory rather than merely descriptive.
In the spirit of your work and thinking, we announced a symposium for interdisciplinary discussions on ‘Speech Communication and Language Development’. The response from the international scientific community was enthusiastic – both from researchers and sponsors!1 This made possible an intensive 3-day conference (17–19 June, 1999) at Stockholm University with superb contributions from phonetics, psychology, biology, neurology, zoology, engineering, music and mathematics. A congenial atmosphere provided the perfect setting for a truly fruitful scientific meeting, and we hope that the insights and the interdisciplinary exchange that emerged at the symposium will be reflected in the composition of this volume, which preserves the sequence of contributions and the thematic sections of the conference.
With this we wish you a happy and productive retirement. May you continue to have success with solving the fundamental problems in phonetics!
Francisco Lacerda and Christine Ericsdotter Bresin

Photo by Yngve Fransson.
1 We are grateful for the generous support by The Wenner-Gren Foundation, The Bank of Sweden Tercentenary Foundation, The Swedish Council for Research in the Humanities and Social Sciences, The Swedish Council for Social Research, and Stockholm University.
Acoustic Patterning of Speech: Its Linguistic and Physiological Bases
Phonetica 2000;57:85–94
Received: October 6, 1999 Accepted: December 20, 1999
Investigating Unscripted Speech: Implications for Phonetics and Phonology
K.J. Kohler
Institut für Phonetik und digitale Sprachverarbeitung, Universität Kiel, Germany
Abstract This paper looks at patterns of reduction and elaboration in speech production, taking the phenomenon of plosive-related glottalization in German spontaneous speech, on the basis of the ‘Kiel Corpus’, as its point of departure, and proposes general principles of human speech to explain them. This is followed by an enquiry into the nature of a production-perception link, based on complementary data from perceptual experiments. A hypothesis is put forward as to how listeners cope with the enormous phonetic variability of spoken language and how this ability may be acquired. Finally, the need for a new paradigm of phonetic analysis and phonological systematization is stressed, as a prerequisite to dealing adequately and in an insightful way with the production and perception of spontaneous speech. Copyright © 2000 S. Karger AG, Basel
Introduction
The phonetic question to be discussed in this paper may be highlighted with a joke by the British comedians Ronnie Barker and Ronnie Corbett from one of their television shows [Davidson and Vincent, 1978, p. 142]: ‘Four girls were disqualified for cheating in the Miss Greater Manchester competition last night. They were Miss Altrincham, Miss Paddingham, Miss Pumpingham and Miss Stuffingham.’ For those readers who are not too familiar with English geography and with the intricacies of the spelling-sound relationship of English place names, Altrincham is a town south of Manchester, pronounced [ˈɔːltrɪŋəm], just like Birmingham or Nottingham. The places the other three beauty competitors are suggested to come from do not exist but are given names that follow the same pattern of name formation in -ham [əm]. Over and above that, the real and the pseudo place names are also interpretable as phrasal constructions of verb + them with articulatory reduction at the sentence and utterance levels. The two Ronnies play on the linguistic indeterminacy of speech signals, which results from the enormous flexibility of word production in utterance context, irrespective of regional and social variation of a language. Their word play builds on, and depends on, the listener’s intuitive knowledge of a many-to-one relationship between word sequences and their acoustic output in speech production, on the one hand, and of a
K.J. Kohler, Institut für Phonetik und digitale Sprachverarbeitung, Universität Kiel, D–24098 Kiel (Germany); Tel. +49 431 880 3319, Fax +49 431 880 1578, E-Mail [email protected]
[email protected] one-to-many relationship between acoustic speech signals and units of meaning in speech perception and comprehension, on the other. Without the phonetic variability and semantic indeterminacy as essential constituents of speech communication, jokes of this kind, examples of which may be multiplied across languages, would fall flat. The question then arises as to how phoneticians and phonologists can cope with these facts in their modelling of speech communication. To give an adequate answer they should realize that humans communicate, first and foremost, in unscripted interaction and that, to gain insight into speech production and perception processes as well as into speech development, the variability of unscripted speech should therefore be a focus of attention. They must also be aware that this variability goes far beyond allophonic variation of segmental-type phonemes in word pronunciations; it includes, amongst others, the phonic flexibility of speaking styles and adjustments to the communicative situation. However, phonetic analysis and phonological systematization have not paid due tribute to this prime importance of unscripted speech communication. They have instead concentrated on lab, or at best textual, speech frames for the production and perception of sounds in words, rather than of words in utterances, and they have done this with a view to setting up patterns of rigid invariance in word phonology instead of providing rules of structured variability in utterance phonology. The latter approach is needed for an explication of everyday speech communication. Utterances set frames for phonetic flexibility of words in speech production, and words require utterance embedding to be perceived and understood appropriately. On the one hand, this utterance dependence controls the phonetic coalescence and semantic ambiguity in context, as is exemplified in the joke, and on the other hand, it is reponsible for extreme articulatory reduction but nevertheless correct decoding in context, as is shown by countless instances of spontaneous speech. Remove the utterance contexts, and the remaining word sequences become unintelligible. This may be illustrated by the following example from the ‘Kiel Corpus of Spontaneous Speech’ [IPDS, 1995, 1996, 1997]: nun wollen wir mal kucken (‘now let’s see’, g122a009) in the phonetic form [næu’ ˘˜ n˘ υ fl˜† Na lkˆTk\] for unreduced [nu’n v˘l˜n vi’fl mal lkˆTk\]. It has strong nasalization across its first three syllables relating to syllable-final nasal consonants, which are reduced (deleted or shortened) in this hypo as against the hyper pronunciation. There is additional labiodentalization around the third syllable representing canonical [v] of wir. Other possible realizations are [næu’ ˘˜ N/m˜fl ma lkˆTk\]/[næu’ ˘˜ fl˜ ma lkˆTk\], where the apical gesture of the medial nasal is also eliminated or the consonant deleted altogether. A native speaker of German has no problem in understanding the whole utterance, but when kucken is removed, decoding the larger section that remains becomes impossible. The spectrograms of figures 1 and 2 compare the spontaneous speech manifestation of this sentence with a careful reading pronunciation, contrasting the acoustic consequences of levelled versus extensive articulatory movements. 
This paper aims to correct the traditional paradigm of phonetic/phonological sound segment and word orientation, taking a large corpus of spontaneous German dialogue data (‘The Kiel Corpus of Spontaneous Speech’) as its point of departure. Four questions will be addressed. The first deals with data and theory of speech production: (1) What are the patterns of reduction and elaboration in speech production and what general principles, governing human speech, can be adduced to explain them? The next two deal with data and theory of speech perception as well as with speech development: (2) What role do the production phenomena play in perception and what is the production-perception link?
Fig. 1. Speech wave, spectrogram and SAMPA label window for the ‘Kiel Corpus of Spontaneous Speech’ utterance nun wollen wir mal kucken (‘now let’s see’) in dialogue turn g122a009: reduced speech.
(3) How do communicators succeed in relating a large array of phonetic forms to the ‘same’ item – a word or an utterance – and how may this ability be acquired? The fourth question deals with a new paradigm for phonetics and phonology: (4) What categories of description are necessary to systematize the variability of speech for an adequate and insightful account of its production and perception? The discussion will pick out a frequently occurring phenomenon of unscripted German speech – glottalization in connection with plosive articulation – and interpret findings from corpus data on speech production and from experimental data on speech perception.
Fig. 2. Speech wave, spectrogram and SAMPA label window for a reading pronunciation of nun wollen wir mal kucken by speaker K.J.K.: elaborated speech.
Glottalization in the Production of Plosives
Survey of Data on Stop Production in German
In German connected speech – text reading and especially spontaneous dialogues – glottalization, in alternation with, or in addition to, more forceful glottal stops is a very common phenomenon. It not only applies to the context of word-initial vowels, which has been acknowledged in textbooks for a long time, but also to two further contexts: (1) sonorant – plosive – sonorant (especially nasal) for fortis as well as lenis stops at all places of articulation, e.g. könnten [kœnn̰n] or Stunden [ʃtʊnn̰n] or sind noch [zɪnn̰ nɔx], instead of the more elaborate canonical pronunciations [kœntʰən], [ʃtʊndən] (or [kœntn̩], [ʃtʊndn̩] with nasal plosion), [zɪnt nɔx]; figure 3 contrasts the reduced forms of können [kœnn], with a modal-voice nasal, and könnten [kœnn̰n], with a glottalized nasal; (2) vowel – fortis plosive – consonant (especially nasal), e.g. zweiten [tsvaɪ̰(ɪ)n], Leipzig [laɪ̰ptsɪç], hat nicht [ha̰ nɪç]. Both these types of glottalization are related to plosives in more elaborate, especially citation-form renderings of the relevant words, without or with intervening morpheme or word boundaries after the plosive. In these cases a simple glottal valve action is used to cut off the air stream for stop articulation.
Fig. 3. Spectrogram of (Die) können uns (abholen) (‘They can collect us’, top) and of (Die) könnten uns (abholen) (‘They could collect us’, bottom): read speech, speaker K.J.K.
This may happen in addition to or instead of a more complex combination of supraglottal oral and velic closures. A single long hold for a glottal stop may also be relaxed into irregular pulsing of rather long periods, compared to the environment. Moreover, the stop is not released into a vowel but in most cases is followed by another complete or partial oral occlusion – nasal, plosive or lateral. This articulatory sequencing is usually the result of [ə] elision before nasals or laterals of canonical forms. In the complete nasal environment of case (1), the oral closure may be at the same place of articulation, and be accompanied by velic opening, throughout the sequence, as the interruption of the air stream is transferred to the glottal valve. Under condition (1) there are four possibilities of temporal alignment of glottalization with the sonorant: medial, final, initial or complete irregular voicing, i.e. (a) [nn̰n], (b) [nn̰], (c) [n̰n], (d) [n̰n̰] for apical nasal articulation in, e.g., könnten. The following distributions have been found for the four categories: (a) is by far the most common in all contexts; for fortis stops (c) is the next most frequent, for lenis stops it is (b). In the context of condition (1) there are also occurrences of voiceless nasals instead of fortis, and of breathy-voiced or voiceless nasals instead of lenis stops, due to glottal (interarytenoidal) opening. In all these cases the modal-voice nasal context is interrupted by a different type of phonation as a residue of more complex plosive articulations.
But the lenis context also allows a further progression to modal voice, i.e. reduction to [nn]. In the fortis context this is only possible in unstressed function words and elements of compounds, e.g. -zehnten in numerals. This change may be complete, or there may be a very weak trace of the plosive in the form of a medial amplitude and/or F0 dip in the nasal stretch. So the speaker still signals a break, albeit towards the low-effort end of a reduction scale ranging from plosive to complete nasalization [for further details, see Kohler, 1996a, 1998].
Explaining the Data with Reference to General Principles of Speech Production
All the phonetic realizations in nasal contexts of (1) eliminate the need for a synchronization of velic control and increase coarticulatory ease through a transfer of the valve action from the velum to the glottis. The velum can thus remain lowered in the entire sequence, but for canonical fortis plosives and the majority of lenis ones the listener is still guaranteed a signal break through a glottal stop, glottalization or some other change in phonation. These articulatory reductions at the utterance level may be regarded as instances of a more economical articulatory reorganization under the general principle of economy of effort, even if it is at present not quantifiable. In the typical nasal plosion context of condition (2), glottal compression is also possible although less frequent than in (1). Since the synchronization of velic control is then again no longer crucial, this glottal adjustment also constitutes an increase of coarticulatory ease through reorganization. Glottal stop and more relaxed glottalization can thus accompany plosive articulations in non-aspirated environments, which require adducted, rather than abducted, vocal folds. In nasal contexts a scale of glottal phenomena can take over the stop function altogether and thus allow velic action and velic synchronization, i.e. movements of a sluggish articulator, to be eliminated. Moreover, the timing of these phonation changes within the nasal stretch can be left indeterminate, as long as they occur. That reduces the demands on articulator coordination and helps to ease production whenever called for by the context of situation.
Plosive-Related Glottalization in Perception
A Hypothesis
Since the production patterns of plosive-related glottalization are widespread, rule-governed and referable to general principles of speech production, it must be assumed that their acoustic manifestations play a fundamental role in speech perception. The question thus is in what way listeners make use of, e.g., the presence/absence of glottalization and of its temporal indeterminacy to restore the intended utterances containing words with or without stops.
A Perception Experiment
For this purpose the utterance die können uns abholen (‘they can collect us’) of figure 3 (top) was used as the base for stimulus generation for a perceptual experiment. About 65 ms of glottalization from another utterance die könnten uns abholen (‘they could collect us’) of figure 3 (bottom) were spliced into the long nasal of können, replacing its initial, its central or its final section, or (by doubling) the entire length. Figure 4 shows the speech waves of original können and of the 4 glottal splicings.
Fig. 4. Speech waves of original Die können uns ab(holen) (top), Die könnten uns ab(holen) (bottom), and 4 stimuli derived from Die können uns ab(holen) by splicing, with glottalization replacing the second half, the first half, the centre, or the total length of the nasal, respectively. Original stimuli as in figure 3. Stimuli generated for perception experiment.
Furthermore, the modal nasal of können was lengthened and shortened in another two stimuli. Finally, both original stimuli together with the 6 manipulated ones were duplicated 10 times, randomized and presented to 23 subjects in a formal listening test for identification of the test utterance as containing either the word können or the word könnten. The results are very clear: the presence of glottalization produces practically 100% könnten judgements, its absence practically 100% können judgements. So the presence as well as the temporal indeterminacy of a glottalized section in production are mapped onto perception. The listener decodes the break in modal voicing of the nasal – at least as long as it is of a duration typically found in production – as a cue to a stop, and ignores the precise synchronization with the nasal, in the same way as the speaker differentiates between presence or absence of a stop [for further details see Kohler, 1999].
Prolegomena to a Perception Theory for Speech Communication
What type of speech perception theory can account for this mapping of stop articulations? The Motor Theory or Direct Realism are most unlikely candidates because there is no invariant articulatory gesture that may be said to underlie the large variability in production for the listener to recover. What happens on the part of the speaker is a short-term interference with the air stream, which in the extreme case is a complete stoppage of airflow out of mouth and nose. This essential aerodynamic goal may be achieved by widely differing articulatory gestures: a simple glottal or a more complex supraglottal occlusion or both. The glottal obstruction may be relaxed, resulting in glottalization and other phonation types that deviate from surrounding modal voice, even if only by a decrease in amplitude and/or fundamental frequency. They thus still mark the intended goal of local air stream interference. Which mechanism is actually used by the speaker depends on the amount of effort that is to be put into the production of speech as a function of utterance context and communicative situation. If the supraglottal occlusions are relaxed, the closure periods go to zero, which means nasalization in a nasal context and lenition in an intervocalic context. Both are widespread in the languages of the world. In the case of nasalization the short-term interference with the air stream may be removed altogether; similarly, intervocalic lenition can result in complete vocalic integration. As long as air stream interference for intended stop articulation is produced by the speaker, in whatever way, it is mapped onto the acoustic signal as a short-term change from modal voice. This, in turn, is what listeners can rely on as a common feature, if they have learnt to group the variability in their phonetic experience through exemplar-based learning, in the way Björn Lindblom [this vol.] proposes in his paper. So a perception theory that is to cope successfully with reduction phenomena in spontaneous speech will have to have an acoustic and auditory base. And it can no longer postulate phonetic invariance of any sort, articulatory or otherwise, because the speech material the listener receives for decoding from speakers in everyday interactions is simply not made that way, yet listeners still cope extremely well with this structured variability. Constraints from higher levels of speech processing will also have to be integrated into such a perception theory for spontaneous speech.
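The splicing manipulation used for the perception experiment above lends itself to a simple waveform operation. The following Python sketch reproduces its logic under stated assumptions: the file names, the hand-labelled nasal boundaries and the use of numpy/soundfile are illustrative choices of this sketch, not details of the original study.

# Sketch of the splice-based stimulus generation described above.
# File names and nasal boundaries are hypothetical placeholders; in the
# experiment they came from hand-labelled Kiel Corpus recordings.
import numpy as np
import soundfile as sf

def splice(base, donor, start, end, where):
    """Replace part of the nasal in `base` (samples start:end) with
    glottalized material from `donor` (~65 ms cut from 'koennten')."""
    out = base.copy()
    if where == 'total':
        # donor doubled (np.resize repeats it) to cover the whole nasal
        out[start:end] = np.resize(donor, end - start)
    else:
        n = len(donor)
        pos = {'initial': start,
               'central': (start + end - n) // 2,
               'final': end - n}[where]
        out[pos:pos + n] = donor
    return out

base, fs = sf.read('koennen.wav')          # Die können uns abholen
donor, _ = sf.read('glottalization.wav')   # glottalized stretch from könnten
nasal_start, nasal_end = 11240, 12680      # hypothetical label positions
for where in ('initial', 'central', 'final', 'total'):
    stim = splice(base, donor, nasal_start, nasal_end, where)
    sf.write(f'stimulus_{where}.wav', stim, fs)

In the actual test, the two originals, the four splicings and the two length-manipulated stimuli were each presented 10 times in random order.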
If German listeners receive an acoustic signal that does not contain a break in nasal modal voicing, they will interpret it as having no stop component at the signal level, e.g. as being können rather than könnten, or the cardinal number dreizehn rather than the ordinal dreizehnten. But since in unstressed function words and numeral components modal-voice nasalization may
also affect fortis stops (the ‘Kiel Corpus’ provides several instances of [tseːn(n)] -zehnten, e.g. g125a005, g256a001), a stopless signal may actually refer to an intended subjunctive or ordinal form containing a stop. So -zehnten in am dreizehnten November (‘on November 13th’) and -zehn in an dreizehn Novembertagen (‘on 13 days in November’) may coalesce phonetically, but they will, of course, still be decoded as the ordinal and the cardinal number, respectively, due to top-down interpretation.
Some Thoughts on Speech and Language Development
The next question concerns how the link between the production of structured variability in utterances and their correct decoding develops in speech and language acquisition. I find Björn Lindblom’s reference to the roles of motor constraints and of perceptual experience very attractive [Lindblom, this vol.]. Projected onto our data, his ‘low-energy articulatory search’ would enable the child spontaneously to discover simple glottal valve patterns alongside other more complex stop mechanisms, both used by the ambient phonology. The low-cost motor patterns of adult reduced speech would thus also be arrived at by the child in a playful exploration of basic air stream control. The child would thus gain phonetic experience with its own speech actions and their relationships from the energy angle, and it would, in parallel, gain experience with the acoustic patterns produced by adults, which show a certain degree of congruence with the child’s own in respect of low-cost stop control. It could thus be able to form ‘perceptual categories as emergents of phonetic experience’. Research into child language should pursue these questions by looking at the development of phonation types, also in connection with stop articulations.
A New Paradigm for Phonetics and Phonology
The data presented in this paper cast new light on the theories of speech production and perception that are required for a modelling of the speech communication process. We need a new paradigm for phonetic analysis and phonological systematization as a prerequisite for such theory development [Kohler, 1996b]. The following five points summarize the essentials of such a new paradigm. (i) Connected and especially unscripted speech must be given a much more prominent role in speech research than has been customary. We need speech data of at least sentence size in natural and meaningful contexts because it is only there that reductions occur and can be tested perceptually. (ii) The focus of attention in speech production and perception is to be shifted from word to utterance phonology, from phonemes in words to phonetic shapes of words in utterances. (iii) The search for invariants at any level of production or perception is to be given up in favour of the grouping of variants according to general principles of motor constraints and according to perceptual experience and exemplar-based phonetic memory. This approach emphasizes the importance of signal statistics in category formation, which traditional phonology, in contrast to automatic speech recognition, has denied. (iv) The strictly linear segmental phonemic frame for phonological systematization has to be complemented with non-linear componential features referring to any articulatory or phonatory aspect. This is mandatory in view of the temporal indeterminacy
that was found in plosive-related glottalization in German production and perception data: if a phonetic feature is not used in a linear segmental way by speakers and listeners, it should not be treated as such in phonological analysis. It is characteristic of reduced speech, compared to elaborated canonical phonetic forms, that the linear segmentability customarily applied to the latter becomes fuzzy: features of glottalization, nasalization, labi(odent)alization, palatalization, velarization, etc. become dissociated from specific linear segments and temporally indeterminate; they operate as long components or articulatory prosodies instead [Firth, 1948]. This may be illustrated by further examples from German that combine several of these prosodies: soll sie [zɔzi] (‘is she to’) vs. sollen sie [zɔ̃zi] (‘are they to’) vs. sollten sie [zɔ̰̃zi] (‘should they’) – for canonical [zɔl zi] vs. [zɔln zi] vs. [zɔltn zi] – are differentiated by the presence or absence of nasalization and/or glottalization in the vowel of the first syllable, and the linear segments /l, t, n/ of the canonical forms are no longer discernible. These articulatory prosodies are also relevant for the listener, as was shown by a formal listening test with these items in contextual frames: [zɔ̰̃zi] was predominantly decoded as sollten sie [for further details see Kohler, 1999]. Research into speech and language development in children would also benefit immensely if the established frame of phoneme acquisition were given up and the development of children’s long-component production patterns as well as the development of their more global perception patterns were studied instead. This would also assist the theory building sketched above. (v) Finally, the time-honoured division of the field of phonetic science into phonetics, which provides raw measurement data, and phonology, which cooks them and turns them into a good linguistic meal, has outlived itself, although it is still prevalent in the heads of many linguists. This division advocates the supremacy of phonological categories and their independence as well as pre-existence vis-à-vis phonetic substance, and it has led phonetic research astray on many an occasion in the past, e.g. when psychologists took over the phoneme concept and developed perception or language acquisition models, or when linguists created the subdiscipline of lab phonology and the model of articulatory phonology. The division has now reached the stage where it becomes a hindrance to advances in the theory and modelling of speech communication as it unfolds in real speakers and hearers in natural settings every day and everywhere.
References
Davidson, I.; Vincent, P.: The bumper book of the two Ronnies: the very best of the news (Allen, London 1978).
Firth, J.R.: Sounds and prosodies. Trans. Philol. Soc. 127–152 (1948).
IPDS: The Kiel Corpus of spontaneous speech, vols. 1–3. CD-ROM #2–4 (Institut für Phonetik und digitale Sprachverarbeitung, Kiel 1995–1997).
Kohler, K.J.: Phonetic realization of German /ə/-syllables. Arbeitsber. Inst. Phonetik Univ. Kiel (AIPUK) 30: 159–194 (1996a).
Kohler, K.J.: Developing a research paradigm for sound patterns of connected speech in the languages of the world. Arbeitsber. Inst. Phonetik Univ. Kiel (AIPUK) 31: 227–233 (1996b).
Kohler, K.J.: The phonetic manifestation of words in spontaneous speech; in Duez, Proc. of the ESCA Workshop on Sound Patterns of Spontaneous Speech, La Baume-les-Aix 1998, pp. 13–22.
Kohler, K.J.: Articulatory prosodies in German reduced speech. Proc. 14th Int. Congr. Phonet. Sci., San Francisco 1999, pp. 89–92.
Lindblom, B.: Developmental origins of adult phonology: the interplay between phonetic emergents and the evolutionary adaptations of sound patterns. Phonetica 57: 297–314 (2000).
1 Graphic signal representations and speech output of utterances referred to in this paper can be found at the following URL: www.ipds.uni-kiel.de/examples.html.
Acoustic Patterning of Speech: Its Linguistic and Physiological Bases
Phonetica 2000;57:95–112
Received: November 1, 1999 Accepted: February 21, 2000
Emotive Transforms
Johan Sundberg
KTH Voice Research Centre, Department of Speech Music Hearing, KTH (Royal Institute of Technology), Stockholm, Sweden
Abstract Emotional expressivity in singing is examined by comparing neutral and expressive performances of a set of music excerpts as performed by a professional baritone singer. Both the neutral and the expressive versions showed considerable deviations from the nominal description represented by the score. Much of these differences can be accounted for in terms of the application of two basic principles, grouping, i.e. marking of the hierarchical structure, and differentiation, i.e. enhancing the differences between tone categories. The expressive versions differed from the neutral versions with respect to a number of acoustic characteristics. In the expressive versions, the structure and the tone category differences were marked more clearly. Furthermore, the singer emphasized semantically important words in the lyrics in the expressive versions. Comparing the means used by the singer for the purpose of emphasis with those used by a professional actor and voice coach revealed striking similarities. Copyright © 2000 S. Karger AG, Basel
Introduction
Almost everyone in the world actively seeks occasions to listen to music. Also, music is often called the ‘language of emotions’, a poetic rather than precise description, which presumably alludes to the fact that music often mediates strong emotional experiences to the listener. An important question emerges: how can music be understandable to almost anyone, and how can it be so closely linked to emotions? Obviously, the composer is responsible for a good deal of the emotional impact on the listeners. However, listening even to masterpieces of compositional art can be either wonderful or boring, depending on the performance. Thus, important contributions to musical experience derive also from the performer. Analysis of music performance should therefore be a worthwhile approach in the study of musical expressivity. In our music performance research we have applied the analysis-by-synthesis strategy (fig. 1) [Sundberg, 1993]. The input is a music file containing the information given by the score, complemented with chord symbols and phrase and subphrase markers.
Johan Sundberg KTH Voice Research Centre Department of Speech Music Hearing, KTH SE–10044 Stockholm (Sweden) Tel. +46 8 790 7873, Fax +46 8 790 7854
Fig. 1. The analysis-by-synthesis research paradigm. A music file is read by the Director Musices program, which identifies certain musical contexts and accordingly introduces departures from the nominal description of the piece provided by the score with respect to various available parameters, such as tone and pause duration, amplitude changes, vibrato parameters, and timbre.
A performance program, Director Musices [Friberg, 1991, 1995], reads the music file. It contains a number of performance rules which, depending on the musical context, insert micropauses and variations of amplitude, tempo, and vibrato. The output is a sounding performance. During its development the system has been continuously evaluated by an expert musician, Lars Frydén, and his suggestions have been implemented in Director Musices. In a sense, then, Director Musices is a generative grammar of music performance, reflecting essential aspects of Frydén’s musical competence. Independent corroboration of the rules has been gathered from listening tests [Friberg, 1995]. The Director Musices rules can be divided into two main groups according to their apparent purpose in music communication: differentiation rules and grouping rules. The differentiation rules enhance the difference between tone categories, such as scale tones, intervals, and note values. The grouping rules mark the musical structure, e.g. by inserting micropauses at structural boundaries. All rules are triggered by the musical context and thus ultimately reflect nothing but the musical structure [Palmer, 1989]. Apparently this implies that Director Musices performances cannot contain any emotional information. Yet, Bresin and Friberg [1998] recently demonstrated that with appropriately chosen tempo and loudness, Director Musices is indeed capable of producing performances that differ in emotional expressivity. This is achieved by varying the selection of rules applied and the magnitudes of the rule effects. For example, a performance may sound happy if the micropauses are made long, the tempo is quick, and the amplitude varies within and between tones. This finding is highly relevant to the issue of emotive transforms. It indicates that emotional expressivity can be derived directly from the music score simply by enhancing structural aspects of the piece. The Director Musices grammar has been developed for instrumental music. In vocal music the emotional coloring of the performance seems particularly salient, and singers tend to succeed quite well in communicating the intended emotional information to the listener [Kotlyar and Morosov, 1976; Scherer, 1995]. In forced-choice tests, listeners succeed in identifying intended emotions in about 80% of the cases, on average. Similar results have been found for performance on musical instruments [Gabrielsson, 1995; Juslin, 1997].
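To make the rule mechanism concrete, here is a minimal Python sketch of how context-triggered performance rules of this general kind can operate. The rule constants, the data layout and the scaling parameter k are illustrative assumptions of this sketch, not the actual Director Musices rules.

# Minimal sketch of context-triggered performance rules in the spirit of
# Director Musices; the constants and the quantity parameter k are invented.
from dataclasses import dataclass

@dataclass
class Tone:
    midi: int             # pitch as a MIDI note number
    nominal_ms: float     # duration as notated in the score
    phrase_end: bool      # structural boundary from the annotated music file
    perf_ms: float = 0.0
    micropause_ms: float = 0.0

def apply_rules(tones, k=1.0):
    prev = None
    for tone in tones:
        tone.perf_ms = tone.nominal_ms
        # differentiation rule: enhance the contrast between note values
        if prev is not None and tone.nominal_ms > prev.nominal_ms:
            tone.perf_ms *= 1.0 + 0.05 * k
        # grouping rule: insert a micropause at a structural boundary
        if tone.phrase_end:
            tone.micropause_ms = 30.0 * k
        prev = tone
    return tones

Varying the rule selection and the magnitude k is, in essence, how Bresin and Friberg [1998] obtained performances differing in emotional expressivity.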
Table 1. Examples analyzed

Composer | Song title | Text | Abbreviation | Character
Folk tune | Vi gå över daggstänkta berg, bars 1–8 | Vi gå över daggstänkta berg… | Folk tune | agitated
F. Schubert | Erlkönig, bars 72–79 | Mein Vater, mein Vater… | Mein Vater | agitated
R. Schumann | Dichterliebe VII, bars 12–18 | Wie Du auch strahlst… | Wie Du auch | agitated
R. Schumann | Liederkreis XII, bars 18–26 | Und der Mond… | Und der Mond | agitated
G. Verdi | Falstaff, Ford’s monologue, bars 24–31 | Laudata sempre sia… | Ford’s Monologue | agitated
G. Mahler | Lieder eines fahrenden Gesellen, song No. 3, bars 5–11 | Ich hab’ ein glühend Messer… | Ich hab’ ein | agitated
F. Mendelssohn | Paulus, Aria No. 18, bars 5–13 | Gott sei mir gnädig… | Mendelssohn | peaceful
F. Schubert | Du bist die Ruh, bars 8–15 | Du bist die Ruh… | Du bist die Ruh | peaceful
F. Schubert | Wanderers Nachtlied, bars 3–14 | Über allen Gipfeln ist Ruh… | Wanderers Nachtlied | peaceful
F. Schubert | Nähe des Geliebten, bars 3–8 | Ich denke dein… | Nähe des Geliebten | peaceful
R. Schumann | Dichterliebe VI, bars 31–42 | Es schweben Blumen und Englein | Es schweben | peaceful
R. Strauss | Zueignung, bars 21–29 | Und beschworst darin die Bösen… | Zueignung | peaceful
The purpose of the present, exploratory investigation was twofold: (1) to examine some examples of emotive transforms in singing, and (2) to compare examples of the acoustic code used for adding expressivity in singing with some examples used for marking emphasis in speech. Results from two experiments will be reported, one concerning singing and one concerning speech.
Experiment I
Material collected for a previous investigation was used [Sundberg et al., 1995]. Håkan Hagegård, an internationally well-known professional baritone, agreed to perform 17 excerpts of differing characters from the classical repertoire in two ways: as in a concert, and with as little expression as possible. The emotional expressivity of these examples was evaluated in two listening tests. In the first, a panel of expert listeners evaluated the difference in expressiveness between the neutral and expressive versions. Twelve excerpts, in which the expressive versions were perceived as particularly expressive as compared with the neutral versions, were selected for further analysis. In the second test, 5 experts classified the emotional quality of the expressive versions as either secure, loving, sad, happy, scared, angry, or hateful. The results, shown in table 1, implied that a majority perceived six examples as hateful or happy, i.e. agitated, and the remaining six as loving and secure, i.e. peaceful. The recordings were analyzed with regard to overall sound level, duration of tones, long-term average spectrum (LTAS), and fundamental frequency (F0). A detailed description of the experiment and the analysis can be found elsewhere [Sundberg et al., 1995]. Tone duration has been found to play a prominent role in music performances. The measurement of tone duration was based on the result of a previous experiment with synthesized sung performances [Sundberg, 1989]. As its definition in sung performances is crucial to the results of this study, this experiment will be briefly reviewed. In speech research syllables are generally measured as in orthography, such that a CV syllable starts at the onset of the C.
Fig. 2. Illustration of the segmentation of musical syllables according to the vowel onset-to-vowel onset criterion.
As a C is lengthened after a short V and shortened after a long V in singing, the duration of a C depends on the context. Hence, the duration of a CV syllable will differ depending on whether it is measured as in orthography or from vowel onset to vowel onset. In the synthesis experiment mentioned, the tones in a rhythmical example were assigned [la] and [la:] syllables segmented according to both these alternatives. The results revealed that a correct realization of the rhythmical structure was obtained only when syllables were segmented from vowel onset to vowel onset, as illustrated in figure 2. This demonstrates that syllable duration in singing should be measured from vowel onset to vowel onset.
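The criterion can be stated compactly: each sung syllable runs from one vowel onset to the next, so that a lengthened consonant is counted with the preceding vowel. A small Python sketch with invented label times makes this explicit.

# Vowel onset-to-vowel onset syllable durations, as argued for above.
# The onset times are invented stand-ins for hand-labelled values.
vowel_onsets_ms = [0, 412, 655, 900, 1310]

def syllable_durations(onsets):
    # duration of each syllable = distance between consecutive vowel onsets,
    # so a consonant lengthened before a vowel counts with the preceding one
    return [b - a for a, b in zip(onsets, onsets[1:])]

print(syllable_durations(vowel_onsets_ms))  # [412, 243, 245, 410]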
Results
Figure 3 shows the mean overall sound level. On average, the agitated examples were louder than the peaceful examples, particularly in the expressive versions. The peaceful examples were softer in the expressive than in the neutral versions. Thus, the singer tended to enhance the loudness difference between the two example categories in the expressive versions. Figure 4 shows the short-term variability of sound level in terms of the mean of the time derivative, measured after a 20-Hz LP filter smoothing. The agitated examples showed a higher variability than the peaceful examples, particularly for the expressive versions. Figure 5 shows the tempo, measured as the mean number of shortest note values per second. The agitated examples were sung at a faster tempo than the peaceful examples, which, in turn, were sung slower in the expressive than in the neutral version. Also in this case the singer increased the difference between the two example categories in the expressive versions.
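The three global measures just reported can be summarized in a short Python sketch. The input assumptions are mine: a calibrated sound-level track in dB sampled at fs_env, with the mean absolute derivative standing in for the short-term variability measure; only the 20-Hz smoothing is stated in the text.

# Sketch of the global measures behind figures 3-5. Assumes `level_db` is a
# calibrated sound-level track in dB sampled at `fs_env` Hz.
import numpy as np
from scipy.signal import butter, filtfilt

def performance_measures(level_db, fs_env, n_shortest_notes, duration_s):
    leq = level_db.mean()                        # mean overall sound level
    b, a = butter(4, 20.0 / (fs_env / 2))        # 20-Hz low-pass smoothing
    smoothed = filtfilt(b, a, level_db)
    # short-term variability: mean absolute time derivative, in dB/s
    variability = np.mean(np.abs(np.diff(smoothed))) * fs_env
    tempo = n_shortest_notes / duration_s        # shortest note values per s
    return leq, variability, tempo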
Fig. 3. Mean overall sound level, Leq, in the excerpts examined. Gray and white columns refer to neutral and expressive versions. Results for examples with an agitated and a peaceful emotional character. The bars in the two rightmost columns represent ±1 standard deviation.
Fig. 4. Short-term variability of loudness. The columns represent the mean of the time derivative of the overall sound level, measured after a 20-Hz LP filter smoothing. Gray and white columns refer to neutral and expressive versions. Results for examples with an agitated and a peaceful emotional character. The bars in the two rightmost columns represent ±1 standard deviation.
Fig. 5. Mean tempo, measured as the mean number of shortest note values per second. Gray and white columns refer to neutral and expressive versions. Results for examples with an agitated and a peaceful emotional character. The bars in the two rightmost columns represent ±1 standard deviation.
Fig. 6. a LTAS of the neutral and expressive performances of an example (dashed and solid curves). The dashed line represents the mean F0. The LTAS levels at the mean of F0 (L0) and at the highest peak (L1) are shown for the expressive version. b Dominance of the fundamental, estimated as the level difference LTAS L0–L1, for the neutral and expressive versions of the agitated and peaceful examples (filled and open symbols).
The level difference between the fundamental and the first formant changes with glottal adduction and thus reflects an aspect of phonation [Sundberg, 1987]. This aspect was studied from the LTAS of each excerpt. The highest peak in an LTAS of vocal material roughly corresponds to the level at the mean of F1 (fig. 6a). From such spectra the level was determined at the frequency corresponding to the average F0 of the example. This level, henceforth LTAS L0, and the level of the main peak of the LTAS, L1, are marked for the expressive version in the example shown in the figure. As illustrated in figure 6b, the difference LTAS L0–L1 tended to be greater in the peaceful examples, particularly in the expressive versions. In other words, the fundamental was mostly more dominant in the expressive versions of the peaceful examples. Thus, the singer apparently performed the peaceful examples with less glottal adduction than the agitated examples.
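The LTAS measure can be made explicit with a short sketch. The spectral estimator and its settings below are my own choices; the study's analysis settings are not specified here.

# Sketch of the LTAS L0 - L1 measure: level at the mean F0 relative to the
# level of the strongest LTAS peak. Assumes a mono signal `x` at rate `fs`.
import numpy as np
from scipy.signal import welch

def ltas_l0_minus_l1(x, fs, mean_f0):
    f, pxx = welch(x, fs, nperseg=4096)            # long-term average spectrum
    ltas_db = 10.0 * np.log10(pxx)
    l0 = ltas_db[np.argmin(np.abs(f - mean_f0))]   # level at the mean F0
    l1 = ltas_db.max()                             # level of the highest peak
    # a larger (less negative) value indicates a more dominant fundamental,
    # i.e. less glottal adduction
    return l0 - l1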
Fig. 7. Deviations from nominal duration in the expressive and neutral versions for all tones in all excerpts. The dotted line represents the case where the deviations are identical in both versions. The solid line represents the trendline; its equation and correlation are shown in the upper right corner.
According to Fónagy [1976], articulatory movements play a prominent role in the emotional coloring of speech. Formant frequency transition time was analyzed for some agitated and peaceful examples. Because of the comparatively high F0, measurement was difficult in many cases. In the cases where reliable data were available, surprisingly small differences were found. However, a more detailed analysis is required before any conclusions can be drawn. The score specifies durational relations between the various tones in nominal terms rather than as they are realized in a performance. Hence, a comparison between nominal and performed durations is interesting. Figure 7 compares normalized deviations from nominal durations in the expressive and neutral versions of all excerpts in terms of the mean lengthening per shortest note value. Had the tones in the neutral versions not deviated from nominal duration, all data points would have clustered around the vertical axis in the diagram. Instead, they are scattered between –200 and +300 ms. This indicates that the tone durations deviated considerably from nominal in the neutral versions. The solid line represents the best linear fit. Had the tones in the expressive versions departed from nominal durations as much as in the neutral versions, the trend line would have fallen upon the diagonal. Instead it shows a slope of 1.27. This indicates that the singer made similar departures from nominal duration in the expressive versions as in the neutral versions, but that the departures were greater in the expressive versions. Thus, some of the deviations from nominal durations that the singer used in the expressive versions carried over also to his neutral versions; similar observations have been made in performance of instrumental music [Palmer, 1989]. Music structure is hierarchical; small groups of tones, musical gestures, join to form greater groups of tones, subphrases, which join to make still greater groups of tones, phrases, etc. This hierarchy is reflected in performance; typically, musical gestures are terminated with a micropause, and subphrases and phrases are initiated by an accelerando from a slow tempo and terminated with a rallentando (a slowing down of the tempo). These characteristics have been implemented in a model illustrated in figure 8a [Friberg and Sundberg, 1999].
Fig. 8. a Model for the tempo curve, expressed as the ratio between performed and nominal duration (DR), for subphrases and phrases implemented in the Director Musices program [from Friberg and Sundberg, 1999]. b Performed-to-nominal duration ratios in the singer’s neutral and expressive versions of the Mendelssohn example. The nominal durations were calculated on the basis of the mean duration of the shortest note value in the entire example.
Figure 8b shows a typical sung example in terms of deviations from the nominal durations, which were calculated on the basis of the mean duration of the shortest note value in the entire example. The example illustrates the observation above that the deviations were similar but often slightly greater in the expressive than in the neutral versions.
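The comparison of figure 7 reduces to fitting a line through paired deviations from nominal duration. A sketch of this fit follows; a slope of 1 would mean identical departures in both versions, while the reported 1.27 means systematically larger departures in the expressive versions.

# Sketch of the duration-deviation comparison behind figure 7.
import numpy as np

def deviation_slope(nominal_ms, neutral_ms, expressive_ms):
    dev_neutral = np.asarray(neutral_ms) - np.asarray(nominal_ms)
    dev_expressive = np.asarray(expressive_ms) - np.asarray(nominal_ms)
    # slope of the best linear fit; 1.0 would mean identical deviations,
    # the value reported above was 1.27
    slope, _intercept = np.polyfit(dev_neutral, dev_expressive, 1)
    return slope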
Fig. 9. a Performed-to-nominal duration ratios for the syllables of the Zueignung example. The gray and white columns refer to the neutral and expressive versions. b Phoneme duration ratio between the expressive and neutral versions for a section of the Zueignung example. Vowels are represented by white columns.
Vocal performers often seem to emphasize words that they perceive as particularly important for semantic reasons. Emphasized words in the material were identified by an informal listening test. The singer seemed to use different methods for marking emphasis. One method was to lengthen the stressed syllable of the emphasized word. The performed/nominal duration ratios were determined for all syllables in the neutral and expressive versions of all excerpts. These ratios were then compared between the neutral and expressive versions for each syllable. In 34 cases the performed/nominal duration ratio for a syllable was more than 20% greater in the expressive than in the neutral version. Of these, 16 lengthenings occurred for syllables that appeared in a stressed position in the bar. Thus, in these cases the singer lengthened the stressed syllable in emphasized words. Another method to emphasize words was observed almost as frequently. In 18 of the 34 cases just mentioned the lengthenings occurred on syllables that appeared on the upbeat of the emphasized syllable of the word.
Fig. 10. a F0 curves for the neutral and expressive versions of the Zueignung example. b F0 curves for the neutral and expressive versions of the Mein Vater example.
Thus, in these cases the lengthening occurred in an unstressed position in the bar, i.e. on the syllable preceding the stressed syllable of the emphasized word. As a result, the stressed syllable of the emphasized word was somewhat delayed. This phenomenon might be called emphasis by delayed arrival. Figure 9a shows a typical example. Here, the word ‘Bösen’ (evil) was perceived as emphasized, which seems logical from a semantic point of view. Although appearing in an unstressed upbeat position in the bar, the syllable (d)‘ie B’(ösen) was clearly lengthened, while the syllable (B)‘ös’(en) was slightly shortened in the expressive version. Figure 9b shows the phoneme duration ratio between the expressive and neutral versions for this section of the text. It can be seen that the lengthening concerned the consonant [b] rather than the vowel preceding it. Several similar examples were observed in the material.
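The screening for emphasized syllables described above can be written down directly; the data layout in this sketch is an assumption, and the example values are invented.

# Flag syllables whose performed/nominal duration ratio is more than 20%
# greater in the expressive than in the neutral version, as described above.
def emphasized_syllables(syllables, factor=1.20):
    """`syllables`: dict mapping a syllable label to durations in ms, e.g.
    {'ie B': {'nominal': 250, 'neutral': 260, 'expressive': 340}} (invented)."""
    flagged = []
    for label, d in syllables.items():
        ratio_neutral = d['neutral'] / d['nominal']
        ratio_expressive = d['expressive'] / d['nominal']
        if ratio_expressive > factor * ratio_neutral:
            flagged.append(label)
    return flagged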
Other events that seemed typically associated with perceived emphasis consisted of specific pitch patterns. Figure 10a shows an example. In the neutral version of this peaceful excerpt, F0 changed quickly between the tones while in the expressive version, long and wide ascending pitch glides occurred on ‘Und’ and in ‘(b)eschw(orst)’. Such pitch glides did not characterize all expressive versions. In the agitated example shown in figure 10b, pitch glides can be seen in the neutral rather than in the expressive version on the phrase-initial words ‘Mein’ and ‘und’. In the expressive version the pitch curve changed more abruptly. In addition, the figure illustrates that the extent of the vibrato modulation in agitated examples was much greater in the expressive version. This is in agreement with the observation that the extent and rate of the vibrato are important in signaling the emotion of fear in music performances [Gabrielsson and Juslin, 1996].
Fig. 11. a Syllable duration differences between the neutral and emphasized cases for one of the spoken sentences. Black, gray and white columns refer to the versions where the words ‘tisdag’, ‘ska’ and ‘städa’, respectively, were emphasized. b Phoneme duration differences between the neutral and emphasized cases for one of the sentences. Black, gray and white columns refer to the versions where the words ‘tisdag’, ‘ska’ and ‘städa’, respectively, were emphasized.
Fig. 12. Audio and F0 curves for the spoken words ‘men vad’ pronounced in a neutral way (left) and with emphasis on the word ‘vad’ (right).
Experiment II
One method that the singer used to emphasize words was to lengthen the stressed syllable of the emphasized word. Fant et al. [1999] recently found a strong correlation between perceived stress and vowel duration in speech. Thus, in these cases the singer used the same code for emphasis that can be used in speech. Several examples were also found of emphasis by delayed arrival, i.e. lengthening the unstressed tone preceding the stressed syllable of an emphasized word. To find out if this principle is also applied in speech, it is necessary to measure the duration of syllables defined in the same way as in singing, i.e. from vowel onset to vowel onset. This was realized by a simple experiment. A female actor and highly experienced voice coach was asked to read a set of short Swedish sentences, first six times in a neutral way, and then three times emphasizing one of the different words in the sentence. Vowel onset-to-vowel onset syllable duration was measured, and means were calculated for each condition. The syllable durations observed for the neutral version were used as a reference, such that lengthenings and shortenings were calculated relative to the neutral version.
Results
Figure 11a shows the differences between the neutral and emphasized cases for one of the sentences. Not only the emphasized syllables, but also the syllables preceding them were clearly lengthened. For instance, when ‘tisdag’ (Tuesday) was emphasized, not only ‘isd’ but also the preceding syllable ‘å t’ showed increased duration.
A segment duration analysis of the same sentence showed that consonants preceding emphasized vowels tended to be lengthened (fig. 11b). For example, the consonant [t] was clearly lengthened in the version emphasizing the word ‘tisdag’. Similarly, when the word ‘städa’ (tidy) was emphasized, the initial consonants [s] and [t] were lengthened. Heldner [1996] has made similar observations. Thus, also in this case, we find a similarity of emphasis markers in speech and singing. The same spoken material was analyzed also with respect to F0 contours. Figure 12 compares a neutral version of the sentence ‘Men vad (betyder det?)’ (But what does it mean?) with the version where ‘vad’ (what) was emphasized: in the neutral version the duration of the consonant [v] was short and the pitch contour was part of an overall gently descending glide. In the emphasized version it was produced with a marked pitch gesture. This is similar to the pitch contour observed in the same consonant in ‘Und beschworst’. This shows that at least part of the emphasis markers used for voiced consonants in singing can be found also in speech.
General Discussion and Conclusions
Above, some examples of emotive transforms in singing have been discussed. Although synthesis experiments are certainly needed to test the generality of the observations made, the results still invite some speculation. This study relies heavily on the professional competence of the singer and the actor. Their expertise is to detect the emotional character of a text and to realize it in a way that is understandable to listeners. Interestingly, composers often leave most of the emotional interpretation to the performer. In many of the examples considered here, the composer’s instruction is limited to hints for tempo, e.g. ‘Rather slow’. Therefore, the performer’s ability to sense the emotional character of the text and the music, as well as to realize it in an intelligible way, is crucial. The striking similarities between the expressive means used in singing and in speech suggest that most listeners correctly perceive the emotive transforms. This must limit the variability of sung performances. For example, listeners are likely to react negatively if a singer performs a peaceful song in an agitated way. In this study, most observations have concerned examples of the differentiation principle, while the only example of the grouping principle mentioned was phrasing. The underlying assumption was that much of the essence of emotive transforms seemed likely to occur in the differentiation of tone categories. It seems that the marking of the hierarchical structure is in a sense a more basic aspect of music performance than the emotional coloring. For example, the singer marked the phrases almost as clearly in the neutral as in the expressive versions. Yet, a performer’s emotional coloring of a song can be expected also to affect the marking of tone groups. Thus, the sound level changes reflecting the harmonic progressions were often greater in the expressive than in the neutral versions. The code used by our singer subject to color his expressive versions emotionally is largely similar to that used in speech. The singer sang the agitated examples louder, with greater amplitude variation and at a faster tempo than the peaceful examples. These three acoustic characteristics have been found typical of the expression of activity during speech [Scherer, 1995]. Likewise, similarities were found with regard to how emphasis was signaled in singing and speech. For example, the singer lengthened
stressed syllables in emphasized words. As mentioned, this code for signaling emphasis has been found also in speech [Fant et al., 1999]. Furthermore, we found examples of emphasis by delayed arrival both in the singer’s expressive versions and in the actor’s speech. The principle of delayed arrival implies lengthening of an unstressed syllable preceding a stressed syllable. The lengthening was found to concern not only the vowel but also the consonant of the unstressed syllable. If sung syllables were defined as in orthography, such lengthened consonants would belong to the stressed syllable, so the lengthening would occur on the stressed syllable. However, as mentioned before, this definition of syllables does not apply to singing. A particularly interesting case is when the unstressed syllable preceding the stressed one contains only a vowel. (John Kingston is gratefully acknowledged for pointing out the relevance of this case.) Only one such case occurred in the entire material, ‘die Augen...’ in the example Es schweben. Also in this case the syllable ‘ie’ was lengthened. Although more such examples should be analyzed, our results clearly suggest that delayed arrival is a useful emphasis marker in some sung contexts.
The level of the fundamental tended to be higher in the peaceful than in the agitated examples. This suggests that the singer varied the voice source depending on the emotional character of the song, using more glottal adduction in the agitated than in the peaceful examples. It is tempting to speculate about the reason for this. In speech, a high degree of glottal adduction is typically used for loud phonation at high pitches, such as in shouting or screaming, while a low degree of adduction is common in soft voice. Hence, an increase of glottal adduction in agitated examples and a decrease in peaceful examples are likely to contribute to the emotional expressivity. A similar reasoning seems applicable to the differences in tempo, sound level, and sound level variability between the two types of examples.
What are the emotive transforms? As demonstrated by Bresin and Friberg [1998], essential contributions to emotional expressivity in piano performances seem to derive from the application of the two performance principles of grouping and differentiation. This indicates that the marking of the hierarchical structure and the enhancing of the differences between tone categories contribute to emotional expressivity. As might be expected, examples of the grouping and differentiation principles were also found in the sung performances analyzed here. Moreover, the effects were mostly slightly greater in the expressive than in the neutral performances. This observation has also been made for performances of instrumental music [Palmer, 1989]. Thus, a clearer differentiation of tone categories and a clearer marking of structure seem to be components of the emotive transforms.
The singer seemed to apply the differentiation principle also with respect to the emotional coloring of the performance. For example, he sang most of the agitated examples louder than the peaceful ones, and he increased this difference in the expressive versions by singing them louder than the neutral versions. Similarly, the sound level variability was on average greater in the agitated than in the peaceful examples, and he enhanced this difference in the expressive versions.
Thus a clearer marking of the acoustic code used for emotional expressivity seems to be a component of the emotive transforms. The emphasizing of semantically important words seemed to be typical of the expressive versions of the sung examples. This can also be seen as a case of differentiation, although based upon semantics rather than musical structure. Thus, the principles of grouping and differentiation seem highly relevant to emotive expressivity.
Both can be found in speech [Lindblom, 1979; Diehl, 1991] and also in other forms of communication. Indeed, they may be quite basic to human communication in general [Carlson et al., 1989]. Fónagy [1962, 1976] launched the idea of glottal and articulatory movement as a leading principle underlying expressivity in speech. We found examples of variation in glottal parameters, such as pitch and the dominance of the fundamental. Thus, our results support the assumption that expressive transforms are closely related to glottal factors.
Two questions were asked in the introduction: How can music be emotionally expressive, and why is music so widely appreciated? Our observations have shown that the principles of grouping and differentiation seem instrumental in producing emotive transforms in singing. They seem equally relevant to speech. In this sense, music is not special. Moreover, we have found many examples of identity or similarity in the code used in music and speech for the purposes of grouping, differentiation, and emotional coloring. This should make music understandable and interpretable in emotional terms for almost anyone who understands speech.
References
Bresin, R.; Friberg, A.: Emotional expression in music performance: synthesis and decoding. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 4, pp. 85–94 (1998).
Carlson, R.; Friberg, A.; Frydén, L.; Granström, B.; Sundberg, J.: Speech and music performance: parallels and contrasts. Contemp. Music Rev. 4: 389–402 (1989).
Diehl, R.: The role of phonetics within the study of language. Phonetica 48: 120–134 (1991).
Fant, G.; Kruckenberg, A.; Liljencrants, J.: Prominence correlates in Swedish prosody. Int. Congr. Phonet. Sci. 99, San Francisco 1999, vol. 3, pp. 1749–1752.
Fónagy, I.: Mimik auf glottaler Ebene. Phonetica 8: 209–219 (1962).
Fónagy, I.: La mimique buccale. Phonetica 33: 31–44 (1976).
Friberg, A.: Generative rules for music performance: a formal description of a rule system. Computer Music J. 15: 56–71 (1991).
Friberg, A.: A quantitative rule system for musical performance; doct. diss. KTH, Stockholm (1995).
Friberg, A.; Sundberg, J.: Does music performance allude to locomotion? A model of final ritardandi derived from measurements of stopping runners. J. acoust. Soc. Am. 105: 1469–1484 (1999).
Gabrielsson, A.: Expressive intention and performance; in Steinberg, Music and the mind machine, pp. 35–47 (Springer, Berlin 1995).
Gabrielsson, A.; Juslin, P.: Emotional expression in music performance: between the performer’s intention and the listener’s experience. Psychol. Music 24: 68–91 (1996).
Heldner, M.: Phonetic correlates of focus accents in Swedish. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2, pp. 33–36 (1996).
Juslin, P.: Emotional communication in music performance: a functionalist perspective and some data. Music Percept. 14: 383–418 (1997).
Kotlyar, G.M.; Morosov, V.P.: Acoustical correlates of the emotional content of vocalized speech. Sov. Physics-Acoustics 22: 208–211 (1976).
Lindblom, B.: Final lengthening in speech and music; in Gårding, Bruce, Bannert, Nordic prosody. Travaux de l’Institut de Linguistique de Lund, No. 13, pp. 85–101 (1979).
Palmer, C.: Mapping musical thought to music performance. J. exp. Psychol. 15: 331–346 (1989).
Scherer, K.: Expression of emotion in voice and music. J. Voice 9: 235–248 (1995).
Sundberg, J.: The Science of the Singing Voice (Northern Illinois University Press, De Kalb 1987).
Sundberg, J.: Synthesis of singing by rule; in Mathews, Pierce, Current directions in computer music research. System Development Foundation Benchmark Series, pp. 45–55 and 401–403, with sound examples on CD-ROM (MIT Press, Cambridge 1989).
Sundberg, J.: How can music be expressive? Speech Commun. 13: 239–253 (1993).
Sundberg, J.; Iwarsson, J.; Hagegård, H.: A singer’s expression of emotions in sung performance; in Fujimura, Hirano, Vocal fold physiology: voice quality and control, pp. 217–232 (Singular Publishing Group, San Diego 1995).
Acoustic Patterning of Speech Its Linguistic and Physiological Bases Phonetica 2000;57:113–127
Received: October 29, 1999 Accepted: February 9, 2000
The Source-Filter Frame of Prominence Gunnar Fant Anita Kruckenberg Johan Liljencrants Department of Speech, Music and Hearing, The Royal Institute of Technology, KTH, Stockholm, Sweden
Abstract
One aim of our study is to discuss some of the relations of prosodic parameters to the speech production process, e.g. how speech intensity is related to the vocal tract filter, the voice source and the underlying aerodynamics. A specific problem of phonetic interest is the role of subglottal pressure and fundamental frequency as intensity determinants, and their covariation in speech. Our speech analysis displays, incorporating perceptually scaled syllable prominence, are suitable for multilevel studies of speech parameters. A new intensity parameter, SPLH, related to sonority, is introduced. In combination with the standard sound pressure level it provides information on the voice source spectral slope. In Swedish, long stressed high vowels approach a semi-closed target, and thus a sonority minimum, which suggests a motor component in prominence perception. Copyright © 2000 S. Karger AG, Basel
Introduction
In a primitive view, prosody is a set of suprasegmental categories carried by fundamental frequency (F0), duration and intensity. However, it is apparent that one of the prosodic categories, the perceived relative prominence of syllables, words and parts of speech, is carried by an extended set of correlates spanning the entire production process, source as well as filter functions. Relative prominence has a basic role in Lindblom’s [1990] concept of hyper- versus hypospeech. A well-recognised feature of hypospeech is that a sufficient articulatory reduction will transform a stop consonant into an approximant. Similarly, a nasal consonant may degenerate into a nasalised vowel. In such cases, the consonant-vowel contrast is reduced, as illustrated in figure 1. Here the various degrees of reduction pertain to different speakers and reflect their individual modes of articulatory distinctiveness [Fant et al., 1991].
The present study has a wider scope. It has developed from analysis of prose reading of Swedish texts [Fant and Kruckenberg, 1989, 1994, 2000; Fant et al., 1999, 2000]. We have also performed studies of reading of metrically structured Swedish poetry [Kruckenberg and Fant, 1993].
[email protected] a
b
Fig. 1. a Three speakers varying in degree of velar /g/ closure. b Two speakers differing in degree of
alveolar closure in /n/. From Fant et al. [1986, 1991].
In our experience, prominence in Swedish may be related to the following set of features and measures: (1) duration of syllables and vowels; (2) local F0 accentuation patterns; (3) the overall intensity of the speech wave (SPL, in dB); (4) the same with a high-frequency pre-emphasis, SPLH; (5) measures pertaining to the spectrum of the voice source, and (6) spectral modifications of consonants and vowels. These are listed in an approximate order of relevance. Duration and F0 cues have been extensively treated in the references above. We shall concentrate on the remaining speech production-related categories.
Articulation and phonation have a conceptual similarity to filter and source, but these dichotomies are not identical. When discussing prominence in terms of source and filter, we have to observe that a single articulatory gesture may affect both the filter and the source [Fant, 1997].
Source and filter functions determine the overall intensity, spectral shape and spectral tilt of a vowel. These may be discussed in terms of auditory attributes such as loudness and sonority. Loudness is a psycho-acoustic term referring to the overall perceived strength of a sound. Sonority is partially synonymous with loudness but implies a resonant quality. One attempt to relate to these terms has been to introduce, in addition to the regular SPL, a high-frequency pre-emphasised intensity measure, the SPLH. The SPLH is qualitatively more closely related to the concepts of loudness and sonority than is SPL. The difference between SPLH and SPL, when corrected for the influence of the specific formant pattern, provides a measure of the voice source spectral tilt [Fant, 1997]. The relevance of the source spectral tilt as a correlate of prominence has been treated in detail by Sluijter and van Heuven [1996]. Earlier estimates of spectral tilt and prominence were reported by Fant and Kruckenberg [1995, 1996].
Within a specified context, an increase of prominence is usually associated with an increase of both intensity and loudness. The high-frequency spectral boost associated with increasing voice effort was first demonstrated by Fant [1959; see also Fant, 1980, 1995, 1997]. These data showed that, averaged over several vowels, a 10-dB increase of the level of F1 was accompanied by about a 16-dB increase in the region of F3 and only 4 dB in the level of the amplitude of the voice fundamental. As will be discussed, variations in relative prominence also affect articulation, with consequences not only for the filter function but also for the voice source and noise sources. We shall expand on some of these issues and also discuss the covariation of subglottal pressure, F0 and speech intensity within an utterance and their relations to prominence.
Experimental Techniques
Prominence Scaling
A novelty that we have introduced in our speech analysis work is a measure of perceived syllable or word prominence. It is based on a continuous interval scale labelled Rs, running from 0 to 30. It was first introduced by Fant and Kruckenberg [1989], who showed that word prominence assessments closely follow those of the syllable carrying maximum stress in lexical pronunciation. Applications to stress correlates were discussed by Fant and Kruckenberg [1994]. The Rs scale has been applied extensively in our recent integrated speech analysis studies of Swedish prose reading and in related intonation modelling [Fant et al., 1999, 2000]. It has also influenced synthesis work [Portele and Heuft, 1997]. An approximate mapping of Rs measures to phonological categories would place unstressed syllables in the region of Rs = 10, stressed syllables around Rs = 20 and focally accented syllables around Rs = 25. These are typical values only, subject to overlap. Results from a test where 15 listeners graded 223 syllables in a paragraph of prose reading gave the following results at the word level [Fant et al., 1999]: content words scored an average of Rs = 19 and function words Rs = 11. Nouns received a score of Rs = 20, adjectives 18, verbs and adverbs 17. Pronouns scored 12.5, prepositions 11, auxiliary verbs 10.5 and remaining function words Rs = 9.5. Some variability was encountered by occasional de-emphasis of content words and emphasis of function words. In any case, average values of content and function words differ substantially.
Data Display
The speech material originated from a session [Fant et al., 1997a, b] in which simultaneous measures of true subglottal and supraglottal pressures had been recorded.
The speaker, S.H., was a medical doctor specialising in voice research. He has a good voice and a standard Swedish pronunciation.
Fig. 2. Standard speech analysis display: prominence Rs, oscillogram, spectrogram, subglottal pressure (Psub) and supraglottal pressure (Psup), F0, SPLH and SPL.
Fig. 3. Prominence Rs as a function of acoustic parameters sampled from vowels [a] in sentence-initial positions. Lower right: Rs as a function of Rs predicted from joint data on duration and SPLH–SPL.
Results from prominence ratings of the read Swedish text have been added to a display of oscillogram, spectrogram, subglottal pressure, supraglottal pressure, F0 and two intensity traces. An example is shown in figure 2. Our standard text, a 1-min prose reading, covers 20 such assemblies, which provides us with a useful database for prosodic and segmental investigations.
One of the intensity traces is the SPL in dB with flat weighting. The other is the SPLH, which differs from the SPL by the introduction of our standard pre-emphasis
(1) G(f) = 10 log10{(1 + f^2/200^2)/(1 + f^2/5,000^2)} dB,
which has a gain of 3 dB at 200 Hz, 14 dB at 1,000 Hz and 25 dB at 5,000 Hz. SPLH is more sensitive than SPL to variations in the region of the second and third formants, F2 and F3, and may accordingly provide a better correlate of the concept of sonority. Moreover, the difference SPLH–SPL brings out the spectrum level of formants above F1, which is in part related to the source and in part to the filter function, i.e. the formant pattern. At constant articulation, variations in the SPLH–SPL measure accordingly bring out variations in the high-frequency content of the source, which in turn is related to the concept of spectral tilt [Sluijter and van Heuven, 1996; Campbell and Beckman, 1997].
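The pre-emphasis gain of equation (1) is easy to verify numerically. The following minimal Python sketch (our own illustration, not part of the original analysis software) reproduces the quoted gains at 200, 1,000 and 5,000 Hz:

```python
import math

def splh_preemphasis_gain(f_hz):
    """Pre-emphasis gain G(f) of equation (1), in dB."""
    return 10 * math.log10((1 + (f_hz / 200.0) ** 2) /
                           (1 + (f_hz / 5000.0) ** 2))

for f in (200, 1000, 5000):
    print(f, round(splh_preemphasis_gain(f), 1))  # -> 3.0, 14.0, 25.0 dB
```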
Linear Regression Analysis
A major task has been to establish quantitative relations between Rs values and physical parameters, and to assess parameter covariation. Because of the complex contextual variability within a sentence we have restricted most of the statistical analysis to simple linear regression of data selected from specified contexts. Figure 3, from Fant et al. [1999, 2000], shows Rs of a set of eleven [a] vowels, selected from breath group-initial positions, as a function of acoustic parameters. Correlation coefficients from linear regression analysis gave r = 0.89 for duration, r = 0.84 for Psub, r = 0.79 for SPL, r = 0.91 for SPLH and r = 0.93 for (SPLH–SPL). A prediction of Rs from duration and (SPLH–SPL) gave r = 0.95. The gain from combining two strong predictors is not very great.
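As an illustration of this two-predictor regression, the sketch below fits Rs from vowel duration and SPLH–SPL with ordinary least squares. The numbers are invented for demonstration; they are not the data behind figure 3:

```python
import numpy as np

# Hypothetical (invented) per-vowel measurements: duration in ms,
# SPLH-SPL in dB, and rated prominence Rs on the 0-30 scale.
dur = np.array([60.0, 80.0, 95.0, 110.0, 130.0, 150.0,
                165.0, 180.0, 200.0, 215.0, 230.0])
tilt = np.array([0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9, 3.3, 3.7, 4.1, 4.5])
rs = np.array([8, 10, 12, 13, 16, 17, 20, 21, 24, 26, 28.0])

X = np.column_stack([dur, tilt, np.ones_like(dur)])  # intercept column
coef, *_ = np.linalg.lstsq(X, rs, rcond=None)        # least-squares fit
rs_pred = X @ coef
r = np.corrcoef(rs, rs_pred)[0, 1]                   # multiple correlation
print("coefficients:", coef, " r =", round(r, 2))
```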
Functional Analysis
Perceived prominence may be related to three levels of analysis: (1) the speech wave in terms of relative intensity and spectrum patterns; (2) a decomposition in terms of source and vocal tract filter characteristics, and (3) the underlying articulatory and phonatory processes. An insight into speech production theory promotes an understanding of the rather complex relations within and across the separate levels. In addition we have to consider contextual factors such as position within an utterance. Observed variations in intensity and source spectrum of vowels may thus be traced back to phonatory as well as articulatory events, with the understanding that an articulatory gesture may affect the phonatory process and thus the voice source. Another aspect of relative prominence may appear in the overall spectral pattern of a syllable rather than in vowel intensity and source patterns. This is primarily a matter of hypo- versus hyperspeech.
Intensity and Source Features
Intensity is systematically context-dependent. An unstressed syllable in the beginning of a breath group often has a greater intensity than a stressed syllable at the end of a breath group. A smaller context dependency is found in the incremental rate of intensity increase with prominence. An increase of Rs by 10 units, in the domain of accented syllables from Rs = 15 to Rs = 25, was found to be associated with 6 dB in SPL, 9 dB in SPLH and thus 3 dB in the SPLH–SPL parameter. The latter is quite sensitive to the particular vowel quality, i.e. the formant pattern, and increases with the degree of articulatory opening. With increasing stress, the gain in the SPLH–SPL parameter is only in part related to the voice source slope. For the [a] vowel, the shift in articulation towards a less reduced formant pattern has a greater effect on the SPLH than has the associated source spectrum tilt. A prominence difference of 10 Rs units was modelled in an /a/ with F1 = 600 Hz, F2 = 1,300 Hz, F3 = 2,500 Hz for the low-prominence case, and F1 = 700 Hz, F2 = 1,200 Hz, F3 = 2,500 Hz for the high-prominence case. The difference in formant pattern contributed 2 dB to the 3-dB difference in SPLH–SPL, and the less steeply falling source slope of the high-prominence case contributed only 1 dB. However, in non-open vowels formant patterns vary less with prominence.
The main benefit of SPLH compared to SPL is that SPLH is closer to a sonority measure. However, it can be argued that, at least in Swedish, there are instances where prominence is associated with a phase of reduced sonority. A long close vowel, when stressed, attains an overshoot gesture towards complete closure in the middle or end of the vowel. Thus the name Maria (fig. 4) will be pronounced [mar`i:ja]. The sonority minimum in the semi-closed phase is more apparent in SPLH than in SPL. The observed pattern conveys an articulatory gesture that could support a motor theory of prominence perception, but we do not go that far and prefer an interpretation with reference to a learned association within the speech code, the auditory pattern providing sufficient cues.
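The formant-pattern contribution to SPLH–SPL described above can be imitated with a toy source-filter computation: an idealised –12 dB/octave source, three formant resonators, and the pre-emphasis of equation (1). The bandwidths and the source slope below are our own assumptions, so the script only illustrates the direction of the effect, not the exact 2-dB figure:

```python
import numpy as np

def formant_gain_db(f, F, B):
    """Magnitude response (dB) of a single formant resonator."""
    num = F ** 2 + (B / 2) ** 2
    den = np.sqrt(((f - F) ** 2 + (B / 2) ** 2) *
                  ((f + F) ** 2 + (B / 2) ** 2))
    return 20 * np.log10(num / den)

def band_level_db(spectrum_db, f):
    """Overall level (dB) of a power-summed spectrum."""
    power = 10 ** (spectrum_db / 10)
    return 10 * np.log10(np.sum(power) * (f[1] - f[0]))

f = np.linspace(50, 5000, 2000)
source = -12 * np.log2(f / 100)           # assumed -12 dB/oct source slope
g = 10 * np.log10((1 + (f / 200) ** 2) /
                  (1 + (f / 5000) ** 2))  # pre-emphasis of equation (1)

for label, pattern in [("low Rs", (600, 1300, 2500)),
                       ("high Rs", (700, 1200, 2500))]:
    spec = source + sum(formant_gain_db(f, F, 80.0) for F in pattern)
    print(label, "SPLH-SPL =",
          round(band_level_db(spec + g, f) - band_level_db(spec, f), 2), "dB")
```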
Fig. 4. Example of how a focal F0 peak overshooting the subject’s mean F0, F0r = 220 Hz, affects the SPL contour. The lower intensity curve is (SPLH–SPL–14). Observe also the [j] element of the name ‘Maria’. From Fant and Kruckenberg [1994, 1996].
In the example above, one part of the sonority minimum is related to the vocal tract filter function. Thus a shift of F1 alone from 300 to 210 Hz will lower the amplitudes of all higher formants by 6 dB. However, an additional component contributing to the sonority minimum is the articulatory effect on the voice source, which attains a lower excitation amplitude and a relative weakening of its high-frequency content [Bickley and Stevens, 1986]. A more detailed survey of source-filter interaction [Fant, 1997] in the frame of the LF model [Fant et al., 1985] reveals the general trend of how the source spectrum is affected by the articulatory pattern. This interaction is not confined to consonants; the articulatory influence is also apparent in a comparison of open and closed vowels of non-extreme articulation.
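The 6-dB figure follows from the standard all-pole result that the spectrum level above a formant varies as the square of that formant’s frequency (our paraphrase of classic source-filter theory):

ΔL = 40 log10(300/210) ≈ 6.2 dB.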
Articulatory Expansion and Reduction
A special category of prominence correlates, related to units larger than a single vowel, e.g. CVC segments and syllables, involves force and manner of articulation,
expansions and reductions as in hyper- versus hypospeech [Lindblom, 1990]. We have already mentioned articulatory reductions of stops into approximants and of nasal consonants into nasalised vowels. Vowel-consonant contrasts are accordingly reduced, as was illustrated in figure 1. Another aspect of articulatory reduction is that voiceless consonants may assimilate the voicing of adjacent vowels. A well-known example is that an unstressed /t/ may turn into a [d]. Assimilation of voicing is also typical of the consonant /h/ in unstressed, especially intervocalic, positions, where it loses its vocal fold articulatory configuration and is reduced to a breathy vowel [Fant, 1993]. In our experience, a further reduction may eliminate the specific /h/ segment, and what is left is a slightly breathy onset of the voice source in the following vowel.
In the boundary region between a voiced and a following unvoiced sound, the voicing is exposed to an anticipatory vocal fold abduction gesture. An example is preocclusion aspiration of unvoiced stops. With high prominence, the vocal fold abduction gesture may start as early as the beginning of the preceding vowel, which causes a gradual increase of the non-vibratory glottal area, i.e. a progressing leakage. As a result the glottal source loses high-frequency harmonic energy. The first and second formant bandwidths are substantially widened [Fant, 1997], and there is a fill-in of aspiration noise. These preocclusion features add to the basic prominence cues of burst duration and intensity and the F1 cut-back governed by a prolonged vocal fold abduction.
Subglottal Pressure
Apart from initial rise and decay phases of the order of 150 ms, the average contour of the subglottal pressure, Psub, within a breath group is a decline from about 5–7 cm H2O to 3.5–4.5 cm H2O, approximately independent of the duration of the breath group [Fant et al., 1997a, b]. The associated F0 declination is of the order of 4–5 semitones, and intensity declines by about 6 dB following the same temporal pattern. Accordingly, correlating Rs with Psub, F0 or SPL over the total set of syllables in a breath group yields very low scores. On the other hand, confining the analysis to a restricted context, e.g. to vowels [a] in phrase-initial position as in figure 3, there is a clear correlation, but it can be argued that it is partially maintained by a clustering of the data into two groups, low and high stress.
A more detailed analysis of the complete corpus reveals a pattern of Psub starting to build up, well in advance of the stressed syllable, to a shallow maximum at the left boundary of the syllable, followed by a decaying contour [Fant et al., 1997a, b, 1999]. The location of the turning point coincides with the P center of rhythmical analysis [Marcus, 1981]. A typical example is shown in figure 5, where the F0 peak in the focally accentuated word [dru:g] is located in an interval of decaying Psub. The same synchrony pattern is apparent in the two words of figure 7 and can also be identified in the first syllable of figure 2. Furthermore, we have found that the rate of Psub decay within stressed syllables of our corpus is positively correlated with prominence, Rs (r = 0.5). There is accordingly some evidence that the subglottal pressure promotes not only focal stress but also has a role at moderate stress levels.
On the other hand, we have evidence that focal prominence can be activated without a raised Psub, typically in breath group-final positions, and is supported by F0 and duration cues. In Swedish the dominant correlate of focal accentuation is the size of the local F0 modulation.
Fig. 5. Illustrating a high-prominence F0 peak located in a phase of decaying subglottal pressure.
Covariation of Psub, F0 and SPL
A specific problem of phonetic interest is how the intensity of a voiced sound is related to subglottal pressure, Psub, and to fundamental frequency, F0. We shall make a detour into production theory and merely point out some relevant aspects of the production mechanism. In terms of a source-filter model, a speech sound is the product of a source scale factor and the spectral properties of the source and the filter. In voiced sounds the scale factor is not the glottal flow amplitude, but the slope of the glottal flow at the closure discontinuity, in other words, the corresponding negative peak of the glottal flow derivative. In the LF model [Fant et al., 1985; Fant and Lin, 1988; Fant, 1993, 1995, 1997] it is labelled Ee. Formant amplitudes are directly proportional to Ee [Fant, 1993, 1997]. The main shape of the glottal flow pulse determines the low-frequency part of the source spectrum, whereas the middle and higher parts increase with the abruptness of the closure.
From our earlier work [Fant et al., 1996, 1997a, b] we have found that Psub increases with F0 up to a mid-frequency, F0r, in the speaker’s voice range and levels off or decays at higher frequencies. On the other hand, in singing [Sundberg et al., 1999], it is found that Psub as well as SPL continue to rise over the entire F0 range. In speech, according to our data, Ee and also SPL increase with F0 in the lower part of the F0 range, F0 < F0r, and at a faster rate than Psub. Here we find that Ee increases in proportion to F0^2 and Psub in proportion to F0^0.7, and inversely that F0 increases in proportion to Psub^1.4. Furthermore, at F0 < F0r, Ee was found to be proportional to F0^1.35 at constant Psub, and to Psub^1.1 at constant F0. Thus the joint contribution of Psub and F0 is an Ee proportionality of Psub^1.1 · F0^1.35.
How do these findings relate to the physics and physiology of voice production? It is known from earlier studies that a perturbation of Psub at constant laryngeal muscle activation causes a small passive increase of F0. According to modelling performed by Titze [1989], this effect is largely confined to low-F0 phonations where the vocal folds are slack and lack stretching. Here the F0 increase is of the order of 4 Hz/cm H2O in Psub. Titze [1989] explains the F0 rise by an increase of the width of the glottal slit at constant length causing an elongation and stretching of the edge contour. However, this passively induced F0 rise is much smaller than what we observe. A decrease of Psub from 6 to 5 cm H2O would, according to the Psub^1.4 proportionality of our data, account for an F0 lowering from 120 to about 95 Hz, a step of 4 semitones, which is the observed order of magnitude within the declination contour of a breath group. It should be kept in mind that F0 and Psub, in spite of covarying trends, basically employ different physiological mechanisms and are free to vary independently. In a low pitch range F0 is controlled mainly by the thyroarytenoid (vocalis) muscle, whereas the cricothyroid takes over the control in a higher pitch range [Titze, 1989].
The analytical relation between SPL and F0 at constant Ee is simple. An increase of F0 by one octave, doubling the number of excitations per time unit, causes a doubling of the intensity, in other words 3 dB. This is but a minor part of the overall SPL increase with F0. However, at F0 > F0r, we find that both Ee and SPL reach a saturation limit or decrease with increasing F0.
Combining the observed trends over the entire F0 range, we have derived the following equations for predicting Ee and SPL from the transglottal pressure Ptr = Psub – Psup, F0 and the speaker’s F0r:
(2) Eep = K + 20 log10{Ptr^1.1 · xn^1.35 · [(1 – xn^2)^2 + xn^2/Q^2]^–0.5} dB,
(3) SPL = K + 20 log10{Ptr^1.1 · xn^1.85 · [(1 – xn^2)^2 + xn^2/Q^2]^–0.5} dB,
where xn = F0/F0r and Q = 1.25.
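A direct transcription of equations (2) and (3) into Python (our own sketch; the calibration constant K is left unspecified here, so only differences between predictions are meaningful):

```python
import math

def predicted_level_db(ptr, f0, f0r, exp_xn, K=0.0, Q=1.25):
    """Equations (2)/(3): exp_xn = 1.35 predicts Ee, 1.85 predicts SPL.
    ptr is the transglottal pressure Ptr = Psub - Psup (cm H2O)."""
    xn = f0 / f0r
    resonance = ((1 - xn ** 2) ** 2 + xn ** 2 / Q ** 2) ** -0.5
    return K + 20 * math.log10(ptr ** 1.1 * xn ** exp_xn * resonance)

# Example: SPL drop predicted for Psub falling from 6 to 5 cm H2O while
# F0 falls from 120 to 95 Hz, for an assumed F0r = 220 Hz (pressure and
# F0 values taken from the text; Psup assumed to be 0):
delta = (predicted_level_db(6, 120, 220, 1.85) -
         predicted_level_db(5, 95, 220, 1.85))
print(round(delta, 1), "dB")  # -> ~6 dB, the order of the quoted declination
```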
Fig. 6. Covariation of glottal vibratory area, the glottal source scale factor (Ee) and its predicted value (Eep), and subglottal pressure (Psub) with F0.
The covariation of Ptr = Psub, Ee and Eep and the glottal area Ag with F0 is shown in figure 6. These data originate from glissando phonations [Fant et al., 1997a, b]. Observe that Ag attains its maximum value at F0 = F0r. A test of the prediction formula (equation 3) within a spoken sentence is illustrated in figure 7. Measured and predicted SPL agree fairly well. Here, the typical feature of a Psub fall in a domain of increasing F0 is apparent. Examples of this negative covariation of Psub and SPL were first considered to be rather puzzling, but they turned out to be consistent, and we now understand that they
Fig. 7. Predictability of SPL from Psub and F0. The lower intensity contour is (SPLH–14). Observe the lag of F0 with respect to Psub in accented syllables.
display a prosodically structured synchrony pattern. As already mentioned and illustrated in figure 5, Psub appears raised at the left boundary of a sufficiently stressed syllable and then decays during the vowel, at the same time that F0 displays the typical rise-fall ‘hat pattern’. Furthermore, in focal accentuation F0 will generally overshoot F0r, which implies a local reduction of SPL in the region of the F0 peak. On occasion (fig. 4) one may also observe the associated local SPL maxima at the instances in time where the F0 contour passes through F0r [Fant and Kruckenberg, 1994, 1996].
Summary and Discussion
Our study has covered specific aspects of the acoustic correlates of perceived prominence, with an attempt to relate these to the voice production mechanism, the voice source and articulation. The novelty of the approach lies in a multilevel analysis incorporating records of subglottal and supraglottal pressure, and a continuously scaled prominence parameter, Rs, in synchrony with speech wave data, including a new measure of speech intensity, SPLH, as a supplement to the sound pressure level, SPL. SPLH differs from SPL by a high-frequency pre-emphasis increasing the relative weight of formants above F1. SPLH is thus more closely related to sonority than is SPL. The difference, SPLH–SPL, carries information on the high-frequency content of the voice source. A requirement in contrastive analysis is that the formant pattern is the same or can be corrected for.
Apart from covariation trends, subglottal pressure, Psub, and F0 vary independently. Although Psub is the main source of energy, its range of variation is rather small but phonetically significant. The joint contributions of Psub and F0 to SPL have been derived analytically. The equation provides a fair predictability of the SPL contour in speech. Apart from the filter function, i.e. the formant pattern, SPL is directly determined by the voice source excitation amplitude Ee, which increases with F0 up to a mid-frequency F0r of the speaker’s F0 range and saturates or decreases at higher F0. This is typically observed in a focal gesture of F0 overshooting F0r. The associated Psub gesture shows a consistent lag with respect to the F0 contour. As recently found from modelling experiments performed by Liljencrants [1999, personal commun.], the driving respiratory activity in speech can be described as a sequence of pressure pulses, which supports our findings concerning Psub maxima at the left boundaries of accented syllables or vowels, i.e. at the P centers of rhythmical analysis [Marcus, 1981], and a following decay in the region of the intonation peak. As a result the peak F0 may coincide with a minimum of SPL.
It is suggested that the common articulatory pattern of increasing emphasis is a target overshoot. Low vowels attain a greater jaw opening and thus a higher F1 and an intensity increase, whereas high vowels attain a more constricted mouth or lip opening, a lower F1 and thus, because of vocal tract filtering, a reduced intensity, especially in F2 and higher formants. A secondary effect of a vocal tract narrowing is a reduction of voice source strength and of its high-frequency content [Bickley and Stevens, 1986]. Swedish long high vowels, when stressed, execute a gesture of vocal tract closure, and thus reduced sonority, in the middle or end of the vowel. This pattern could be taken as evidence for a motor theory of prominence perception. We have taken the more conservative view that it mirrors an established feature of the speech code known to both the speaker and the listener, supporting the auditory percept.
As an example we have noted the prominence transformation of the [i:] in ‘Maria’ to a diphthongal-like [ij] gesture. However, these trends usually combine with an increase of the voicing effort, which in part compensates for the source spectrum change.
Our presentation has had an emphasis on intensity-related parameters, but duration and F0 parameters appear to be of equal or greater importance as prominence correlates. These have been studied in more detail in our earlier work [Fant et al., 1997a, b, 1999, 2000]. Because of the great contextual variability, prominence correlates should be sampled with reference to specified contexts. This is especially important in the analysis of text reading. A systematic frame of variability is the overall declination of subglottal pressure, F0 and intensity within a sentence or a breath group.
The following data summarise the expected change in acoustic parameters, ceteris paribus, accompanying an increase of Rs by 10 units, e.g. from the unstressed level Rs = 10 to the average stressed level Rs = 20, or from Rs = 15, the weakest degree of accentuation, to an apparent focal accentuation, Rs = 25. In this interval syllable duration is increased by 125 ms. In fair agreement with the regression graphs of figure 3, the expected increase of vowel duration is 80 ms, the increase in SPL 6 dB, in SPLH 9 dB and thus in SPLH–SPL 3 dB. The local F0 patterns associated with the Swedish word accents 1 and 2 have modulation depths of the order of 4–6 semitones at Rs = 20.
Among secondary correlates of prominence we have mentioned increased vowel-consonant spectrum and intensity contrast. A related feature is more abrupt intensity shifts at boundaries. Cases of prominence reduction were exemplified in figure 1. Glottalization, i.e. a brief interval of complete or partial vocal fold closure interrupting the voice source at a vowel onset, can have the function of a stress prompter [see the detailed presentation in Dilley and Shattuck-Hufnagel, 1996]. The typical situation is at a voiced juncture between two words. An example is to be seen in figure 2, where there is an apparent intensity minimum in the juncture between the two last words, ‘från Arne’.
Acknowledgements
This work has been financed by grants from the Bank of Sweden Tercentenary Foundation and by support from the Royal Institute of Technology (KTH) and from Telefonaktiebolaget LM Ericsson.
References
Bickley, C.; Stevens, K.N.: Effects of a vocal tract constriction on the glottal source: experimental and modelling studies. J. Phonet. 14: 373–382 (1986).
Campbell, N.; Beckman, M.: Stress, prominence and spectral tilt; in Botinis, Kouroupetroglou, Carayannis, Proc. ESCA Workshop on Intonation: Theory, Models and Applications, Athens 1997, pp. 67–70.
Dilley, L.; Shattuck-Hufnagel, S.: Glottalization of word-initial vowels as a function of prosodic structure. J. Phonet. 24: No. 4 (1996).
Fant, G.: Acoustic analysis and synthesis of speech with applications to Swedish. Ericsson Technics, No. 15/1, pp. 1–106 (1959).
Fant, G.: Voice source dynamics. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2/3, pp. 17–37 (1980).
Fant, G.: Some problems in voice source analysis. Speech Commun. 13: 7–22 (1993).
Fant, G.: The LF-model revisited. Transformations and frequency domain analysis. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2/3, pp. 119–155 (1995).
Fant, G.: The voice source in connected speech. Speech Commun. 22: 125–139 (1997).
Fant, G.; Hertegård, S.; Kruckenberg, A.; Liljencrants, J.: Covariation of subglottal pressure, F0 and glottal parameters. Eurospeech 97: 453–456 (1997a).
Fant, G.; Kruckenberg, A.: Preliminaries to the study of Swedish prose reading and reading style. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2, pp. 1–89 (1989).
Fant, G.; Kruckenberg, A.: Notes on stress and word accent in Swedish. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2/3, pp. 125–144 (1994).
Fant, G.; Kruckenberg, A.: The voice source in prosody. Proc. ICPhS 95, Stockholm 1995, vol. 2, pp. 622–625.
Fant, G.; Kruckenberg, A.: Voice source properties of the speech code. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 4, pp. 45–46 (1996).
Fant, G.; Kruckenberg, A.; Liljencrants, J.: Prominence correlates in Swedish prosody. Int. Congr. Phonet. Sci. 99, San Francisco 1999, pp. 1749–1752.
Fant, G.; Kruckenberg, A.; Liljencrants, J.: Acoustic-phonetic correlates of prosody in Swedish; in Botinis, Intonation (Cambridge University Press, Cambridge 2000).
Fant, G.; Kruckenberg, A.; Nord, L.: Prosodic and segmental speaker variations. Speech Commun. 10: 521–531 (1991).
Fant, G.; Kruckenberg, A.; Hertegård, S.; Liljencrants, J.: Accentuation and subglottal pressure in Swedish; in Botinis, Kouroupetroglou, Carayannis, Proc. ESCA Workshop on Intonation: Theory, Models and Applications, Athens 1997b, pp. 111–114.
Fant, G.; Liljencrants, J.; Lin, Q.: A four-parameter model of glottal flow. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 4, pp. 1–13 (1985).
Fant, G.; Lin, Q.: Frequency domain interpretation and derivation of glottal flow parameters. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2/3, pp. 1–21 (1988).
Fant, G.; Nord, L.; Kruckenberg, A.: Individual variations in text reading: a data-bank pilot study. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 4, pp. 1–17 (1986).
Kruckenberg, A.; Fant, G.: Iambic versus trochaic patterns in poetry reading. Nordic Prosody VI, pp. 123–135 (Almqvist & Wiksell, Stockholm 1993).
Lindblom, B.: Explaining phonetic variation: a sketch of the H&H theory; in Hardcastle, Marchal, Speech production and modelling, pp. 403–439 (Kluwer Academic Publishers, Dordrecht 1990).
Marcus, S.M.: Acoustic determinants of perceptual center (P-center) location. Percept. Psychophys. 30: 247–256 (1981).
Portele, T.; Heuft, B.: Towards a prominence-based synthesis system. Speech Commun. 21: 1–72 (1997).
Sluijter, A.M.C.; van Heuven, V.J.: Spectral balance as an acoustic correlate of linguistic stress. J. acoust. Soc. Am. 100: 2471–2484 (1996).
Sundberg, J.; Andersson, M.; Hultqvist, C.: Effect of subglottal pressure variation on professional baritone singers’ voice sources. J. acoust. Soc. Am. 105: 1965–1971 (1999).
Titze, I.: On the relation between subglottal pressure and fundamental frequency in phonation. J. acoust. Soc. Am. 85: 901–906 (1989).
Acoustic Patterning of Speech Its Linguistic and Physiological Bases Phonetica 2000;57:128–138
Received: January 10, 2000 Accepted: February 23, 2000
The C/D Model and Prosodic Control of Articulatory Behavior Osamu Fujimura Ohio State University, Columbus, Ohio, USA
Abstract
The Converter and Distributor (C/D) model is a generative description of articulatory gesture organization for utterances. Its input comprises feature specifications for syllables, a paraphonologically augmented metrical structure, and system parameters for utterance conditions. A syllable-boundary pulse train is computed as a time function representing the skeletal rhythmic structure of the utterance. Control functions for articulators are computed by superimposing consonantal elemental gestures onto the base function, which includes voicing, vocalic, mandibular, and tonal functions associated with the pulse train. A preliminary analysis of microbeam data for experimental dialogues, using jaw opening as an approximate measure of syllable magnitude, inferred syllable and boundary durations that were consistent with durational characteristics observed in the acoustic waveform. Copyright © 2000 S. Karger AG, Basel
Introduction
In the history of phonetic studies, production and perception of speech constituted two major approaches in parallel. The underlying assumption in virtually all studies [but see Browman and Goldstein, 1992; Levelt et al., 1999] was that the building blocks of speech signals were phonemes: autonomous and integral segments with associated physical properties. Target values were inherent to individual segments. Coarticulation governed signal variation [Lindblom, 1963]. Allophonic variation was understood to be ad hoc and language-dependent and therefore had to be treated in phonology. Larger units such as syllables were formed under well-formedness constraints (i.e. phonotactic constraints) of phonemic strings. On the other hand, psycholinguistic behaviors such as speech errors show that phonemic segments cannot be treated as an unstructured simple linear string [see, among others, Levelt, 1989; Fujimura, 1990; Levelt et al., 1999]. Articulatory data also show that phonemes are not autonomous segmental units [Fujimura, 1990; Sproat and Fujimura, 1993; Krakow, 1999].
[email protected] A basic deviation from the general principle of segment concatenation and coarticulation was proposed by Sven Öhman [1967].1 His consonantal perturbation theory treated consonantal gestures independently of the vocalic progression. The Converter and Distributor (C/D) model (see below) is a fuller development of this fundamental concept. Nonlinear phonology also has abandoned the notion that speech is a linear string of phonemic segments, allowing more or less independent organization of different phonetic control dimensions. Related to the loose linkage among different phonologic variables, i.e. tiers, features are underspecified rather than fully specified segment by segment [Archangeli, 1988; Steriade, 1994; Itô et al., 1995; for a new theory of syllable-based phonology, see Haraguchi, 1999]. The C/D model retains the distinction between phonology and phonetics. However, phonetics is dependent on the language as well as utterance situations. If we accept this new view, phonetics must be more abstract and complex than it used to be thought. However, it can be quantitatively explicit and empirically testable. The general principle of coarticulation remains; but phonetics must account for more than what the segment-based theory can [see Kohler, 1990, for relevant discussion].
The C/D Model
The C/D model is a phonetic theory for representing utterances in realistic conversational speech [Fujimura, 1994]. It is a computational description of articulatory gesture organization using syllables, rather than phonemes, as the concatenative segmental units. Feature specifications in the input representation of the linguistic form are based on phonological contrast, considering the syllable as the contextual domain for redundancy reduction [Fujimura and Lovins, 1978; Fujimura, 1996a]. The Converter, the first component of the model in the sequence of processes for phonetic implementation, evaluates the input specifications for the utterance. A base function is constructed to represent the prosodic organization of the given utterance, enhancing a phonological metrical structure with numeric prominence control of selected metrical nodes and numerically setting system parameters for the given utterance conditions. The base function, as one of its multidimensional aspects, includes vocalic feature specifications for the syllable nucleus, representing the relatively slow articulatory movement from one syllable nucleus to the next, along with mandibular and laryngeal (voicing and tonal) linear progressions, forming discrete phonetic status contours in parallel. Elemental gestures for syllable margins are selected by the Distributor for each syllable component, i.e. p-fix (if any), onset, coda, or s-fix (if any) [Fujimura, 1996a; Fujimura and Williams, 1999]. A multidimensional set of Actuators assembles stored impulse response functions (IRFs), and the IRFs are superimposed upon the base function at the abstract phonetic level of description by Control Function Generators, before the control functions are delivered to the Signal Generator, the last component of the phonetic implementation process [Fujimura, 1996a; Fujimura and Williams, 1999].
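To make the chain of components concrete, here is a hypothetical schematic in Python. All names and signatures are our own illustration of the description above, not Fujimura’s implementation; the essential point is that time-shifted copies of each syllable pulse excite stored IRFs, scaled by the pulse magnitude and superimposed on the base function:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SyllablePulse:
    time: float        # pulse location on the time axis (s)
    magnitude: float   # syllable magnitude from the augmented metrical grid

@dataclass
class ElementalGesture:
    component: str                 # 'p-fix' | 'onset' | 'coda' | 's-fix'
    shift: float                   # inherent time shift of the pocs pulse (s)
    irf: Callable[[float], float]  # stored impulse response function

def control_function(base: Callable[[float], float],
                     pulses: List[SyllablePulse],
                     gestures: List[List[ElementalGesture]]):
    """Superimpose consonantal IRFs, excited by time-shifted copies of
    each syllable pulse, onto the base (vocalic/laryngeal) function."""
    def f(t: float) -> float:
        value = base(t)
        for pulse, gesture_list in zip(pulses, gestures):
            for g in gesture_list:
                value += pulse.magnitude * g.irf(t - (pulse.time + g.shift))
        return value
    return f
```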
¹ A prosody-based description of speech organization was first advocated by Firth [1948, 1957]. A new syllable-based theory was recently published by Levelt et al. [1999].
Fig. 1. Phonetic gesture organization for the utterance ‘That’s correct!’. The syllable pulse (bottom panel, labeled 1, 2, or 3) represents the syllable magnitude by its height. Its time-shifted copies for onset and coda (pocs pulses), respectively, excite elemental consonantal gestures, which are represented by their impulse response functions labeled by Greek letters for manner and capital Roman letters for place.
Prosodic effects are observed not only in the traditional suprasegmental aspects of speech signals, but also in supralaryngeal articulation, in particular vowel quality along with voice quality control. The C/D model blends phonological functional structure with utterance-specific parameter settings, handling a mixture of symbolic and numeric variables within the phonetic implementation process. The base function is exemplified in figure 1 for the tongue body fronting gesture, evoked as the result of the evaluation of the feature specification {front}², considering tautosyllabic feature specifications for the nuclei of the stressed syllables /ðæt.s/ and /rek.t/. The reduced syllable of ‘correct’ is assumed, in this figure, to return the underlying vocalic contour value to the resting position. The thin step function, labeled ‘fronting contour’, represents the underlying phonetic status contour [Fujimura, 1996a] as a syllable-to-syllable flow of the vocalic dorsum gesture. (The time values of the status switching for this contour are assumed to coincide with the abstract edges of the triangles for the syllables.) A more concrete level of representation for the tongue body front-back movement control is given by the thick smooth time function, the result of subjecting the status contour signal to a coarticulatory filtering process [see Fujimura and Erickson, 1997].
² The feature {front} may be considered equivalent to {–back} in this paper.
(The effects of phrase boundaries are not depicted in this figure.) In figure 1, at the top, the phonological structure of each syllable of the sentence is shown according to the C/D model specification scheme. The symbols surrounded by curly brackets represent phonological feature specifications: {θ, v} for {interdental, voice}, {T, τ} for {apical, stop}, {K, τ} for {dorsal, stop}, {ρ} for {rhotic}. The s-fixes in English are always {apical}, and voicing is determined by {voice} (or its lack, as the default for voiceless obstruents) for the tautosyllabic coda; therefore only the manner feature (in this case {τ} for {stop}) is specified. The IRFs for the consonantal gestures involved in this utterance are also shown below the vocalic function. It should be noted that this figure is speculative, provided only for explanatory purposes. The actual microbeam data, of course, are crucial for making these speculations, but none of the curves are directly observable in their exact form because of the nonlinear nature of the signal generator.
Different actuators excite their IRFs by the relevant pocs (p-fix, onset, coda, or s-fix) pulses [see Fujimura, 1996a], the thin vertical bars in the bottom panel of figure 1. Upward arrows arising from the top of each pocs pulse point to the excited impulse response functions as ballistic movement patterns for the consonantal elemental gestures that belong to each syllable component. For example, the voiceless dorsal stop gesture |K, τ| occurs twice in this utterance, once for the onset of the first (reduced) syllable of ‘correct’ and a second time for the coda of the second syllable of the same word. The peak values of the articulatory gestures are different, directly reflecting the syllable pulse magnitude. It should be emphasized that IRFs are generally different in their inherent shape and amplitude, depending on whether the feature specification is for onset or coda. The timing of the occurrence of the peak relative to the pocs pulse, a time-shifted syllable pulse [Fujimura, 1995], is part of the inherent property of the IRF (note the directions of the upward arrows suggesting the IRF excitation process). The use of a particular articulator (for example the tongue tip) may be critical for a consonant in onset but not in coda (for example, replacement of [t] by a glottal stop, no apical contact for [l], etc., depending on the dialect of English). Such ad hoc qualitative differences in manifestation are described as differences of IRFs corresponding to the same feature specification, e.g. {lateral-o} vs. {lateral-c}, rather than as allophonic rules in phonology. Their difference with respect to sensitivity to prosodic conditions is accounted for, at present, by the combination of the difference in IRFs and the effects of nonlinear mapping from the abstract control function to articulatory or acoustic physical signals.
Prosodic Control
At the bottom of figure 1, the temporal succession of triangles computed from the syllable pulse train is shown. The three syllables are identified by numerals according to their sequential occurrence in the utterance. The magnitudes of the thick pulses represent the phonetic strengths of the syllables.
An augmented metrical grid is assumed to be computed from the input specification of the phonological metrical structure, as in Liberman and Prince [1977] (metrical tree), Prince [1983], Hayes [1984] (metrical grid), or Halle and Idsardi [1995] (parentheses representation), phonetically augmented by paralinguistic specifications such as prominence enhancement for contrastive emphasis [Fujimura, 1994; Erickson, 1998]. Such a numeric metrical grid (i.e. the sequence of thick vertical bars in the bottom panel of the figure) represents an ordered
series of pulses for syllables, the pulse height representing the syllable magnitude. The shadow computation, constructing a triangle for each syllable using the same angles but different magnification factors according to the pulse height, yields a series of triangles from left to right [Fujimura et al., 1998]. The contiguity of syllable triangles is broken by occasional gaps, representing boundaries of continuously variable strengths [see Sproat and Fujimura, 1993, for discussion of various boundaries]. The boundary magnitude is assumed to be computable from the augmented metrical structure. The shadow angles for syllables and boundaries remain to be determined according to empirical data for the given utterance conditions (see below). Boundary features may evoke manifestations of certain supralaryngeal gestures as well as laryngeal and pulmonary gestures [see Fujimura and Williams, 1999, for discussion of Japanese geminate obstruents and Spanish rhotics].
The voicing contour as a control function may be assumed to always exhibit, in any language, a single island per syllable [Fujimura, 1996a]. In other words, if there is a voice cessation in the control function of voicing, it must involve a syllable boundary in the abstract phonetic sense. This implies that the abstract phonetic status contour is syllable-based: the voice onset in the beginning part of the syllable and the voice offset in the ending part of the syllable are implemented at time values relative to the onset and coda pulses, according to the phonological feature specifications for onset and coda (or, in some languages, s-fixes or p-fixes), respectively. Voicing can continue across syllable boundaries if no voice cessation occurs in the control function within the contiguous syllables, or if such cessations do not appear in phonetic signals due to coarticulatory undershooting [Lindblom, 1963].
Vocal fold vibration is controlled in different dimensions. Glottal adduction/abduction is one; tension (primarily cricothyroid muscle contraction) is another. Stiffness control (vocalis muscle contraction) can be independent of tension and may be implemented via an IRF corresponding to a consonantal feature specification. F0 is only one of the physical manifestations of the voice quality control, but it is acoustically robust and plays the central role in prosodic perception, along with temporal modulation of phonetic events.
Inferring Syllable-Boundary Pulses from Jaw Movement
Using the C/D model as a prosodic description of an utterance, an empirical test of the validity of the model is underway. Specifically, to test, in part, the concept of the syllable-boundary pulse train, the following hypotheses were adopted as an instantiation of the model to study the rhythmic organization of articulatory patterns of utterances in a fairly natural dialogue between the experimenter [Donna Erickson, see Erickson et al., 1998] and the subject, as recorded by the X-ray microbeam system. For the interpretation of the data, we hypothesized as follows (a schematic implementation is sketched after the list):
(1) Syllables are represented by symmetric triangles of different sizes but with the same angle throughout an utterance, arranged along the time axis (fig. 1).
(2) In each utterance analyzed, there is at least one occurrence of contiguous syllables without a prosodic boundary in between.
(3) The maximum angle causing no overlaps of contiguous syllable triangles is the shadow angle, the tangent of which determines the ratio of the syllable magnitude to an abstract syllable duration for the utterance condition. (4) The remaining gaps between contiguous syllable triangles represent boundaries, their lengths representing the magnitudes (strengths) of the boundaries.
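For concreteness, hypotheses (1)–(4) can be paraphrased in code. The following sketch (in Python) is our illustration only, not the analysis software used in this study; the function name, the input format (pulse times in seconds, heights in arbitrary magnitude units) and the reading of the angle as the apex angle of the triangle are assumptions.

import math

def shadow_angle_and_boundaries(times, heights):
    # Half-base of triangle i is heights[i] * tan(theta), so triangles i and
    # i+1 avoid overlap iff (h_i + h_{i+1}) * tan(theta) <= t_{i+1} - t_i.
    # The shadow angle is the largest theta satisfying this for every
    # adjacent pair, i.e. the boundary-free pair of hypothesis (2) just touch.
    tan_theta = min((times[i + 1] - times[i]) / (heights[i] + heights[i + 1])
                    for i in range(len(times) - 1))
    # Residual gaps between triangle edges are read as boundary magnitudes
    # (hypothesis (4)); the touching pair yields a gap of zero.
    gaps = [(times[i + 1] - times[i]) - (heights[i] + heights[i + 1]) * tan_theta
            for i in range(len(times) - 1)]
    return math.degrees(math.atan(tan_theta)), gaps

# Hypothetical pulse train: the first two syllables are contiguous, so their
# gap comes out as zero; the later, larger gaps are read as boundaries.
angle, gaps = shadow_angle_and_boundaries([0.10, 0.42, 0.95, 1.30],
                                          [8.0, 8.0, 6.0, 9.0])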
Fig. 2. Microbeam pellet movements and inferred syllable triangles. Top panel shows the computed pulse train (boundary pulses not shown) with associated syllable triangles. The gaps between contiguous triangles are interpreted to represent boundaries of varied magnitudes. The bottom panel shows compressed acoustic waveforms. The ordinate in each of the remaining three panels represents the vertical positions for the pellets attached to (from top) lower lip (LLy), mandible (MAN1y), and tongue tip/blade (TTy, about 1 cm behind the tip), in mm with the origin at the occlusal plane. The abscissa is time in seconds. Vertical cursors are drawn at the edges of the syllable triangles, indicating the time domains of syllables as determined from articulatory movement patterns (see text).
Now specifically referring to the articulatory manifestation of the abstract syllable magnitude, we further assume an empirical approximation as follows: (5) Syllable magnitude is manifested as relative jaw opening, provided the syllable nuclei are specified with the same phonological features. This implies that the observed excursion of mandibular motion for the syllable nucleus /a/ in the key items /fajv/, /najn/, /pajn/, in the context of the current experiment, directly reveals, as a first approximation, the prosodic component of the jaw control function (as opposed to the phonologically specified inherent gestures of jaw movement, which are the same in all tokens) [Fujimura et al., 1998]. Data were recorded for 2 male and 2 female native speakers of Midwestern American English. The experimenter conducted a dialogue with the subject, whose articulation was recorded, along with the acoustic speech signal, by the microbeam system at the University of Wisconsin, Madison. The dialogue used ‘five’, ‘nine’ and ‘pine’ as the target words in a street address phrase, but was relatively freely designed to allow
the subject to use varied expressions. S/he was instructed beforehand that the partner of the dialogue, i.e. the experimenter, might need corrections because she was in a noisy communication environment. The address to be used in the response was given to the subject in a monitor display, and the subject’s response always included at least three tokens of ‘five’ or ‘nine’, but not a string of three identical numbers. The street name was always ‘Pine Street’. Figure 2 illustrates an example of the results of our semiautomatic interpretation of the syllable magnitude and timing data. The key words are marked automatically as NINE or FIVE for the crucial articulator of the consonant involved, for the purpose of visually confirming demisyllabic iceberg patterns [Fujimura, 1986, 1996b]. The midpoint between the iceberg threshold crossing points of the crucial articulator for the demisyllable was used as the temporal center of the syllable triangle. The simultaneously recorded acoustic signal is shown at the bottom as a compressed waveform. The top panel of figure 2 represents the inferred syllable magnitude (triangle height) by copying the extent of mandibular lowering. The syllable ‘duration’ was computed according to the hypotheses above. Usually, as seen in this example, we found multiple instances of syllables representing the target words (all with /aj/) approximately contacting each other, while other instances of contiguous target words in the same utterance left a considerable gap in time (indicating boundaries). We checked the acoustic signals to determine whether the magnitude of each intertriangular gap compared reasonably with what we could judge as an acoustic pause, whenever determinable, including voiced or unvoiced consonantal segments. We also listened to the speech signals carefully to compare the perceptual impression of phrasing patterns with the articulatorily inferred boundary strengths. The result was quite reasonable, without exception, for some randomly sampled dialogues by each of the 4 speakers. The phrasing strategy seemed to vary quite considerably from speaker to speaker, however [Fujimura et al., 1998]. A complete data analysis is underway.

Prosodic Effects on Articulation

We would like to emphasize here that, unlike the traditional concept of prosody and suprasegmentals, the so-called segmental articulation, in particular the vowel articulation, varies considerably depending on the prosodic condition of the word, even when the word sequence uttered is exactly the same, i.e. when the segmental context is identical [Fujimura, 1990]. Figure 3 compares the same sentence, ‘It’s six five seven America Street’, uttered in two different prosodic conditions, as observed in one of the earlier microbeam data sets obtained at the University of Wisconsin Microbeam Facility (a simulated dialogue read from a monitor screen by the subject). The utterance in figure 3a had a contrastive emphasis placed on ‘six’, as a response to a question (also read by the subject) which contained the ‘wrong’ digit. In figure 3b, the word ‘America’ was emphasized as a correcting response to the question which cited the ‘wrong’ street name. The curves represent time functions pertaining to the vertical component of the movement of the pellet attached to the tongue blade surface, about 1 cm behind the tip of the tongue. The subject was a young male native speaker of the Midwest American dialect of English.
The upward arrow in each figure points to the lowest position of the tongue blade for the vowel [ɪ] in ‘six’, while the bracket shows the time domain corresponding to the word ‘America’. It is apparent that the high front vowel in ‘six’ is articulated with a considerably lower blade position when the word is emphasized than when it is not. Similarly, the
Fig. 3. Tongue blade height in a pair of utterances, ‘It’s six five seven America Street’, with contrastive emphasis placed on ‘six’ (a) and on ‘America’ (b). One vertical section is about 3.4 mm and one horizontal section about 0.43 s. The upward arrows point to the word ‘six’, and the horizontal brackets refer to the word ‘America’. The three valleys within each bracket correspond to the syllable nuclei [ə], [ɛ], and [ɪ], respectively, the final schwa being indicated only as an inflection point at the right edge of the bracketed time range. The large valley near the left extreme of each panel corresponds to a rest position before the utterance of the sentence. (Data recorded in 1983 at the University of Wisconsin Microbeam Facility.)
word ‘America’, when emphasized, is pronounced with more vigorous blade movement and also an expanded time span. A particularly striking effect of emphasis is observed in the articulation of the word-initial schwa (see the deepest minimum in the bracketed time domain) in ‘America’. The blade is considerably lower for the schwa than for the low vowel in ‘five’ (unemphasized) in the same utterance (fig. 3b), while not only the low vowel in ‘five’ but also the mid front vowel in ‘seven’, both unemphasized, show a lower blade position than the schwa in figure 3a. It should be noted, however, that in figure 3a, for the unemphasized ‘America’, the word/phrase-initial schwa is more distinctly pronounced than the word-final schwa, the latter showing only an inflection point rather than any local minimum; this is true even in figure 3b, even though the absolute vertical position is lower there than in figure 3a. The vertical-position scales in figure 3a and b are calibrated and identical. In figure 3b, a strong but slow action of jaw opening (not shown here but replicated consistently in later experiments [Erickson, pers. commun.]) starts from the beginning of the word, anticipating the succeeding stressed syllable; this is particularly pronounced when the word is emphasized. The mandibular movement is reflected in the deep downward excursion of the tongue blade for the word-initial schwa, indicating a general prominence effect of the unit-initial syntagmatic position [see Keating, 1995]. It is seen in figure 3b that the tongue blade for the mid front vowel, with main stress, is not as low as for the initial schwa in the same emphasized word ‘America’. Presumably, the inherent tongue body gesture for this feature-specified mid front vowel is articulated with an enhanced movement because of the emphasis. Such intricate interactions between the inherent vocalic gestures and the jaw movement gesture [see Maeda, 1992] have also been consistently observed in a recent microbeam study of the effects of emphasis using different vowels [Erickson et al., 1999]. Thus vocalic gestures seem to be hyperarticulated [Lindblom, 1990], enhancing the inherent articulation of the vowel, when the syllable prominence (i.e. syllable magnitude) is increased by emphasis. At the input to the C/D model, the metrical tree node representing the emphasized word receives a numeric enhancement factor, which results in the magnification of the grid height representing the metrical value of the stressed syllable as the head of the word. The increased syllable pulse height of the directly affected syllable causes three main effects in parallel. First, the deviation of mandibular position (opening excursion) from the contextually determined neutral position for the pertinent syllable is increased according to the numeric enhancement. The jaw opening excursion deviates from the natural course of movement according to the computed syllable pulse height, reflecting both the lexical metrical value (i.e. the phonologically determined degree of stress) and the syntagmatically determined phonetic degree of jaw opening (for example, phrase- and word-initial position makes it more open than final position). Second, the enhancement of the inherent vocalic gesture for the specified phonological feature is implemented in the vocalic aspect of the base function.
In the case of ‘America’, the feature specification is {front} for the mid (unmarked) front lax (unmarked)3 vowel of the second syllable corresponding to the head node of the phonological word. The inherent phonetic gesture for this feature specification is tongue
3 The tenseness of a vowel, in our scheme, is specified as the coda glide [Fujimura and Erickson, 1997; Fujimura and Williams, 1999].
body advancement as a phonetic status |front|, and this inherent gesture is enhanced via the syllable pulse enhancement. The third effect of the enhanced syllable magnitude is the proportional increase in amplitude of all tautosyllabic consonantal elemental gestures, as their common excitation pulse is made larger. Such an increase of an abstract gesture (i.e. IRF amplitudes as the passive response to the syllable pulse excitation) is not necessarily directly observable, because of the highly nonlinear nature of the signal generator, the last stage of the model [Fujimura, 1994; Fujimura and Williams, 1999], which typically shows saturation of the movement of the critical articulator of the consonant against the roof of the mouth, including contact of the sides of the tongue blade, and local feedback within muscle spindles, etc. [Fujimura, 1998]. A quantitative evaluation of our interpretation must await a computational simulation of the dynamic, nonlinear behavior of the tongue and other organs under large deformation, along the lines Wilhelms-Tricarico [1995] has shown. The effect is semiquantitatively observable, however, in the form of a durational expansion of the elemental gesture manifestation, both in articulatory movement patterns, as in figures 2 and 3, and in the acoustic segmental durations. The enhancement effect seems more obvious for onset gestures than for coda gestures. This is partly explained by the difference between onset and coda IRFs with respect to the relation between the inherent peak height of the IRF and the effective threshold value of the articulator’s (including the vocal folds’) position for acoustic discontinuities such as stop formation and release, voice onset and offset, etc. [Lehiste, 1970; Umeda, 1977; Crystal and House, 1997]. It is possible, however, that, in the case of the word ‘America’ shown above, the underlying phonological vocalic specification is not exactly unmarked (i.e. no feature specification) for the first syllable. It may be that some vocalic feature specification is present at the output level of (phrasal) phonology (as a reducible low vowel) for the initial syllable of ‘America’, but syllable reduction at the Converter level of the phonetic implementation process suppresses the phonetic manifestation of the underlying features in typical utterance situations, by assigning the reduced phonetic status according to the metrical value of the syllable. Such matters need to be studied systematically with much more data. It is the contention of this paper that the C/D model, unlike traditional phoneme-based segmental and suprasegmental models, is in principle capable of handling such situations [see Leben, in press, for a related discussion]. The goal of this study is to account for the diversity of observable effects of prosodic variation by providing an effective framework of phonetic description.
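Purely as an illustrative summary of the three parallel effects just described, one might sketch them as follows; the gains, the soft-saturation law standing in for the articulator’s contact against the palate, and all names are our own assumptions, since the model specifies these relations only qualitatively here.

import math

def emphasis_effects(pulse_height, enhancement, jaw_gain=1.0,
                     irf_gain=1.0, ceiling=1.0):
    pulse = enhancement * pulse_height             # magnified syllable pulse
    jaw_excursion = jaw_gain * pulse               # effect 1: linear increase (assumed gain)
    vocalic_scale = enhancement                    # effect 2: inherent vocalic gesture enhanced
    raw = irf_gain * pulse                         # effect 3: tautosyllabic IRF amplitudes grow...
    observed = ceiling * math.tanh(raw / ceiling)  # ...but observed movement saturates (assumed law)
    return jaw_excursion, vocalic_scale, observed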
Acknowledgment

This research has been supported in part by NSF (SBR-9511998, BCS-9977018), ATR/HIP, and ATR/MIC. The author would like to acknowledge effective collaboration by Donna Erickson, Bryan Pardo, Corey Mitchell, and Caroline Menezes. John Westbury and his staff at the Microbeam Facility, University of Wisconsin, assisted in the data collection for the experiments discussed here. The author is also grateful for editorial comments by Klaus Kohler and John Kingston, which contributed to a substantial improvement of this article.
References

Archangeli, D.: Aspects of underspecification theory. Phonology 5: 183–205 (1988).
Browman, C.P.; Goldstein, L.M.: Articulatory phonology: an overview. Phonetica 49: 155–180 (1992).
Crystal, T.; House, A.S.: A note on the durations of American English consonants; in Kiritani, Hirose, Fujisaki, Speech production and language, pp. 195–213 (Mouton de Gruyter, Berlin 1997).
Erickson, D.: Effects of contrastive emphasis on jaw opening. Phonetica 55: 147–169 (1998).
Erickson, D.; Fujimura, O.; Dang, J.: Emphasized versus unemphasized /aj/: jaw, tongue, and formants. J. acoust. Soc. Am. 105: 1354 (1999).
Erickson, D.; Fujimura, O.; Pardo, B.: Articulatory correlates of prosodic control: emotion and emphasis. Lang. Speech 41: 395–413 (1998).
Firth, J.R.: Sounds and prosodies. Trans. Philol. Soc. 1948: 127–152; also in Firth, J.R.: Papers in linguistics, 1934–1951, pp. 121–138 (Oxford University Press, London 1957).
Fujimura, O.: Relative invariance of articulatory movements; in Perkell, Klatt, Invariance and variability in speech processes, pp. 226–242 (Erlbaum, Hillsdale 1986).
Fujimura, O.: Methods and goals of speech production research. Lang. Speech 33: 195–258 (1990).
Fujimura, O.: C/D model: a computational model of phonetic implementation; in Ristad, Language computations, DIMACS Ser. in Discrete Mathematics and Theoret. Computer Sci., Vol. 17, pp. 1–20 (American Mathematical Society, Providence 1994).
Fujimura, O.: Syllable structure constraints: a C/D model perspective; in Agbayani, Harada, Proc. SWOT-II, Working Papers in Ling., University of California, Irvine, pp. 59–74 (Department of Linguistics, University of California, Irvine 1996a).
Fujimura, O.: Iceberg revisited. J. acoust. Soc. Am. 99: 2aSC (1996b).
Fujimura, O.: Neuromuscular simulation and linguistic control. Bull. Commun. parlée 4: 59–63 (1998).
Fujimura, O.; Erickson, D.: Acoustic phonetics; in Hardcastle, Laver, Handbook of phonetic sciences, pp. 64–115 (Blackwell, Oxford 1997).
Fujimura, O.; Lovins, J.B.: Syllables as concatenative phonetic units; in Bell, Hooper, Syllables and segments, pp. 107–120 (North Holland, Amsterdam 1978).
Fujimura, O.; Pardo, B.; Erickson, D.: Effect of emphasis and irritation on jaw opening; in Duez, Proc. ESCA/SPoSS, pp. 23–29 (1998).
Fujimura, O.; Williams, J.C.: Syllable concatenators in English, Japanese and Spanish; in Fujimura, Joseph, Palek, Item order: Proc. LP’98, pp. 461–498 (Charles University Press, Prague 1999).
Halle, M.; Idsardi, W.: General properties of stress and metrical structure; in Goldsmith, Handbook of phonological theory, pp. 403–443 (Blackwell, Oxford 1995).
Haraguchi, S.: A theory of syllables; in Fujimura, Joseph, Palek, Item order: Proc. LP’98, pp. 691–715 (Charles University Press, Prague 1999).
Hayes, B.: The phonology of rhythm in English. Ling. Inquiry 15: 33–74 (1984).
Itô, J.; Mester, A.; Padgett, J.: Licensing and underspecification in optimality theory. Ling. Inquiry 26: 571–613 (1995).
Keating, P.: Segmental phonology and non-segmental phonetics. Proc. 13th Int. Congr. Phonet. Sci., Vol. 3, pp. 26–32 (Stockholm 1995).
Kohler, K.: Segmental reduction in connected speech in German: phonological facts and phonetic explanations; in Hardcastle, Marchal, Speech production and speech modelling, pp. 69–92 (Kluwer, Dordrecht 1990).
Krakow, R.A.: Physiological organization of syllables: a review. J. Phonet. 27: 23–54 (1999).
Leben, W.: Weak vowels and vowel sequences in Kwa: sounds that phonology can’t handle; in Fujimura, Joseph, Palek, Item order: Proc. LP’98, pp. 717–732 (Charles University Press, Prague 1999).
Lehiste, I.: Suprasegmentals (MIT Press, Cambridge 1970).
Levelt, W.J.M.: Speaking: from intention to articulation (MIT Press, Cambridge 1989).
Levelt, W.J.M.; Roelofs, A.; Meyer, A.S.: A theory of lexical access in speech production. Behav. Brain Sci. 22: 1–75 (1999).
Liberman, M.; Prince, A.: On stress and linguistic rhythm. Ling. Inquiry 8: 249–336 (1977).
Lindblom, B.: Spectrographic study of vowel reduction. J. acoust. Soc. Am. 35: 1773–1781 (1963).
Lindblom, B.: Explaining phonetic variation: a sketch of the H and H theory; in Hardcastle, Marchal, Speech production and speech modeling, pp. 403–440 (Kluwer, Dordrecht 1990).
Maeda, S.: Articulatory modeling of the vocal tract. J. Physique 4: 307–314 (1992).
Öhman, S.E.G.: Numerical model of coarticulation. J. acoust. Soc. Am. 39: 151–168 (1967).
Prince, A.: Relating to the grid. Ling. Inquiry 14: 19–100 (1983).
Sproat, R.; Fujimura, O.: Allophonic variation in English /l/ and its implications for phonetic implementation. J. Phonet. 21: 291–311 (1993).
Steriade, D.: Underspecifications and markedness; in Goldsmith, Handbook of phonological theory, pp. 114–174 (Blackwell, Oxford 1994).
Umeda, N.: Consonant duration in American English. J. acoust. Soc. Am. 59: 434–444 (1977).
Wilhelms-Tricarico, R.: Physiological modeling of speech production: methods for modeling soft-tissue articulators. J. acoust. Soc. Am. 97: 3085–3098 (1995).
Acoustic Patterning of Speech. Its Linguistic and Physiological Bases
Phonetica 2000;57:139–151
Received: February 15, 2000 Accepted: April 10, 2000
Diverse Acoustic Cues at Consonantal Landmarks
Kenneth N. Stevens
Research Laboratory of Electronics and Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Mass., USA
Abstract

The consonantal segments that underlie an utterance are manifested in the acoustic signal by abrupt discontinuities or dislocations in the spectral pattern. There are potentially two such discontinuities for each consonant, corresponding to the formation and release of a constriction in the oral cavity by the lips, the tongue blade, or the tongue body. Acoustic cues for the various consonant features of place, voicing and nasality reside in the signal in quite different forms on the two sides of each acoustic discontinuity. Examples of these diverse cues and their origin in acoustic theory are reviewed, with special attention to place features and features related to the laryngeal state and to nasalization. A listener appears to have the ability to integrate these diverse, brief acoustic cues for the features of consonants, although the mechanism for this integration process is unclear.
Copyright © 2000 S. Karger AG, Basel
Introduction
Most consonants are produced with a narrowing in the oral region of the vocal-tract airway. This narrowing is sufficient to create particular types of discontinuities or ‘dislocations’ in the short-time spectrum of the sound. These discontinuities occur when the constriction is formed and when it is released. The constriction or narrowing can be formed by any one of three active articulators: the lips, the tongue blade, or the tongue body. Examples of these consonantal discontinuities can be seen on the spectrograms of the sentence ‘The bike was Danish’, in figure 1. There are three ways of creating such an acoustic discontinuity. A common way is to switch the principal acoustic source from frication (or turbulence noise) near a constriction in the oral cavity to the acoustic source at the glottis. These two acoustic sources, as well as the filtering of these sources by the vocal tract, have quite different acoustic spectra, leading to a sharp discontinuity in the spectrum. Such a discontinuity occurs at the release of an obstruent consonant such as a stop or fricative consonant. Examples in figure 1 are at the release of [b] and [d], where the frication noise burst is followed immediately by glottal vibration at the vowel onset. A discontinuity also occurs at the time the constriction in the oral cavity is formed, such as at the closure for
Fig. 1. Spectrogram of the sentence ‘The bike was Danish’, illustrating the locations of consonant landmarks, as discussed in the text.
a postvocalic stop consonant. Examples in figure 1 are the times of closure for [b] and [d] and the onset of frication noise for [E]. In the case of a postvocalic stop consonant, there is a glottal source up to the time the constriction is formed, and then there is no source of frication in the oral cavity until the constriction is released. Production of an obstruent consonant usually results in two acoustic discontinuities – one at the implosion of the consonant and the other near the release. There are, however, occasions where one of these articulatory events does not create an acoustic signature (as, for example, in the two stop consonants in the sequence ‘up to’, where the closure for /t/ precedes the release of /p/, leading to no acoustic discontinuity for either of these articulatory events). A second way of creating an acoustic discontinuity for a consonant is to switch the sound output from the nostrils to the mouth (or vice versa) while maintaining a fixed acoustic source at the glottis. In contrast to obstruent consonants, there is no change in the acoustic source at the time this change in the output occurs. This type of discontinuity occurs at the time of release of the oral-cavity closure for a nasal consonant, and at the time of formation of this closure, as illustrated by the [n] in figure 1. Such a discontinuity will only occur if the nasal consonant is adjacent to a vowel or glide. In a third mechanism for forming an acoustic discontinuity, the major acoustic path through the oral cavity switches from a path with a side branch to one in which there is a direct path without a side branch. This type of discontinuity is used for liquid consonants. The clearest example is the release of a lateral consonant into a following vowel. During the lateral configuration there is an acoustic path around one or both sides of the tongue blade, and the midline path is blocked. Following the release a midline path is created, with an abrupt change in the acoustic transfer function from the glottal source to the mouth output. The glottal source remains relatively fixed as this change in the oral cavity occurs, as with a nasal consonant. A change in the transfer
function also occurs at the release of a rhotic consonant in English, although this change is often not as abrupt as it is for a lateral consonant. For all of the types of acoustic discontinuities just discussed there is movement of one of the three oral articulators, either from a more open configuration to a narrowing, or vice versa. This movement of the articulator that causes the discontinuity is usually accompanied by movements of other articulators, particularly the tongue body, that must reach a position for the adjacent segment. If the adjacent segment is a vowel, the changes in vocal-tract shape resulting from these movements give rise to transitions of the formants on the side of the discontinuity that is adjacent to the vowel. The principal acoustic attribute that defines a consonant, then, is an acoustic discontinuity or landmark. The broad acoustic attributes of this landmark for a given consonant depend on the features [sonorant] and [continuant]. These have been called articulator-free features [Halle and Stevens, 1991]. The articulatory action that is responsible for the acoustic discontinuity is the closure or release of the lips, the tongue blade, or the tongue body. The articulator that causes the discontinuity is sometimes called the principal articulator. The various features of the consonant are represented in the sound by acoustic properties or cues in the vicinity of this landmark [Liu, 1996]. These features have been called articulator-bound features [Halle and Stevens, 1991], and include specification of place of articulation for the principal articulator and features associated with secondary articulators, including voicing (for obstruents), features distinguishing nasals from liquids, and, in some cases, tongue-body features (such as the tongue body feature [back] for laterals in some languages). In general, the cues for these articulator-bound features are different on the two sides of the acoustic discontinuity. For example, it is well known that information about place of articulation for a stop-consonant release into a vowel is carried in the spectrum of the frication noise burst (which occurs immediately preceding the acoustic discontinuity) and the transitions of the formants after the onset of the glottal source, immediately following the acoustic discontinuity. We turn now to a more detailed examination of the cues to several different consonant features as observed in the vicinity of the acoustic discontinuities that mark a consonant closure or release. In the discussion we suggest some implications of these diverse acoustic cues in the vicinity of consonantal landmarks for models for acquisition and perception of consonants.
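A crude way to locate such landmarks automatically is to measure the rate of spectral change across frames and look for peaks. The sketch below is loosely in the spirit of Liu [1996], whose detectors are considerably more elaborate; the input format and the frame step are our own assumptions.

import numpy as np

def spectral_abruptness(log_spec, step=5):
    # log_spec: array of shape (frames, bins) of log magnitudes in dB,
    # e.g. computed every 1 ms; step sets the +/- comparison interval.
    diff = np.abs(log_spec[step:] - log_spec[:-step])  # dB change per bin
    return diff.mean(axis=1)  # peaks of this curve suggest landmarks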
Voicing Feature for Obstruents
We examine first the voicing feature for obstruent consonants. As is well known, there are several cues for voicing in obstruents. Immediately preceding the consonantal release or immediately following the closure, a cue for the + or – value of the voicing feature is the amplitude of low-frequency periodicity in relation to the low-frequency amplitude in the adjacent vowel. The relative amplitude of the low-frequency prominence in the obstruent interval is expected to be 5–15 dB below that in the adjacent vowel for a voiced consonant. Immediately following onset of voicing after the consonant release, two other cues are the change in fundamental frequency (a higher starting frequency for voiceless consonants and a lower starting frequency for voiced consonants) [House and Fairbanks, 1953; Lea, 1973; Ohde, 1984] and a modification of the low-frequency spectrum shape reflecting the degree of glottal spreading (greater
spreading for voiceless consonants). Immediately preceding the consonantal closure, acoustic evidence for the degree of glottal spreading also provides a cue for the voicing feature [Klatt and Klatt, 1990]. Spectrograms of a pair of words illustrating some of these cues for the voicing distinction for a labiodental fricative are shown in figure 2. Below each spectrogram is a contour of fundamental frequency versus time (fig. 2b), together with spectra sampled at selected points in the utterance. These points are: immediately prior to the formation of the constriction for each fricative (fig. 2c), about 30 ms after this point (fig. 2d), and immediately after release of the constriction (fig. 2e). Each of these displays provides information about the voicing status of the consonant. The fundamental frequency at the onset of the vowel following the voiceless consonant is higher than following the voiced consonant, by about 21 Hz in this example, as shown in figure 2b. This acoustic property is an indication that the stiffness of the vocal folds is lower near the end of a voiced fricative and higher near the end of a voiceless fricative. An increased vocal-fold stiffness is known to raise the transglottal pressure threshold for phonation and hence to inhibit the onset of glottal vibration as the transglottal pressure increases following the consonant release [Halle and Stevens, 1971; Titze, 1992]. The low-frequency spectrum amplitude just following the formation of the constriction, in relation to the low-frequency amplitude immediately preceding this point, is much greater for the voiced consonant. The difference in this example is about 25 dB for the voiceless consonant and about 15 dB for the voiced consonant. The low-frequency spectrum shape in the vowel immediately following the constriction (fig. 2e) is different for the two consonants. The amplitude A1 of the first-formant peak relative to the amplitude H1 of the first harmonic, as well as the difference H1–H2, show evidence of glottal spreading [Klatt and Klatt, 1990] following [f] compared to [v]; glottal spreading is one of the articulatory gestures that tend to inhibit glottal vibration. In this particular utterance, there is only weak acoustic evidence of glottal spreading immediately preceding the consonant, as shown in figure 2c. It is also noted that in the spectrogram and in the spectrum of the vowel immediately following the release of the voiced consonant, there is evidence in the high-frequency region that frication noise overlaps with glottal vibration – a sign that the consonant is voiced. A similar set of measures relating to the glottal source can be made for voiced and voiceless stop consonants. These acoustic cues for the voicing distinction for obstruent consonants provide evidence for vocal-fold stiffening or slackening and for the presence or absence of glottal spreading. Both of these gestures contribute to the inhibition or facilitation of glottal vibration during the obstruent interval, when there is a reduced transglottal pressure. Not all of these cues may be present for obstruent consonants in different contexts, and there may be other cues that are not discussed here, but a partial set of cues is often sufficient to identify the voicing feature.
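For illustration, the H1–H2 cue can be estimated from a single analysis frame as follows. This is a sketch, not the measurement procedure behind figure 2; it assumes that F0 is known and that each harmonic is the largest spectral peak within ±20 Hz of its nominal frequency.

import numpy as np

def h1_minus_h2(frame, fs, f0):
    # Magnitude spectrum of one Hann-windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def peak_db(f):  # largest peak near the nominal harmonic frequency f
        band = spec[(freqs > f - 20) & (freqs < f + 20)]
        return 20 * np.log10(band.max())

    return peak_db(f0) - peak_db(2 * f0)  # H1 - H2 in dB

A1–H1 can be obtained analogously by searching for the largest peak in the neighborhood of the first formant.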
Fig. 2. Acoustic measurements from the words divine (left panels) and define (right panels), illustrating some of the cues for the voicing distinction for the labiodental fricatives. The various panels below the spectrograms are discussed in the text. The amplitudes H1, H2, and A1 are labeled in e.
Place Features for Stop Consonants
Cues for the place features at the release of stop consonants also reside in two regions following the release. The noise burst occurs while the constriction formed by the primary articulator is still small enough that there is a pressure drop across the constriction, and hence there is a rapid flow of air through the constriction. The acoustic properties of this noise burst reflect the length of the constriction and of the cavity anterior to the constriction. There is a spectral prominence in the burst that is related primarily to the length of the front cavity [Fant, 1960; Stevens, 1998]. Following the end of the burst, the source of excitation for the vocal tract shifts to the glottis, and this source excites all of the vocal-tract resonances. As the articulators move from the configuration for the consonant to that for the vowel, the shape of the vocal tract changes, and the formant frequencies change. The starting frequency, the direction, and the amount of movement of the formants, particularly F2, are cues for the place of articulation for the consonant [Delattre et al., 1955]. The F2 transition shows how the vocal-tract shape posterior to the consonant constriction changes with time following the release. These transition cues are quite different from the cues carried by the acoustic properties of the burst. Both the burst shape and the transitions are known to be cues for consonant place of articulation. Shown in figure 3 are spectrograms of the two nonsense utterances [həˈbɑb] and [həˈdɑd], together with spectra sampled in the noise burst and at two points in time after the onset of voicing: the latter two points illustrate how the formants move (particularly F2) following the release. The well-known contrast in the spectrum shape of the burst for [b] and [d] is evident, with the [d] burst having greater high-frequency amplitude due to the resonance of the cavity anterior to the constriction1. As for the formant transitions, the principal difference is a relatively high starting frequency of F2 for [d] and a low starting frequency for [b]. Figures 3c and d show that F2 remains relatively level over the initial 20 ms for [b], whereas there is a fall in F2 for [d]. (The spectrogram shows that the rate of fall of F2 is greater after this initial 20-ms interval.)
1 The burst spectra immediately below the spectrograms in figure 3 are sampled with a time window that does not overlap with the onset of full voicing following the burst. Thus these spectra are defined differently from the ‘onset spectra’ described by Blumstein and Stevens [1979] and by Walley and Carrell [1983]. Those spectra were calculated with a fixed time window that sometimes overlapped with the onset of full voicing, and thus often differ from the burst spectra in figure 3.
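The burst-spectrum sampling described in this footnote can be paraphrased in code as follows; this is a sketch, and since the window shape is not specified in the text, the Hann window here is an assumption.

import numpy as np

def burst_spectrum(signal, fs, burst_onset_s, n_frames=10):
    # Average of n_frames spectra taken at 1-ms intervals within the burst,
    # each computed with a 6.4-ms window, as described in the text.
    win = int(round(0.0064 * fs))
    hop = int(round(0.001 * fs))
    start = int(round(burst_onset_s * fs))
    frames = [signal[start + i * hop : start + i * hop + win] * np.hanning(win)
              for i in range(n_frames)]
    specs = [20 * np.log10(np.abs(np.fft.rfft(f)) + 1e-12) for f in frames]
    return np.mean(specs, axis=0)  # average log-magnitude spectrum in dB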
Fig. 3. Spectrograms and spectra sampled in the nonsense utterances [həˈbɑb] (left panels) and [həˈdɑd] (right panels). The three spectra below each spectrogram show the spectrum of the burst (the upper curve is a smoothed version of the lower curve), and two spectra sampled at the onset of glottal vibration and about 20 ms later. The location of F2 is marked on the lower two spectra. All spectra are measured with a 6.4-ms time window as shown on the waveforms. The burst spectra are averages of 10 spectra sampled at 1-ms intervals within the burst, whereas each of the other spectra is calculated with the time window centered on a pitch period, as shown below the spectra.
Fig. 4. Spectrogram of the word chemist is shown at the top. The lower panels are spectra sampled at two points within the word: about 30 ms preceding closure for the nasal consonant, and about 30 ms after the time of closure.
Nasal Consonants
As noted above, a nasal consonant at the onset of a syllable is always followed by a vowel (or occasionally a glide) in English. Likewise a nasal consonant in coda position in a syllable is always preceded by a vowel. Acoustic events in the vicinity of the discontinuity caused by the consonant release or closure can provide information about the feature [nasal] (i.e. whether or not the consonant is a nasal) and about the place of articulation features.
Fig. 5. Spectrogram of the word Dennis is shown at the top. The lower panels are spectra sampled at two points within the word: about 30 ms preceding closure for the nasal consonant, and within the nasal murmur.
With regard to the feature [nasal] we again observe that acoustic cues for the feature exhibit one set of properties in the nasal murmur preceding the release and another set immediately after the release of the consonantal closure. These cues are illustrated in the words chemist in figure 4 and Dennis in figure 5. Within the nasal murmur, the cue for the feature [nasal] is a low first resonance at about 250 Hz, with a bandwidth somewhat wider than that of a normal first-formant bandwidth, together with a nasal resonance in the frequency range 800–1,000 Hz for an adult [Båvegård et al., 1993]. There is also a general reduction in the high-frequency spectrum amplitude above
about 1,000 Hz relative to the spectrum amplitude in the vowel in this frequency range. After the consonant release, or before consonantal closure, the vowel is nasalized, and this attribute has several acoustic correlates which are different from those in the murmur. In particular, the spectra of the nasalized vowels in figures 4 and 5 both show a nasal resonance in the frequency range 750–1,000 Hz, as predicted from acoustic theory [Chen, 1995; Stevens, 1998], although in figure 5 this resonance appears only as a shoulder on the spectrum. The place feature for a nasal consonant is also represented by acoustic events on two sides of the consonantal landmark: in the formant transitions in the vowel immediately following the consonant release, and in the spectrum pattern in the nasal murmur [Kurowski and Blumstein, 1984, 1987]. The formant transitions following the release or preceding the closure for a nasal consonant are similar to those for a stop consonant, and consequently the movements of F1 and F2 provide one set of cues for nasal place of articulation. The spectrum within the murmur also varies with place of articulation. Peaks in this spectrum reflect the natural frequencies of the whole system consisting of the nasal cavity and the oral cavity. The natural frequencies for a labial nasal, such as [m] in figure 4, include the natural frequency of the vocal tract, which extends from the glottis to the labial closure. A typical value for this frequency is about 1,200 Hz for an adult speaker. For an alveolar nasal consonant this frequency is in the range 1,500–1,800 Hz. The spectrum sampled within the nasal murmur in figure 5 shows a peak in this frequency range, and can be contrasted with the spectrum for the [m] murmur in figure 4.
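The murmur cues just described (a low first resonance near 250 Hz, a nasal resonance at 800–1,000 Hz, and reduced amplitude above about 1,000 Hz) can be summarized, for illustration only, as relative band energies; the band edges below are our own rough choices, not values from the text.

import numpy as np

def murmur_band_energies(frame, fs):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    def band(lo, hi):  # energy between lo and hi Hz
        return spec[(freqs >= lo) & (freqs < hi)].sum()

    total = spec.sum()
    return {"low_resonance": band(150, 350) / total,     # ~250 Hz murmur pole
            "nasal_resonance": band(800, 1000) / total,  # nasal resonance
            "high": band(1000, fs / 2) / total}          # reduced in the murmur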
Lateral Consonants
As a final example, we examine acoustic events in the vicinity of the release for a lateral consonant. During the time of tongue-blade contact with the alveolar ridge for a lateral consonant in English, there is an acoustic path around one or both edges of the tongue blade. The airway on the midline immediately posterior to the tongue-tip closure forms a short side branch in the acoustic path from the glottis to the lips [Fant, 1960]. When the tongue-tip contact is released, this side branch is suddenly removed, and there is an acoustic discontinuity at this time. Cues for identification of the feature [lateral] again reside in the acoustic signal immediately preceding and immediately following the discontinuity or landmark. A spectrogram of the word lag is shown in figure 6, together with spectra sampled immediately preceding and immediately following the consonantal landmark (at about 90 ms in this case). Just before the release there are two low-frequency spectral peaks at about 450 and 1,150 Hz, with the second peak being relatively weak. Within about 20 ms these two formants have abruptly shifted up in frequency and the amplitude of the second peak has increased dramatically. There is a corresponding abrupt increase in the spectrum amplitude at high frequencies. Acoustic theory can explain these changes [Stevens, 1998]. For example, after release of the tongue blade, acoustic losses at the constriction will decrease, leading to a narrower and more prominent F2. One observes once again markedly different spectral characteristics on the two sides of the landmark at the consonantal release. The spectra in figure 6 are quite different from the spectra for a nasal consonant, shown in figures 4 and 5.
Fig. 6. Spectrogram of the word lag is shown at the top. The lower panels are spectra (6.4-ms time window) sampled on glottal periods just preceding and just following the consonant release. The arrows point to the spectral prominences corresponding to the second formant.
Discussion
Data of the kind summarized here may provide some insight into the process whereby a listener uncovers the underlying distinctive features and their organization into segments and words in running speech. First, these observations highlight the differences in the way acoustic cues are represented in consonants on the one hand and vowels and glides on the other. The presence of an abrupt acoustic discontinuity in the sound is an indication that a consonant constriction or closure is formed or is released. This abrupt change in the spectral pattern defines a region in the signal where cues for the articulator-bound features of the consonant are to be found. The presence of a vowel, on the other hand, is generally marked by a peak in low-frequency amplitude, indicating a maximum opening of the oral cavity. Cues for vowel features are located in the vicinity of this amplitude peak, and the acoustic parameters that provide these cues
do not exhibit abrupt changes. Likewise, glides tend to be marked by a minimum in low-frequency amplitude, with no abrupt acoustic discontinuity. At an acoustic discontinuity created by the closure or release of an articulatory constriction for a consonant, a listener must integrate the diverse acoustic cues on the two sides of the discontinuity in order to estimate the various consonantal features. This capability must be acquired by a child who is learning a language. The mechanism by which this integration is acquired, and the age at which it is acquired, has been the topic of several studies [e.g. Parnell and Amerman, 1978; Ohde et al., 1995]. Data from those studies, as well as data relating to fricative consonant perception [Nittrouer, 1992; Nittrouer et al., 1998], suggest that a child is well past 5 years of age before this integration is complete. Although the cues on the two sides of a consonantal discontinuity appear to be diverse when described in acoustic terms, they are a natural consequence of continuous movements of the articulators. The listener is endowed with the production apparatus to generate examples of consonant closures and releases, to hear the acoustic consequences of these movements, and to develop from this experience an inventory of articulatory-acoustic relations. It is not unreasonable to expect, therefore, that the listener could draw on this experience with motor actions in interpreting the acoustic patterns. Those actions can then lead rather directly to estimation of the features which represent consonant distinctions in the mental lexicon. These views of the relations between speech production and speech perception have been discussed by Liberman and Mattingly [1985] and by others. They provide a framework for developing models of the process by which listeners integrate the different consonantal cues in the vicinity of acoustic discontinuities in the speech pattern, but the detailed mechanism for this process remains unclear. The various acoustic cues that are observed on the two sides of a consonant landmark often have spectral characteristics that occupy different frequency ranges. In the presence of noise or other distorting effects on the signal, some acoustic cues to the features may be obliterated or masked, while others may remain, depending on the spectral characteristics of the noise. For example, the spectra of the bursts for different stop consonant places of articulation differ from each other primarily in the high-frequency region, whereas the transitions of F1 and F2 are in the mid- and low-frequency range. Thus the existence of multiple acoustic cues can make the identification of consonantal features robust when noise or other distortions are present.
Acknowledgments

The influence of Morris Halle and Jay Keyser in formulating the ideas in this paper is gratefully acknowledged. The comments of John Kingston are also acknowledged, with thanks. Preparation of this paper was supported in part by grants from the National Institutes of Health.
References

Båvegård, M.; Fant, G.; Gauffin, J.; Liljencrants, J.: Vocal tract sweeptone data and model simulations of vowels, laterals, and nasals. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol. Stockh., No. 4, pp. 43–76 (1993).
Blumstein, S.E.; Stevens, K.N.: Acoustic invariance in speech production: evidence from measurements of the spectral characteristics of stop consonants. J. acoust. Soc. Am. 66: 1001–1017 (1979).
Chen, M.Y.: Acoustic parameters of nasalized vowels in hearing-impaired and normal-hearing speakers. J. acoust. Soc. Am. 98: 2443–2453 (1995).
Delattre, P.C.; Liberman, A.M.; Cooper, F.S.: Acoustic loci and transitional cues for consonants. J. acoust. Soc. Am. 27: 769–773 (1955).
Fant, G.: Acoustic theory of speech production (Mouton, The Hague 1960).
Halle, M.; Stevens, K.N.: A note on laryngeal features. Research Laboratory of Electronics, Rep. No. 101, pp. 198–213 (Massachusetts Institute of Technology, Cambridge 1971).
Halle, M.; Stevens, K.N.: Knowledge of language and the sounds of speech; in Sundberg, Nord, Carlson, Music, language, speech and brain, pp. 1–19 (MacMillan Press, Basingstoke 1991).
House, A.S.; Fairbanks, G.: The influence of consonant environment upon the secondary acoustical characteristics of vowels. J. acoust. Soc. Am. 25: 105–113 (1953).
Klatt, D.H.; Klatt, L.C.: Analysis, synthesis, and perception of voice quality variations among female and male talkers. J. acoust. Soc. Am. 87: 820–857 (1990).
Kurowski, K.M.; Blumstein, S.E.: Perceptual integration of the murmur and formant transitions for place of articulation in nasal consonants. J. acoust. Soc. Am. 76: 383–390 (1984).
Kurowski, K.M.; Blumstein, S.E.: Acoustic properties for place of articulation in nasal consonants. J. acoust. Soc. Am. 81: 1917–1927 (1987).
Lea, W.A.: Segmental and suprasegmental influences on fundamental frequency contours; in Hyman, Consonant types and tones, pp. 17–70 (Linguistics Program, University of Southern California, Los Angeles 1973).
Liberman, A.M.; Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21: 1–36 (1985).
Liu, S.A.: Landmark detection for distinctive feature-based speech recognition. J. acoust. Soc. Am. 100: 3417–3430 (1996).
Nittrouer, S.: Age-related differences in perceptual effect of formant transitions within syllables and across syllable boundaries. J. Phonet. 20: 1–32 (1992).
Nittrouer, S.; Crowther, C.S.; Miller, M.E.: The relative weighting of acoustic properties in the perception of [s]+stop clusters by children and adults. Perception Psychophysics 60: 51–64 (1998).
Ohde, R.N.: Fundamental frequency as an acoustic correlate of stop consonant voicing. J. acoust. Soc. Am. 75: 224–230 (1984).
Ohde, R.N.; Haley, K.L.; Vorperian, H.K.; McMahon, C.W.: A developmental study of the perception of onset spectra for stop consonants in different vowel environments. J. acoust. Soc. Am. 97: 3800–3812 (1995).
Parnell, M.M.; Amerman, J.D.: Maturational influences on perception of coarticulatory effects. J. Speech Hear. Res. 21: 682–701 (1978).
Stevens, K.N.: Acoustic phonetics (MIT Press, Cambridge 1998).
Titze, I.R.: Phonation threshold pressure: a missing link in glottal aerodynamics. J. acoust. Soc. Am. 91: 2926–2935 (1992).
Walley, A.C.; Carrell, T.D.: Onset spectra and formant transitions in the adult’s and child’s perception of place of articulation in stop consonants. J. acoust. Soc. Am. 73: 1011–1022 (1983).
Perceptual Processing
Phonetica 2000;57:152–169
Received: December 14, 1998 Accepted: December 20, 1999
Modeling and Perception of ‘Gesture Reduction’
René Carré (a), Pierre L. Divenyi (b)
(a) ENST, Unité Associée au CNRS, Paris, France; (b) Experimental Audiology Research, VA Medical Center, Martinez, Calif., USA
Abstract

The phenomenon of vowel reduction is investigated by modeling ‘gesture reduction’ with the use of the Distinctive Region Model (DRM). First, a definition is proposed for the term gesture, i.e. an acoustically efficient command aimed at deforming, in the time domain, the area function of the vocal tract. Second, tests are reported on the perception of vowel-to-vowel transitions obtained with reduced gestures. These tests show that a dual representation of formant transitions is required to explain the reduction phenomenon: the trajectory in the F1–F2 plane and the time course of the formant changes. The results also suggest that time-domain integration of the trajectories constitutes an integral part of the auditory processing of transitions. Perceptual results are also discussed in terms of the acoustic traces of DRM gestures.
Copyright © 2000 S. Karger AG, Basel
Introduction
Numerous observations on vowel formant characteristics have been reported [Chiba and Kajiyama, 1941; Peterson and Barney, 1952; House and Fairbanks, 1953; Fant, 1973]. These characteristics have been seen to vary as a function of speaker, style, prosody, and phonetic context. The observations have led to the finding that vowels produced with very different formant characteristics could be perceptually equivalent. Because this multiple-valued nature of vowels was recognized early on, it has been generally assumed [first implicitly and, much later, explicitly, see e.g. Kuhl, 1992] that the average formant values of vowels in isolation represent targets that the talker, during continuous speech, always strives, and often fails, to reach. Cases when vocalic targets are not reached have been termed, from the standpoint of production, articulatory undershoot [Brownlee, 1996; Lindgren and Lindblom, 1996] and, from the standpoint of acoustic trajectories, ‘vowel reduction’ [Lindblom, 1963]. Vowel reduction is observed, for instance, in fast speech (where the vocal tract is not given enough time to complete the required task and, therefore, it is denied the opportunity to have its shape
conform to that of an intended target) as well as in ‘hypospeech’ [Lindblom, 1990]. The intended targets, then, are recovered by the listener either by way of a special compensatory mechanism that produces a ‘perceptual overshoot’ [Lindblom and Studdert-Kennedy, 1967] or by using previously learned and stored templates of vowel formant patterns with the undershoot effects included [Johnson, 1997]. While such a mechanism would predict a one-to-one correspondence between production and perception of unreached target vowels, some results appear to cast the shadow of doubt over this prediction. For example, in production, vowel targets can often be reached even in fast speech [Kuehn and Moll, 1976; Gay, 1978; van Son and Pols, 1992], while normal-rate speech does not necessarily ensure that the targets will be reached at all times [Nord, 1986]. The situation is no less ambiguous in the perceptual domain: under certain conditions, listeners seem to base their judgment on the average of the formant values covered by the trajectory. Since averaging the frequencies of a formant trajectory inevitably displaces the percept of the final frequency toward the initial frequency, such cases yield a perceptual undershoot [van Son and Pols, 1993], rather than an overshoot stemming from a perceptual compensation for vowel reduction. Other authors argue that first formant transitions are averaged while second formant transitions lead to a perceptual overshoot [Huang, 1987; Di Benedetto, 1989]. The differences between the formant frequencies of carefully produced vowels (that is, those either uttered in isolation or found in laboratory speech) and those observed in spontaneous fluent speech suggest that while vowel reduction is a potentially precious tool for the study of speech phenomena having to do with speech kinematics, it is ill-suited for the study of steady-state vowels. Vowel-to-vowel transitions are part of these kinematic phenomena, and studying them through vowel reduction reflects an approach that, in agreement with Strange [1989], we favor [Carré et al., 1994]. Unfortunately, investigating the live production of vowel transitions and its perceptual consequences poses numerous and serious methodological problems. To bypass these, the method of simulating production by means of a realistic model has been used as an alternative to recorded natural tokens, as long as the model is regarded as capable of generating acceptable approximations. In the present paper, we will resort to such a simulation by way of a simple acoustic production model controlled by command parameters. The model in question, deduced from acoustic theory, is the Distinctive Region Model (DRM), which appears particularly well adapted for the study of the kinematics of speech production [Mrayati et al., 1988]. In this model, simple and efficient commands – similar to those actually realized during speech production – accomplish specific acoustic changes by some specific deformation of the vocal tract. We define efficiency as meaning that a small area deformation should lead to a large acoustic variation. As a consequence of the acoustic properties of a tube closed at one end, deformations applied at different points of the tube will not be equally efficient. Consequently, the model constrains the constrictions of the vocal tract to occur at points that are acoustically the most efficient.
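The notion of acoustic efficiency can be made concrete with the standard small-perturbation result for a uniform tube closed at the glottis and open at the lips: the relative shift of formant n produced by a small local area enlargement varies cosinusoidally along the tube. The sketch below is textbook perturbation theory, not the DRM itself; but the zero crossings of such sensitivity functions delimit regions of the kind the DRM exploits.

import numpy as np

def formant_sensitivity(x, n, L=0.175):
    # Relative shift dF_n/F_n per relative area enlargement dA/A at distance
    # x (m) from the glottis, for a uniform tube of length L closed at the
    # glottis: negative near the closed end (enlargement lowers F_n),
    # positive near the open end (enlargement raises F_n).
    k = (2 * n - 1) * np.pi / (2 * L)  # wavenumber of resonance n
    return -np.cos(2 * k * x)

x = np.linspace(0.0, 0.175, 8)
print(formant_sensitivity(x, n=2))  # F2 sensitivity along the tract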
DRM commands corresponding to dynamic phonetic tasks are assumed to be speech gestures for vowel-to-vowel production. As the articulators are engaged in performing a certain constriction pattern, the area function of the vocal tract and its resonant characteristics are changed in some specific way. Thus, a speech gesture represents the transition in the time domain between two quasistationary segments of an utterance, i.e. the transition between an initial and a target constriction pattern and the corresponding initial and target area functions and their resonance characteristics. For example, the vowel-to-vowel transition /ai/ is described
by only one gesture – the one for lingual constriction – while the French /ay/ is described by two gestures – the lingual constriction gesture and the lip gesture. It should be noted that gestures derived using the DRM are in close agreement with the speech gestures derived by Browman and Goldstein [1986] from empirical (i.e. articulatory) data. Because they are distinctive in their nature, DRM gestures also represent units of information analogous to the information units of Browman and Goldstein [1986] that arise from the coordination of articulators when they engage in constricting the vocal tract at specific regions. The objective of the present article is to investigate the perception of vowel-to-vowel transitions generated by DRM gestures. From the results, we will attempt to determine whether the listener is able to infer the occurrence of a specific vocal tract deformation gesture from the presence of its acoustic consequence, namely, from the presence of a specific temporal-spectral pattern in the formant transition.1 In the articulatory domain, deformation gestures acting on the area function of the vocal tract are intended to reach articulatory targets corresponding to acoustic targets. In the case of vowel reduction, we use the term ‘gesture reduction’ to refer to the case when an articulatory target corresponding to an acoustic target is not reached. If the vocal tract deformation is the joint consequence of several individual gestures (referred to as coproduction [Kozhevnikov and Chistovich, 1965; Fowler, 1992]), then a reduction may take place in any or all of the component gestures. Therefore, it is essential to scan gestures individually, both for the occurrence of reduction in each of the components and for the degree of their relative independence [Mattingly, 1990]. To be precise, vowel production may be said to rely on two largely autonomous supralaryngeal gestures: tongue constriction displacement and lip rounding. If the time course and/or pattern of synchronization of these two gestures are different, the resulting trajectory between a given initial and final vowel will also be different. The series of experiments described below were designed to investigate the perception of formant trajectories in vowel-to-vowel transitions in situations where vowel reduction could be observed.
Methods

One major objective of the present investigations was to determine the extent to which the transition had to proceed along a given V1–V2 trajectory in order for the listener to perceive the vowel V2 at the endpoint of the trajectory. To preclude any interference from the recency effect, we resorted to using a V1V2V1 stimulus similar to the one reported by Kuwabara [1985] and Beautemps [1993]. The major advantage of such tokens is that the V1 vowel provides a firm reference anchor². Our V1V2V1 tokens were synthesized using the DRM controlled by two gestures: the tongue gesture and the lip gesture.
¹ Because the transitions evolve in both the time and the frequency domains, throughout the article we will refer to transitions in two coordinate systems. The first of these is the F1–F2 plane, which is most appropriate to represent formant trajectories and, consequently, the phonetically meaningful regions traversed (for example, the [ai] trajectory passes through the regions containing the vowels [ɛ] and [e]) but is not suited to show temporal information. The second is the time domain representation of the formant transitions, which captures well the time course of trajectories but which, by itself, does not convey information regarding the phonetic categories traversed.
² In contrast, CVC tokens used by a number of investigators fail to provide a well-defined spectral reference on either side of the vowel and thus raise the possibility of the listener estimating vowel formants through time averaging. Such an averaging could underlie a perceptual undershoot, as proposed by van Son [1993] and (partially) substantiated by his data.
Fig. 1. Schematic diagram of the vocal tract deformation gestures for an [iV2i] sequence. The vocal tract shape of [i] is shown as the solid line, whereas the one for the (arbitrary) V2 is shown as the broken line. The gesture for [iV2] is indicated by the solid arrow (1 and associated 1) and that for the [V2i] by the broken arrow (2 and associated 2).
The time course of the transition between two vowels was obtained through a cosine interpolation. The model’s output consisted of a sequence of formant frequencies calculated using the algorithm proposed by Badin and Fant [1984]. This algorithm takes into account wall vibrations as well as losses of different origins affecting the vocal tract (e.g. heat, viscosity and the resistive part of the radiation impedance)³. The resulting formant frequencies were used to control the first three formants of a cascade formant synthesizer, whereas F4 and F5 were fixed at 4,000 and 5,000 Hz, respectively. Bandwidths of the first five formants were, in ascending order, 70, 100, 110, 150 and 200 Hz. Frame duration was held constant at 10 ms. The voice source was represented by a series of 0.1-ms pulses shaped by a second-order glottal filter (F = 100 Hz, BW = 300 Hz). F0 was varied linearly from 120 Hz at the beginning of the initial V1, to 130 Hz at the beginning of V2, and to 100 Hz at the end of the final vowel.

For each experiment, a set of tokens was synthesized, each having different transition characteristics. Each block of trials consisted of ten random-order presentations of each token. Subjects were 5 native French listeners. Their task consisted of assigning, by button press, one of three or four labels to the second vowel V2 in each token. The list of labels, corresponding to French vowels, was assembled by the first author following pilot listening tests. Each trial was initiated by the subject’s response to the previous one. Stimuli were presented through earphones at a comfortable listening level. A PC controlled all experiments. Results for each token were expressed as mean percent identification and intersubject standard deviation.
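The parameter-generation scheme just described can be sketched as follows. This is our reading of the Methods, not the authors' code: the frame size, F0 breakpoints, fixed formants, and bandwidths come from the text, while the function names are ours and the mapping from DRM commands to F1–F3 (via the area function) is omitted:

    # Frame-based parameter generation for a V1V2V1 token, as described in
    # the Methods. Illustrative only; F1-F3 come from the DRM in the real
    # model, not from direct interpolation.
    import numpy as np

    FRAME_MS = 10.0
    F4, F5 = 4000.0, 5000.0                        # fixed higher formants (Hz)
    BANDWIDTHS = (70.0, 100.0, 110.0, 150.0, 200.0)

    def cosine_course(x0, x1, n_frames):
        """Cosine-interpolated time course between two parameter values."""
        t = np.linspace(0.0, 1.0, n_frames)
        return x0 + (x1 - x0) * (1.0 - np.cos(np.pi * t)) / 2.0

    def f0_contour(n_frames, v2_onset_frame):
        """Piecewise-linear F0: 120 Hz at token onset, 130 Hz at the onset
        of V2, 100 Hz at the end of the final vowel (values from the text)."""
        f0 = np.empty(n_frames)
        f0[:v2_onset_frame] = np.linspace(120.0, 130.0, v2_onset_frame)
        f0[v2_onset_frame:] = np.linspace(130.0, 100.0, n_frames - v2_onset_frame)
        return f0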
Experiment 1: Gesture Reduction for [iai]
In the first experiment (experiment 1a), the V1V2V1 sequence has the [i] vowel for V1 and vowels along the [ia] trajectory for V2. Only one gesture was involved in the transition. The objective of the experiment was to measure the minimum displacement toward the [a] at which the listener will perceive with certainty the distant target vowel, just as if, in fact, it had been reached. Figure 1 illustrates the two successive gestures performed to obtain the constriction patterns that generate the desired [iV2i] sequence.
³ In the computation, the total effective length Le of the tube is the physical length Lp plus the length corresponding to the radiation inductance Lr. Lp is a function of the protrusion and Lr is proportional to the lip opening [Fant, 1975]. For example, for the vowel [i], Lr = 2 cm and Lp = 17 cm without protrusion, yielding Le = 19 cm; for the vowel [y], Lr = 0 cm and Lp = 19 cm because of 2 cm of protrusion, so Le = 19 cm. The radiation effect is compensated by the protrusion length [Mrayati et al., 1990] and the total effective length is constant.
Fig. 2. Schematic representation of the eight [iV2i] sequences with different degrees of constriction, synthesized to be used as stimuli in experiment 1a. a DRM command amplitude (arbitrary units) as a function of time. b F1 and F2 transitions corresponding to the gesture in a, shown in a spectrogram plot. The temporal representation of F3 is also shown (dotted line) for the case in which the [a] target is reached. c F1–F2 plot of the eight formant trajectories in b. Note that all V2 vowels (shown as points) fall on the [ia] trajectory and can be interpreted as incomplete [iai] transitions, except for the rightmost point. The time labels on some of the points refer to the time of return to the final [i] vowel, before completing the [ia] transition, that is, the time value of vowel reduction.
It is obvious that different degrees of constriction (indicated as the endpoint of the left solid arrow) will lead to different vowels V2. Using the DRM synthesizer, eight [iV2i] sequences with different degrees of constriction were generated. The global command amplitude for the eight stimuli is illustrated as a function of time in figure 2a, whereas figure 2b shows the corresponding F1 and F2 transitions. The value of F3 obtained by the model is 2,944 Hz for the vowel [i] and 2,795 Hz when the target [a] is reached (for illustrative purposes, the F3 temporal trajectory is shown in figure 2b for the token in which the target [a] is reached). The duration of the initial V1 vowel was always 100 ms, whereas that of the final V1 vowel was 150 ms; the duration of both the V1V2 and the V2V1 transitions was 100 ms. Two of the eight items actually reached the target [a] before returning to [i] (i.e. the duration of the target in these two items was 10 ms
and 0 ms, respectively). In contrast, the six other tokens did not reach the [a] (and thus represent different degrees of reduction), with the first transition truncated at 10, 20, 30, 40, 50, and 60 ms before it would have reached the [a] target, had this transition been complete. Put differently, the onset of the gesture to return to [i] was advanced, i.e. it occurred at –10, –20, ..., –60 ms with respect to the instant at which the target [a] would have been reached by a completed transition. (Throughout the article, the negative values in milliseconds refer to the time before the target would be reached, were the transition to continue its course.) These characteristics were chosen by assuming that the command for the first task was to reach the target by means of a quasi-ballistic motion, which was cut short by an opposite command before the target was reached. [In fact, we will see later that the precise choice of the movement between two targets is irrelevant (experiment 1c).] The formant trajectory in the F1–F2 plane is plotted in figure 2c. Note that the trajectory in this coordinate system is the same for all eight tokens except for the V2 endpoint. The figure also shows that the trajectory traverses regions corresponding to several French vowels. For example, formant values on the trajectory 40 ms before reaching [a], i.e. halfway between [i] and [a], correspond to those of the French vowel [ɛ].

Results of the perception tests are shown in figure 3a. The task of the listeners was to label the extreme vowel V2 as one of the three French vowels /e/, /ɛ/, or /a/⁴. The specific aim of this experiment was to determine the extent to which listeners identify the extreme vowel V2 as /ɛ/ or /e/ when the transition does not go all the way to the endpoint [a]. Figure 3a illustrates average results of 5 listeners showing the percentage of /iai/ judgments for the eight different degrees of [V2i] transition cutback. As the figure indicates, the subjects made more than 80% /iai/ judgments for reductions corresponding to [V2i] transitions ranging from 0 to –30 ms before a complete [ia] transition would have been reached. The –30 ms cutback corresponds to formant values of F1 = 607 Hz and F2 = 1,803 Hz, which are quite different from those characteristic of a typical [a]. For [V2i] transitions returning to [i] more than 30 ms before completing the [ia] transition, first an /iɛi/ and then an /iei/ percept were reported. The standard deviations, shown as the error bars in figure 3a and most subsequent data graphs, carry more intersubject than intrasubject variability, suggesting the presence of individual differences in the internal reference for the vowels /a, ɛ, e, i/, even for subjects who belong to the same language group. This is illustrated in figure 3b, which shows data of 1 individual listener averaged over five blocks of trials; the spread of these data is much smaller than that seen in figure 3a for the data of all subjects combined.

But would a V2, which was perceived as /a/ in an [iV2i] context, also be perceived as /a/ when presented in isolation? The objective of experiment 1b was to answer this question. Stable vowels (200 ms in duration) having the same formant values as those of the V2 endpoints of the transitions illustrated in figure 2c were synthesized and presented to the same subjects for labeling, using the three response categories identical to those in experiment 1a.
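Before turning to the results, the cutback construction just described can be sketched in a few lines. This is our reconstruction under stated assumptions: command units are arbitrary, and the nonlinear mapping from command amplitude to formants via the DRM area function is omitted:

    # Experiment 1a cutback scheme: the global command rises from [i] (0)
    # toward [a] (1) along a 100-ms cosine course, and the return command is
    # advanced so the first transition is truncated 10-60 ms early. Values in
    # command units; the command-to-formant mapping is nonlinear and omitted.
    import math

    TRANS_MS = 100.0

    def command_at(t_ms):
        """Cosine rise of the command from 0 (= [i]) to 1 (= [a])."""
        t = min(max(t_ms / TRANS_MS, 0.0), 1.0)
        return (1.0 - math.cos(math.pi * t)) / 2.0

    for cutback_ms in (0, -10, -20, -30, -40, -50, -60):
        peak = command_at(TRANS_MS + cutback_ms)   # how far the gesture got
        print(f"cutback {cutback_ms:+d} ms -> {100 * peak:.0f}% of the command range")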
Results shown in figure 3c indicate that, up to formant values corresponding to cutbacks not exceeding –20 ms, the vowel was predominantly perceived as /a/. Thus, a comparison of the results of experiments 1a and 1b suggests the presence of a perceptual overshoot of about 15%⁵.
⁴ In order to obtain a more complete picture of the phenomenon, the same experiments should be repeated with native speakers of other languages used as subjects. It is expected that a language with few vowels would lead to larger reductions than languages with many vowels.
Fig. 3. Experiments 1a and 1b. a Average labeling results (and standard deviation) of 5 listeners in experiment 1a for V2 in a [iV2i] context. Ordinate: percent responses; V2 = /a/ (filled diamonds), V2 = /ɛ/ (squares), V2 = /e/ (triangles). Abscissa: point of return of the transition to [i] prior to reaching [a]. b Average labeling results (and standard deviation) of 1 listener in experiment 1a for V2 in a [iV2i] context. c Average labeling results (and standard deviation) of 5 listeners in experiment 1b for a steady-state V2 vowel. Ordinate: percent responses; V2 = /a/ (filled diamonds), V2 = /ɛ/ (squares), V2 = /e/ (triangles). Abscissa: F1–F2 value of the V2 vowel (fig. 2c) shown as the point of return of the transition to [i] prior to reaching [a], had the vowel been the midpoint of an [iV2i] transition.
Note that the identification curves for /ɛ/ and /e/ in figure 3a are also shifted rightward with respect to figure 3c, and thus they, too, suggest a perceptual overshoot of approximately 15%.

In experiment 1a, the slope of the first transition and the duration of the second were kept constant. Experiment 1c was designed to examine what role the slope plays
⁵ Actually, the overshoot is probably quite a bit larger than 15% because, as a consequence of our smooth transition onsets and offsets, the nominal duration of the steady-state portion underestimates its real duration. Consequently, we estimate the overshoot to be at least 30%.
Fig. 4. Experiment 1c. a Spectrogram representation of the five different temporal patterns of the formant transitions; note that the transition for all tokens reaches the same V2 and that the total duration of the transition is always constant at 150 ms. b F1–F2 plane representation of the transitions; note that the extreme value, i.e. the V2 vowel, corresponds to the formant values of the vowel [ɛ]. c Average labeling results (and standard deviation) of 5 listeners in experiment 1c for V2 in a [iai] context with changing transition durations. Ordinate: percent responses; V2 = /a/ (filled diamonds), V2 = /ɛ/ (squares), V2 = /e/ (triangles). Abscissa: token numbers corresponding to the five different transition slopes (a).
in the reduction process. We selected one particular V2 among the most reduced targets in experiment 1a which was still perceived as /a/. This vowel is close to the transition cutback of –30 ms (with F1 = 596 Hz and F2 = 1,803 Hz) and is predominantly perceived as /ɛ/ in isolation. In the five tokens synthesized, the slopes of the two transitions were changed gradually (fig. 4a) while keeping the other parameters constant – especially the formant trajectory in the F1–F2 plane (fig. 4b) and the duration of V2. The total duration of the transitions of the reduced vowel (i.e. the transition region that included [iV2] and [V2i]) was fixed at 150 ms. In other words, when the slope of the first transition increased, that of the second decreased. Listening results, shown in figure 4c,
Fig. 5. Average labeling results (and standard deviation) in experiment 1d using a V2 value corresponding to a steady-state [ɛ]. Ordinate: percent responses; V2 = /a/ (filled diamonds), V2 = /ɛ/ (squares). Abscissa: total duration of the token.
indicate that changing the slopes in this fashion left the percept unaffected: all the tokens were perceived as /iai/.

In experiment 1c, the duration of the complete [iV2i] transition region in which vowel reduction was observed was about 150 ms. In experiment 1d, we addressed the question of whether the perceived V2 vowel was influenced by the equivalent of speaking rate. To accomplish this, we constructed a stimulus series starting with the token described in experiment 1c (that is, with F1 = 596 Hz and F2 = 1,803 Hz) and simultaneously lengthened all segments of the token from the original 400 ms to 440, 480, 520, 560, 600, and 800 ms, while keeping the proportion of the steady-state portions and transitions constant. (The duration was changed by simply increasing the frame duration from the original 10 ms to 11, 12, 13, 14, 15, and 20 ms.) Figure 5 illustrates the labeling results obtained in this experiment. As the figure shows, the reduction persists for token durations up to 600 ms. The duration of the entire transition region for this token was 225 ms. When the transition duration increases past 225 ms, i.e. when the speaking rate further decreases, vowel reduction vanishes and the percept becomes /iɛi/.
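The time-scaling manipulation of experiment 1d amounts to multiplying every segment by the frame duration, which keeps all proportions constant. A one-line illustration (our sketch; durations reproduce the figures given in the text):

    # Experiment 1d rescaled whole tokens by increasing the frame duration.
    base_frames = 40                      # 400-ms token at 10 ms per frame
    for frame_ms in (10, 11, 12, 13, 14, 15, 20):
        print(f"{frame_ms} ms/frame -> {base_frames * frame_ms} ms token")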
Experiment 2: Gesture Reduction for [aya]
The first experiment examined reduction for the case of a vowel sequence generated by a single gesture, i.e. tongue constriction. There exist, however, vowel-to-vowel transitions in which more than one gesture participates, e.g. [aya] in French. We chose this sequence as the stimulus of experiment 2a for two reasons: the presence of two simultaneous gestures in the [ay] transition, i.e. tongue constriction and lip rounding, and the vocal tract shape of [y]. In comparison to the area function shape of the vowel [i] used in experiment 1, the vowel [y] differs by virtue of the presence of a second gesture, i.e. lip rounding. For the sake of simplicity, in the synthesis scheme used for generating the [y] vowel in the present experiment (2a), the two gestures were made strictly synchronous. Stimulus generation was accomplished using the DRM synthesizer described in the ‘Methods’ section. Figure 6a shows the time domain plot of a set of eight [aya] formant transitions with increasing [V2a] transition cutbacks, i.e. with an increasingly earlier starting point
Fig. 6. Experiment 2a. a Temporal representation of the F1 and F2 transitions of eight of the ten tokens used in the [aV2a] experiment, where only two tokens completed the transition to the vowel [y]. The other six had their transitions progressively cut back, i.e. their formant frequencies returned toward those of [a] after reaching various V2 endpoint vowels on the trajectory. The temporal representation of F3 is also shown (dotted line) for the case in which the [y] target is reached. b F1–F2 plane representation of the [aya] transition for the eight tokens shown above. Note that, in contrast to the [iai] trajectory (fig. 2c), the [aya] trajectory is curved. c Results (average and standard deviation) of experiment 2a: percent /aya/ (filled diamonds), /aia/ (squares), /ala/ (triangles) responses as a function of the duration of [y] or the transition cutback point (i.e. the point of return to [a]), where 0 ms refers to the condition in which the vowel [y] is reached but the transition immediately returns toward [a]. Note that 100% /y/ responses were obtained only for the 30-ms ‘positive cutback’ condition, i.e. for the condition in which there was a 30-ms steady-state [y] before the transition actually took a turn back to [a]. Also, note that the intersubject variability is much larger than the one observed in the subtests of experiment 1.
of the return to [a]. Among the eight tokens there is one in which the transition returns to [a] immediately after reaching the [y] and another in which a 10-ms steady-state [y] is present before returning to [a]. The value of F3 obtained by the model is 2,795 Hz for the vowel [a], and 2,540 Hz when the target [y] is reached. For illustrative purposes, the time trajectory of F3 (important for the distinction /i/-/y/) is shown in figure 6a for the case where the target [y] is reached. The F1–F2 trajectory corresponding to the time domain plot is illustrated in figure 6b. When, in the model, the two gestures are produced in perfect synchrony, the acoustic effect caused by lip rounding (manifest as a drop in F1 and F2) occurs later than that of the lingual constriction (manifest as a drop in F1 and a rise in F2). In contrast to the [iai] trajectory used in experiment 1, the one for [aya] is not a straight line in the F1–F2 plane⁶. The [aya] transition, therefore, should be considered an appropriate stimulus to study the perception of curved trajectories generated by a composite gesture.

In preliminary labeling tests, we observed that none of the eight stimuli shown in figure 6a yielded an unambiguous /aya/ percept. Consequently, in the main labeling experiment (experiment 2a), we added two new tokens in which the stable portion of [y] was increased by either 10 or 20 ms, respectively. Subjects were the same native French listeners who had participated in the former experiments. They were given the task of deciding whether any of the three vowels /y/, /i/, or /ɛ/ or the liquid /l/ was the most appropriate label for V2 in the token they just heard. An /l/ percept is possible because, in French, /l/ is fronted. Results of this experiment are given in figure 6c. As the figure indicates, merely reaching the target [y] was insufficient to generate a definite /aya/ percept; in order for that to happen, [y] had to be present as a steady-state vowel for a duration longer than 30 ms. When the duration of the steady-state [y] was shorter than 30 ms, V2 was perceived as /i/ rather than /y/. In other words, unless the transition halted at the [y] vowel and remained there for a given minimum duration (i.e. 30 ms), the lip rounding gesture was ignored. This finding strongly suggests that vowel-to-vowel transitions are governed by temporal integration of the trajectory vectors in the F1–F2 plane: there appears to be a mechanism that calculates the time average of the length and the direction of formant trajectories that are, by definition, time-varying, and that predicts the target of the trajectory which, in fact, may not actually be reached. Such a mechanism could explain why, in experiment 2a, there were so many /i/ responses when, in fact, the trajectory was diverging from the [i] formant values⁷.

Therefore, the question arises whether an /y/ percept could be induced by presenting to the listeners an incomplete curved trajectory which, contrary to the one used in experiment 2a, would never point toward the [i] vowel. Consequently, in experiment 2b we synthesized an F1–F2 trajectory that first curved downward (i.e. was consistent with a labial gesture) before pointing toward the [y] (fig. 7b). The time course of the first two formants is illustrated in figure 7a, with degrees of cutback identical to those of the tokens used in experiment 1a (fig. 2b). Because, in preliminary tests, an /aia/ percept was never obtained, we gave the subjects the task of labeling the V2 vowel as /y/, /ø/, or /œ/.
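The integration hypothesis advanced above can be made concrete with a small sketch. This is our illustration of the idea, not a model from the paper: it time-averages the per-frame displacement vectors of a sampled F1–F2 trajectory and takes the mean heading as the predicted direction of travel:

    # Temporal integration of trajectory vectors: average the per-frame
    # displacements in the F1-F2 plane and extrapolate the mean heading to a
    # predicted (possibly unreached) target. Illustrative only.
    import numpy as np

    def mean_heading(f1, f2):
        """Unit vector giving the time-averaged direction of travel in the
        F1-F2 plane for a trajectory sampled once per frame."""
        steps = np.diff(np.column_stack([f1, f2]), axis=0)
        mean_step = steps.mean(axis=0)        # time average of the vectors
        return mean_step / np.linalg.norm(mean_step)

    # For a curved [ay]-like trajectory whose early portion heads toward
    # [i]-like formant values, the mean heading can point at [i] rather than
    # [y] -- consistent with the many /i/ responses in experiment 2a.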
Labeling results are shown in figure 7c for the same 5 French subjects.
⁶ In natural speech, a straight trajectory is obtained due to labial anticipation [Carré and Mrayati, 1991]. Using the model, such a straight trajectory is obtained by emulating an anticipated labial gesture.
⁷ These results also predict that a transition which never points to [i], such as a strictly straight-line [ay] trajectory, would never lead to an /i/ percept. Tests of this prediction will be the subject of a subsequent study.
Fig. 7. Experiments 2b and 2c. a Temporal representation of the F1 and F2 transitions of the eight tokens used in the [aV2a] experiment, where only two tokens completed the transition to the vowel [y]. The other six had their transitions progressively cut back, i.e. their formant frequencies returned toward those of [a] after reaching various V2 endpoint vowels on the trajectory. The temporal representation of F3 is also shown (dotted line) for the case in which the [y] target is reached. b F1–F2 plane representation of the [aya] transition for the eight tokens used in experiment 2b (solid line). Note that, compared with the [aya] trajectory (broken line) shown in figure 6b, the trajectory is symmetrically curved. c Results of experiment 2b: percent /aya/ (filled diamonds), /aøa/ (squares), /aœa/ (triangles) responses as a function of the duration of [y] or the transition cutback point (i.e. the point of return to [a]), where 0 ms refers to the condition in which the vowel [y] is reached but the transition immediately returns toward [a]. Note that /aya/ responses were already obtained for the –20 ms ‘negative cutback’ condition, i.e. for the condition in which the transition returned 20 ms before reaching the target [y]. Average results of 5 subjects. d Results of experiment 2c: labeling of steady-state V2 vowels. Ordinate: percent responses; V2 = /y/ (filled diamonds), V2 = /ø/ (squares), V2 = /œ/ (triangles). Abscissa: F1–F2 value of the V2 vowel shown as the point of return of the transition to [a] prior to reaching [y], had the vowel been the midpoint of an [aV2a] transition.
Fig. 8. Experiment 2d. a Temporal representation of the F1 and F2 transitions of the seven tokens used in the [aV2a] experiment, where six tokens completed the transition to the vowel [i]. The last one had its transition cut back, i.e. its formant frequencies returned toward those of [a] after reaching a V2 endpoint vowel on the trajectory. The temporal representation of F3 is also shown (dotted line) for the case in which the [i] target is reached. b F1–F2 plane representation of the [aia] transition for the seven tokens used in experiment 2d (solid line). Note that, compared with the [aya] trajectory (broken line) shown in figure 7b, the trajectory is also curved but reaches [i]. c Results of experiment 2d: percent /aia/ (filled diamonds) and /aya/ (squares) responses as a function of the duration of [i] or the transition cutback point (i.e. the point of return to [a]), where 0 ms refers to the condition in which the vowel [i] is reached but the transition immediately returns toward [a].
As a control, in experiment 2c, we asked the subjects to label, using the same three response categories (i.e. the vowels /y/, /ø/, or /œ/), 200-ms steady-state vowels that had the same formant values as those of the V2 endpoints of the transitions in experiment 2b (fig. 7b). Results of this experiment are shown in figure 7d. Comparison of the category boundaries obtained in experiments 2b and 2c for the three intermediate vowels, shown in figures 7c and 7d, reveals a consistent perceptual overshoot for the vowels as long as the context is dynamic, as in experiment 2b.
In production terms, experiments 2a and 2b used tokens which were generated with a certain time course for the labial gesture and a different time course for the tongue gesture. In experiment 2a, the acoustic trace of the labial gesture, signaled by a lowering of the first two formants as compared to [ai], is delayed, whereas in experiment 2b, the acoustic trace of the labial gesture is present at the very beginning of the V1V2 transition and is preserved until the transition is completed.

In experiment 2d, we wanted to explore the perception of transitions following a trajectory that would suggest the presence of a labial gesture at the beginning and a release from this gesture toward the end of the trajectory. Tokens having the F1–F2 time course illustrated in figure 8a, and the corresponding trajectory in the F1–F2 plane illustrated in figure 8b, fulfill this objective. It has to be stressed that such a V1V2V1 sequence is unrealistic, i.e. tokens such as those shown in figures 8a and 8b are never encountered in natural speech and, consequently, cannot be obtained with a realistic production model such as the DRM which we have used so far. For this reason, stimuli for this experiment were generated using a simple formant synthesizer. The lack of realism of these tokens is manifest in that the trajectory in the F1–F2 plane suggests a labial gesture at the beginning (i.e. a drop in F1 and F2), but then moves toward, and actually reaches, [i] (fig. 8b), rather than [y] as in experiment 2b. In fact, none of the tokens generated according to this scheme were heard as natural speech. Preliminary listening tests showed that a steady-state [i] portion of 80–100 ms was necessary in order for an /aia/ sequence to be perceived. Thus, experiment 2d was conducted with a steady-state [i] portion present in six tokens, with respective durations of 100, 80, 60, 40, 20, and 0 ms, and one token not reaching [i], corresponding to a reduction with a cutback of –20 ms. The listener’s task was to label the percept as either /aia/ (i.e. with no labiality detected) or /aya/ (i.e. with labiality detected). Results for 5 listeners, shown in figure 8c, indicate that, when the trajectory reaches [i] and stays on it for between 0 and 20 ms, an /y/ V2 percept is induced.
Discussion
In the foregoing paragraphs we described two main experiments aimed at assessing the listener’s percept of vowel quality in V1V2V1 sequences in which vowel transitions were generated by DRM gestures, i.e. by a model previously shown to be appropriate for the study of speech production [Carré and Mody, 1997]. The general finding of the experiments was that, for sequences with a single gesture (i.e. [iai], experiment 1) and, under certain conditions, also for sequences with two gestures (i.e. [aya], experiment 2), we consistently observed what can be described as a perceptual overshoot. These results, taken together with those of Carré et al. [1994] and Divenyi et al. [1995], lead to the following conclusions:

(1) In production models, a command that does not reach its target, i.e. one in which the gesture is reduced through an articulatory undershoot, leads to an instance of vowel reduction in the acoustic domain [Lindblom, 1963]. In many cases, as demonstrated in experiments 1a and 2b, the perceptual system may completely recover the target not reached, either by using a mechanism that produces a perceptual overshoot [Lindblom and Studdert-Kennedy, 1967] or because of previous perceptual learning of such situations. To characterize the reduction phenomenon, the duration of the entire transition region of the reduced vowel (i.e. to and from a constant-duration V2) seems to be more important than the transition slopes (see experiment 1c). This duration, however, is not absolute but, rather, appears (within a certain time range) to be related to the total duration of the token (experiment 1d).
(2) Reaching the target does not guarantee that it will be perceived. Results of experiment 2a suggest that the transition is evaluated by a temporal integration mechanism which computes an average transition direction used by the listener to identify the vowel perceived. This mechanism is similar to the weighted-time average process postulated to operate for the evaluation of the pitch of sinusoidal and F0 glides [d’Alessandro and Castellengo, 1994].

(3) All the results highlight the usefulness of representing formant trajectories in the F1–F2 plane. This representation reveals even minute (i.e. 10- to 20-ms) temporal asynchronies between formants. Experiments on the discrimination of frequency-modulated signals suggest that asynchronies of this magnitude are easily detected by the auditory system [Moore and Sek, 1998].

(4) For reduction to occur reliably in the case of two gestures, coordination between them is needed, in order to obtain a formant trajectory shape (rectilinear or not) that can be unambiguously interpreted by the listener. For this to happen, the timing of the movement onset as well as the synchrony of the gestures must be precisely controlled. The transition that the listener hears is a composite of the acoustic traces of two command gesture components: to perceive /aya/ in a sequence generated by two concurrent command gestures, the acoustic effect of each gesture must be present to a sufficient degree.

(5) In agreement with Strange et al. [1983] and Strange and Bohn [1998], the results of the present experiments stress the importance of dynamic changes. Comparison of the results of experiments 2a and 2b serves to demonstrate this point: whereas, for a given target, one trajectory shape leads to a perceptual overshoot (experiment 2b), with another trajectory shape the target may not be perceived as such even when it is actually reached (experiment 2a).

Although we do not wish to engage in a protracted discussion of the theoretical aspects of vowel production and perception, the present results pose a puzzle as to the processes that putatively gave rise to them. To solve the puzzle, we propose two logical possibilities. The first, phonetic, possibility is that a given global formant trajectory pattern is identified by way of comparing its acoustic-perceptual characteristics to those of previously learned and stored tokens. Since, during the learning process, the stored trajectory tokens become associated with corresponding production gestures, it is plausible that the listener will attempt to associate any new trajectory he/she hears with a gesture previously associated with one of the stored trajectories. According to this possibility, identification of the gesture does not occur directly but, rather, is the consequence of the mapping process. The second possibility is that there is a mechanism whose function is to dissociate the acoustic traces generated by different gestures and to map these traces directly into a repertoire of learned gestures, i.e. the perceptual representation would no longer be phonetic but phonological. By relying on a small phonological rather than a large phonetic repertoire for the decoding of speech, this mechanism would appreciably reduce the information load on the perceptual system.
But would information reduction alone compel us to adopt the second possibility over the first, in an attempt to provide a general explanation for the perception of transitions? In fact, if we had terminated our investigation after experiment 1, both possibilities could account for the results obtained. On the other hand, results of
experiment 2 (and especially experiment 2d) can be explained only by the second.⁸ As far as the traces of the two gestures are concerned, our synthesis methods were able to manipulate them independently and to generate trajectories with various degrees of labiality present. A glance at the trajectory in the F1–F2 plane promptly reveals that labiality is signaled by a northeast-to-southwest vector that displaces a (nominally [ai]) southeast-to-northwest trajectory toward the origin of the F1–F2 coordinate system. In contrast, this southwestward displacement occurred too late in the trajectory used in experiment 2a and, had we failed to add tokens with a steady-state [y] portion, the labiality information available to the listeners would have been clearly insufficient. On the other hand, in experiment 2b, such a displacement was introduced early on (at the onset of the transition) and, as the results showed, it was unambiguously interpreted as a sign of labiality. This was true to an even greater extent for the trajectory of experiment 2d – a trajectory not producible by the human vocal tract – in which the labial information was present to such a high degree that, when [i] was the target V2, it was simply not perceived as such at all.

Considered as a whole, the results of the study suggest that the gestures had to be separately recognized: the listeners appear to have effectively dissociated the acoustic traces of the gestures from one another and recognized them independently. Such perceptual independence may be looked upon as analogous to the relative autonomy of gestures in production. Since relative autonomy is a sine qua non requirement of multiple command gestures, the independence of the perceptual effects of gestures observed in our experiments could open the door to mapping articulatory units into perceptual units – a possibility much discussed in the past 50 years and on which the dynamic approach we adopted sheds new light.

Clearly, gestures take place in time and so do the trajectories defined in the F1–F2 plane. Consequently, the vectors described above must reflect displacement over a certain period of time inside this coordinate system. Thus, the length and the direction of trajectory vectors must be the integral of instantaneous displacements in time. The existence of such an integrator is implied, for example, by the results of experiments 2a and 2d, which showed an /aia/ percept for a curved [aya] trajectory and an /aya/ percept for a curved [aia] trajectory. It is also clear that the temporal integrator must use a window function that, as suggested by the results of experiments 2a and 2b, would be skewed. A formal model specifying the temporal integration mechanism more closely will be proposed in a further report.
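Pending the formal model announced above, one way to realize a skewed integration window is an exponentially weighted time average of the displacement vectors. The exponential form, its time constant, and the direction of the skew are our assumptions, not the authors' proposal:

    # Skewed temporal integration: exponentially weighted average of F1-F2
    # displacement vectors, here weighting the later portion of the
    # trajectory more heavily. Illustrative sketch only.
    import numpy as np

    def skewed_mean_step(steps, tau_frames=5.0):
        """Exponentially skewed time average of per-frame displacement
        vectors (steps has shape [n_frames - 1, 2])."""
        n = len(steps)
        w = np.exp((np.arange(n) - (n - 1)) / tau_frames)  # late frames weigh more
        w /= w.sum()
        return (w[:, None] * np.asarray(steps)).sum(axis=0)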
⁸ In experiment 2d, the trajectories were fully artificial: a nearly realistic labial gesture was followed by a fully artificial ‘unlabial’ one, the sequence occurring concurrently with a single lingual gesture. Results of that experiment, therefore, argue for a phonological interpretation, because these unnatural trajectories are highly unlikely to have been part of a learned phonetic repertoire: the subjects could not possibly have encountered them when listening to speech and, consequently, could not have learned to classify them according to their acoustic-phonetic characteristics.
Acknowledgment

The authors are particularly grateful to their colleagues Michael Studdert-Kennedy and Maria Mody for their many insightful comments on the present article. The research benefited from support by the NATO Scientific Office, the Department of Veterans Affairs Medical Research, and the National Institute on Aging.
References

d’Alessandro, C.; Castellengo, M.: The pitch of short duration vibrato tones. J. acoust. Soc. Am. 95: 1617–1630 (1994).
Badin, P.; Fant, G.: Notes on the vocal tract computations. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2/3, pp. 53–107 (1984).
Beautemps, D.: Récupération des gestes de la parole à partir de trajectoires formantiques: identification de cibles vocaliques non atteintes et modèles pour les profils sagittaux des consonnes fricatives; thèse Institut National Polytechnique, Grenoble (1993).
Browman, C.; Goldstein, L.: Towards an articulatory phonology; in Ewan, Anderson, Phonol. Yb., pp. 219–252 (Cambridge University Press, Cambridge 1986).
Brownlee, S.A.: The role of sentence stress in vowel reduction and formant undershoot: a study of lab speech and informal spontaneous speech; PhD thesis University of Texas, Austin (1996).
Carré, R.; Chennoukh, S.; Divenyi, P.; Lindblom, B.: On the perceptual characteristics of ‘speech gestures’. J. acoust. Soc. Am. 96: S3326 (1994).
Carré, R.; Mody, M.: Prediction of vowel and consonant place of articulation. Proc. 3rd Meet. ACL Special Interest Group in Computational Phonol. SIGPHON 97, Madrid 1997, pp. 26–32.
Carré, R.; Mrayati, M.: Vowel-vowel trajectories and region modeling. J. Phonet. 19: 433–443 (1991).
Chiba, T.; Kajiyama, M.: The vowel: its nature and structure (Tokyo-Kaiseikan Publishing Company, Tokyo 1941).
Di Benedetto, M.G.: Frequency and time variations of the first formant: properties relevant to the perception of vowel height. J. acoust. Soc. Am. 86: 67–77 (1989).
Divenyi, P.; Lindblom, B.; Carré, R.: The role of transition velocity in the perception of V1V2 complexes. Proc. 13th Int. Congr. Phonet. Sci., Stockholm 1995, pp. 258–261.
Fant, G.: Speech sounds and features (MIT Press, Cambridge 1973).
Fant, G.: Vocal tract area and length perturbations. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 4, pp. 1–14 (1975).
Fowler, C.A.: Phonological and articulatory characteristics of spoken language. Haskins Lab. Status Rep. Speech Res., SR 109/110, pp. 1–12 (Haskins Laboratories, New Haven 1992).
Gay, T.: Effect of speaking rate on vowel formant movements. J. acoust. Soc. Am. 63: 223–230 (1978).
House, A.S.; Fairbanks, G.: The influence of consonant environment upon the secondary acoustical characteristics of vowels. J. acoust. Soc. Am. 25: 105–113 (1953).
Huang, C.B.: Perception of first and second formant frequency trajectories in vowels. Int. Congr. on Phonet. Sci., Tallinn 1987, pp. 194–197.
Johnson, K.: Speaker perception without speaker normalization. An exemplar model; in Johnson, Mullennix, Talker variability in speech processing, pp. 145–165 (Academic Press, New York 1997).
Kozhevnikov, V.A.; Chistovich, L.A.: Speech, articulation, and perception. JPRS-30543. NTIS (US Department of Commerce, 1965).
Kuehn, D.P.; Moll, K.L.: A cineradiographic study of VC and CV articulatory velocities. J. Phonet. 4: 303–320 (1976).
Kuhl, P.: Infants’ perception and representation of speech: development of a new theory. Proc. ICSLP ’92, Banff 1992, pp. 449–456.
Kuwabara, H.: An approach to normalization of coarticulation effects for vowels in connected speech. J. acoust. Soc. Am. 77: 686–694 (1985).
Lindblom, B.: Spectrographic study of vowel reduction. J. acoust. Soc. Am. 35: 1773–1781 (1963).
Lindblom, B.: Explaining phonetic variation: a sketch of the H and H theory; in Marchal, Hardcastle, Speech production and speech modelling, NATO ASI Series, pp. 403–439 (Kluwer Academic Publishers, Dordrecht 1990).
Lindblom, B.; Studdert-Kennedy, M.: On the role of formant transitions in vowel perception. J. acoust. Soc. Am. 42: 830–843 (1967).
Lindgren, R.; Lindblom, B.: Reduction of vowel chaos. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2, pp. 1–4 (1996).
Mattingly, I.G.: The global character of phonetic gesture. J. Phonet. 18: 445–452 (1990).
Moore, B.C.J.; Sek, A.: Discrimination of frequency glides with superimposed random glides in level. J. acoust. Soc. Am. 104: 411–421 (1998).
Mrayati, M.; Carré, R.; Guérin, B.: Distinctive regions and modes: a new theory of speech production. Speech Commun. 7: 257–286 (1988).
Mrayati, M.; Carré, R.; Guérin, B.: Distinctive regions and modes: articulatory-acoustic-phonetic aspects. A reply to Boë and Perrier comments. Speech Commun. 9: 231–238 (1990).
Nord, L.: Acoustic studies of vowel reduction in Swedish. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 4, pp. 19–36 (1986).
Peterson, G.E.; Barney, H.L.: Control methods used in the study of the vowels. J. acoust. Soc. Am. 24: 175–184 (1952).
Son, R.J.J.H. van: Vowel perception: a closer look at the literature. Proc. Inst. Phonet. Sci., Univ. Amsterdam 17: 33–64 (1993).
Son, R.J.J.H. van; Pols, L.C.W.: Formant movements of Dutch vowels in a text, read at normal and fast rate. J. acoust. Soc. Am. 92: 121–127 (1992).
Son, R.J.J.H. van; Pols, L.C.W.: Vowel identification as influenced by vowel duration and formant track shape. Proc. Eurospeech ’93, Berlin 1993, pp. 285–288.
Strange, W.: Dynamic specification of coarticulated vowels spoken in sentence context. J. acoust. Soc. Am. 85: 2135–2153 (1989).
Strange, W.; Bohn, O.S.: Dynamic specification of coarticulated German vowels: perceptual and acoustical studies. J. acoust. Soc. Am. 104: 488–504 (1998).
Strange, W.; Jenkins, J.J.; Johnson, T.L.: Dynamic specification of coarticulated vowels. J. acoust. Soc. Am. 74: 695–705 (1983).
Perceptual Processing

Phonetica 2000;57:170–180
Received: November 4, 1999 Accepted: February 22, 2000
General Auditory Processes Contribute to Perceptual Accommodation of Coarticulation

Lori L. Holtᵃ, Keith R. Kluenderᵇ

ᵃDepartment of Psychology and The Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, Pa., and ᵇDepartment of Psychology, University of Wisconsin, Madison, Wisc., USA
Abstract

The ability of listeners to recover speech information, despite dramatic articulatory and acoustic assimilation between adjacent speech sounds, is remarkable and central to understanding perception of fluent speech. Lindblom [J. Acoust. Soc. Am. 35: 1773–1781, 1963] shared with the field some of the most compelling early descriptions of the acoustic effects of coarticulation, and with Studdert-Kennedy [J. Acoust. Soc. Am. 42: 830–843, 1967] provided perceptual data that remain central to theorization about processes for perceiving coarticulated speech. In the years that followed, hypotheses by others intended to explain the ability to maintain perceptual constancy despite coarticulation have relied in some way or another upon relatively detailed reference to speech articulation. A number of new findings are reported here that suggest that general auditory processes, not at all specific to speech, contribute significantly to perceptual accommodation of coarticulation. Studies using nonspeech flanking energy, capturing minimal spectral aspects of speech, suggest that simple processes (which can be portrayed as contrastive) serve to ‘undo’ assimilative effects of coarticulation. Data from nonhuman animal subjects suggest broad generality of these processes. At a more mechanistic explanatory level, psychoacoustic and neurophysiological data suggestive of underlying sensory and neural mechanisms are presented. Lindblom and Studdert-Kennedy’s early hypotheses about the potential for such mechanisms are revived and supported. Copyright © 2000 S. Karger AG, Basel
Introduction
‘Lack of invariance’ between phonemes and attributes of the acoustic signal poses a central dilemma in understanding the nature of speech perception. The problem is that there seem to exist few (or no) unitary attributes in the acoustic signal that uniquely specify particular phonemes. The primary culprit behind the acoustic variability across productions of a given phoneme is coarticulation – the spatial and temporal overlap of adjacent articulatory activities.
The influence of coarticulation is reflected in the acoustic signal as severe context dependence. Lindblom [1963] documented one of the earliest and most influential demonstrations of context dependence in speech acoustics in an investigation of the influence of consonant context on vowel spectra. The acoustic patterns of vowels are affected by coarticulation with adjacent consonants such that formant frequencies may diverge considerably from patterns of isolated vowels. Lindblom [1963] found that second formant (F2) frequencies of vowels in [bɪb] and [bʊb] context were significantly lower in frequency than productions of the same vowels ([ɪ] and [ʊ]) in isolation, whereas vowel F2 frequencies were higher in productions of [dɪd] and [dʊd]. In each context, F2 frequency of the vowels was shifted away from the F2 frequency observed for the same vowel in isolation and toward the frequency of F2 for the adjacent consonants (lower for [bVb] and higher for [dVd]). This example of acoustic context dependence consequent to coarticulation is typical of the many observations of acoustic effects of coarticulation that have followed Lindblom’s early observations. In general, coarticulation is the compromise between where articulators have been and where they are going in connected speech [Stevens and House, 1963; Öhman, 1966; Mann, 1980]. This is reflected as assimilation in the acoustic patterns of speech. As a result of coarticulation, the acoustic realization of a particular phoneme can vary substantially across phonetic contexts, always in the direction of neighboring sounds. Such effects can be dramatic. Lindblom [1963], for example, reported F2 frequency deviations as great as 70% for a single vowel articulated across several consonant contexts.

The ability of listeners to recover speech information despite this dramatic context dependence is central to understanding speech perception. In the laboratory, this capacity has been indexed by observing perception of otherwise identical speech tokens in different phonetic contexts. For example, Lindblom and Studdert-Kennedy [1967] examined perception of consonant-vowel-consonant (CVC) syllables using three synthetic vowel series. The first consisted of a steady-state series varying perceptually from [ɪ] to [ʊ] via manipulation of F2 frequency. The second and third series were comprised of these same vowels, identical in midpoint formant frequencies, embedded in [wVw] and [jVj] contexts¹. Recall that, relative to F2 frequencies of isolated vowels, Lindblom [1963] observed lower F2 formant frequencies for vowels produced in [bVb] context and higher vowel F2s in [dVd] context. Complementary to Lindblom’s [1963] measures of production, listeners perceived more syllables as [ʊ] (with lower F2 frequencies) in the [wVw] context, and more as [ɪ] (with higher F2 frequencies) in the [jVj] context [Lindblom and Studdert-Kennedy, 1967]. Neighboring consonant context influenced vowel perception in a fashion complementary to the assimilative effects of coarticulation. These oft-cited findings have since been extended by Nearey [1989], who found the same pattern of results for [bVb] and [dVd] syllables and vowel sounds ranging from [‡] to [ʌ] and from [ʌ] to [ɛ]. Examples of perceptual accommodation of coarticulation have been documented across a wide variety of other phonetic contexts, including identification of stop consonants and fricatives [e.g. Mann, 1980; Mann and Repp, 1980, 1981; Repp, 1983].
¹ These stimuli were very similar to those Lindblom [1963] had explored earlier in that labial consonant contexts ([bVb] and [wVw]) have lower F2 onsets than do palatoalveolar contexts (i.e. [dVd] and [jVj]). Acoustically, the main distinction between the stops [b, d] and the semivowels [w, j] lies in the rate of change of the formant transitions.
In these studies, too, perception adjusts in a direction opposite that of coarticulatory assimilation. How is this accomplished? Lindblom and Studdert-Kennedy [1967] offered several potential theoretical interpretations of their results. They noted that their data were congenial to articulation-based theoretical perspectives like Motor Theory [Liberman et al., 1957] and Analysis-by-Synthesis [Stevens and House, 1963; Stevens and Halle, 1967]. In most later examinations, authors have interpreted their data more categorically, as evidence that context effects in speech perception originate in properties of speech distinct from its auditory characteristics [e.g. Mann, 1980; Repp, 1982; Williams, 1986; Fowler et al., 1990]. The theoretical thrust of these accounts, whether embodied by modular processes [Liberman et al., 1957; Liberman and Mattingly, 1985], Direct Realist emphasis on distal events [Fowler, 1986, 1996], or reference to tacit knowledge [Repp, 1982, 1983], suggests mechanisms of speech perception intimately linked with speech production.
Evidence for General Auditory Processes
Human Listeners

Lindblom and Studdert-Kennedy [1967] devoted most of their discussion to potential explanations in terms of general auditory processes. A number of new findings described below suggest that this emphasis was well warranted and prescient. General auditory mechanisms appear to contribute significantly to perceptual accommodation of coarticulation.

The first line of evidence involves close correspondence between the effects of neighboring speech and nonspeech context upon speech perception. These studies have used nonspeech flanking energy, capturing minimal spectral aspects of speech, to probe the relative specificity of context effects in speech perception. Holt et al. [1996] used this method to examine the effect of context observed by Lindblom and Studdert-Kennedy [1967]. Two sets of stimuli were constructed. The first set patterned those of earlier studies [Lindblom and Studdert-Kennedy, 1967; Nearey, 1989]. Synthetic speech stimuli varied perceptually from [bʌb] to [bɛb] and [dʌd] to [dɛd]. Along these series, each stimulus step increased in vowel F2 frequency, rendering a change in perceived vowel identity from [ʌ] to [ɛ]. The primary distinction between the [bVb] and [dVd] series was the frequency of F2 onset and offset. For [bVb] stimuli, F2 onset and offset were relatively low in frequency (800 Hz), whereas for [dVd] stimuli, F2 was higher (2,270 Hz). Pseudo-spectrograms illustrate these stimuli in figure 1.

A pair of hybrid nonspeech-speech stimulus series, comprised of glide-vowel-glide stimuli, also was constructed. The vowel segment was created using synthesis parameters identical to those used to create the midpoint of the CVC speech stimuli. The critical distinction was that, for these hybrid stimuli, nonspeech frequency-modulated (FM) sine-wave glides, instead of consonant transitions, served as flanking context. These glides modeled only the trajectory of the F2 center frequency for transitions in the speech stimuli. Although nonspeech glides share minimal spectral qualities with F2 transitions of CVC stimuli, they fall far short of perceptual or acoustic equivalence. Voiced speech sounds possess rich harmonic structure, with energy at each multiple of the fundamental frequency (F0). The FM glides, in contrast, had energy only at the nominal F2 center frequency, with no fine harmonic structure and no energy mimicking
Fig. 1. CVC series endpoints. Representative pseudo-spectrograms of the Holt et al. [1996] stimulus-series endpoints. The top row (a, b) represents [bVb] stimuli. The bottom row (c, d) shows [dVd] endpoints. The left column (a, c) corresponds to low-F2 ([ʌ]) stimulus endpoints and the right column (b, d) depicts high-F2 ([ɛ]) endpoints. Eight intermediate stimuli (not shown), with F2 midpoint frequency increasing in 50-Hz steps, comprised the remainder of the [bVb] and [dVd] series.
F1 or F3. In addition, formant transitions of speech stimuli are not much like FM tones, because the component frequencies of speech do not vary (at constant F0). Instead, the relative amplitudes of harmonics change with changing shapes of the spectral envelope. Given these differences, simple FM glides capture only minimal spectral characteristics of energy in the region of F2. Nevertheless, data from this experiment provide evidence that these minimal similarities are sufficient to elicit similar context effects on vowel perception. Parallel to the results of Lindblom and Studdert-Kennedy [1967], listeners’ vowel identification in CVC context was shifted such that, in the context of [dVd], vowels were more often identified as [ʌ], whereas in the context of [bVb], vowels were identified more often as [ɛ]. Remarkably, these results with speech were mirrored when the energy adjacent to the vowels consisted of simple FM glides.
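A flanking glide of the kind described can be synthesized in a few lines. This is our sketch, not the stimulus code of Holt et al. [1996]; the endpoint frequencies and duration are placeholders rather than the exact stimulus values:

    # A single FM sine tracing only an F2-like center-frequency trajectory:
    # no harmonics, no F1 or F3 energy -- the "minimal spectral" nonspeech
    # analog of a consonant transition.
    import numpy as np

    def fm_glide(f_start, f_end, dur_s, sr=10000):
        """Sine glide whose instantaneous frequency moves linearly
        from f_start to f_end over dur_s seconds."""
        t = np.arange(int(dur_s * sr)) / sr
        f_inst = f_start + (f_end - f_start) * t / dur_s
        phase = 2 * np.pi * np.cumsum(f_inst) / sr   # integrate frequency
        return np.sin(phase)

    b_like = fm_glide(800.0, 1100.0, 0.060)    # low, [bVb]-like F2 onset region
    d_like = fm_glide(2270.0, 1800.0, 0.060)   # high, [dVd]-like F2 onset region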
Fig. 2. Mean identification of vowels in consonant and glide contexts. Mean identification functions for vowels in the context of [bVb], [dVd], and two glide contexts that modeled these consonant contexts. Percent ‘PUTT’ responses as a function of midpoint F2 frequency are plotted by context.
As shown in figure 2, listeners labeled vowels flanked by glides modeling F2 of [bVb] more often as [ɛ] than the same vowels flanked by glides modeling F2 of [dVd]. The correspondence of speech and nonspeech-speech hybrid context effects suggests that general auditory processes contribute to perceptual accommodation of coarticulation.

In order to evaluate whether listeners were somehow interpreting the flanking energy as speech, Holt et al. [1996] conducted another study in which listeners were asked to identify flanking glides as [b] or [d]. Listeners were near perfect when asked to label the consonants of speech CVCs. However, when complying with the unusual request to label FM glides as consonants, listeners varied in the extent to which they were consistent in labeling glides as [b] or [d]. Some consistently labeled glides ‘backwards’ (descending glides as [b]), while others consistently labeled descending glides as [d]. Some listeners were consistently inconsistent in labeling. No matter how listeners labeled the FM glides, there was no correlation between labels and vowel identification.

The effect of nonspeech context generalizes to other phonetic contexts. For VCVs such as [iba], [ida], [uba], and [uda], the initial vowel context ([i] or [u]) exerts considerable influence on the perception of the following consonant ([b] or [d]). Based upon the results with CVCs, Holt [1999] predicted that FM glides mimicking only the F2 of naturally produced [i] or [u] would elicit effects on perception of a [ba] to [da] series like those found for preceding [i] and [u]. As was the case for CVC contexts, preceding FM glides produced a significant shift in identification in the same direction as full-formant vowel stimuli, despite an intervening silent interval corresponding to vocal tract closure. Similarly, Lotto and Kluender [1998] examined [VC CV] disyllables. Perception of a series of CVs as [da] or [ga] can be altered by the presence of a preceding VC [Mann, 1980]. When listeners identify members of a synthetic speech series varying acoustically in F3 onset frequency and perceptually from [ga] to [da], they report more [ga] percepts when the CV is preceded by the VC [al]. Conversely, listeners
identify the same stimuli more often as [da] when following [ar]. Lotto and Kluender [1998] demonstrated that sine-wave FM glides modeling the F3 transitions of [al] or [ar], and even constant-frequency tones set at the F3 offset frequencies of [al] or [ar], induced the same pattern of [da]-to-[ga] identification responses as did natural and synthetic speech tokens of [al] and [ar]. The effects of nonspeech context have thus been demonstrated for vowels in CVCs as well as for consonants in VCVs and VCCVs.

Nonhuman Animal Listeners

Finally, another line of evidence demonstrates that these effects extend across species. Lotto et al. [1997] trained Japanese quail (Coturnix coturnix japonica) to peck at a lighted key in response to the endpoints of the same [da] to [ga] series used by Lotto and Kluender [1998]. Birds trained to peck to [ga] pecked most vigorously to novel intermediate members of the [da]-to-[ga] series that were preceded by [al]. Correspondingly, quail trained to peck to [da] pecked most robustly when novel stimuli were preceded by [ar]. Japanese quail exhibited shifts in responses contingent upon the VC like those found for identification of CVs by human listeners. Considering that quail should have had no selective pressure to develop specialized mechanisms for processing human speech, these data suggest broad generality of the responsible perceptual processes. In addition, the quail had no experience with coarticulated speech, so their behavior cannot be explained on the basis of learned covariance among acoustic attributes of coarticulated speech. Related to this notion of experience playing a role, Lindblom and Studdert-Kennedy [1967] noted the potential contribution of ‘expectancy’ in the perception of coarticulated speech, although they did not favor such an explanation. Based upon the quail findings, it is safe to conclude that, while experience with coarticulated speech could play a role (see ‘The Role of Learning’), it is not a necessary condition. General perceptual processes, extensive enough to be in force for avian subjects, appear sufficient to elicit effects of context in speech perception.
Ecology of Sound-Producing Sources
Taken together, these multiple lines of evidence suggest that some general property of spectral processing in the auditory system contributes significantly to perceptual compensation for coarticulation. In light of the cross-species correspondence in perception of coarticulated speech, one may be encouraged to consider perceptual processes quite generally. Most broadly, one can begin considering perception of speech and other auditory events with an appreciation for the ecology of sound-producing sources. Perceptual systems must operate in accord with physical regularities in the environment. Most theorists (vision and other senses included) assume that perceptual systems are not general signal analysis devices. Instead, they evolved to use information as it is structured in the world. If one desires to bring general principles of auditory perception to bear upon speech perception, one must begin to think about the general physical principles that structure the auditory world. In this case, general principles with respect to physical constraints on sound-producing devices – principles presumed to have shaped the evolution of auditory systems – may be extended to the vocal tract as a sound source. At a minimum, there must be an abstract accordance between the process of perception and the ecology of sound-producing sources. For perception of speech and other acoustic events, principles that
govern a listener’s maintenance of a cohesive auditory world must be in agreement with (though not strictly isomorphic to) principles that govern output from a sound source, and general constraints upon sound-producing events (speech articulation being one example) should be approximated in the operation of sensory processes [Kluender, 1991; Lotto et al., 1997]. This view is roughly consistent with Roger Shepard’s [1984] theory of internalized constraints he calls ‘psychophysical complementarity’. What can one say, generally, about the ecology of sound-producing sources? It is known that most wordly physical structures, in contrast to electronic gadgets, produce sound that can change only so much so fast. Due to inertia and mass, physical systems tend to be assimilative. The configuration of a system at time t is significantly constrained by its configuration at time t–1. Connected speech follows this patterns.
Perceptual Contrast
Most generally, an effective way to 'neutralize' assimilatory effects is perceptual contrast. One frequently cited example in audition is frequency contrast [Cathcart and Dawson, 1928/1929; Christman, 1954]: one sound is perceived as higher following a sound that is lower, and vice versa. Effects of adjacent [b] and [d] in CVCs are consistent with a contrast account, as listeners report more percepts of [ɪ] (higher F2) flanked by lower-frequency [b] and more percepts of [ʊ] flanked by higher-frequency [d] [Holt et al., 1996]. Similarly, listeners are more likely to report hearing [d] (higher F2) following [u] (lower F2), and [b] following [i] [Holt, 1999]. Finally, for effects of [al] and [ar] on perception of following [da] or [ga], to the extent that the frequency composition of the offset is higher (F3 for [al]), listeners are more likely to report hearing [ga] (lower F3 onset frequency), and vice versa [Lotto and Kluender, 1998]. For all the examples above, and for all reported cases, the influence of coarticulation is assimilative, and context effects in speech perception can be depicted as contrastive. A general mechanism of perceptual contrast may serve as a valuable tool in maintaining perceptual constancy across the context dependence produced by sound-source assimilation. Contrast in itself, however, is a designation that does not implicate any specific auditory mechanism(s). Across perceptual modalities, contrast is an important mechanism for exaggerating differences between neighboring objects and events. The best-known examples are in the visual domain: enhancement of edges produced by lateral inhibition [Hartline and Ratliff, 1957], lightness judgments [Koffka, 1935], and judgment of line orientation [Gibson, 1933]. Context effects in behavior are as varied as tempo of behavior [Cathcart and Dawson, 1928/1929] and weight lifting [Guilford and Park, 1931]. Mechanisms of contrast exist for every perceptual modality [von Bekesy, 1967; Warren, 1985]. Across domains, contrast is the familiar signature of mechanisms that serve to exaggerate change in the physical stimulus and to maintain an optimal dynamic range. Perceptual contrast, in this case spectral contrast, may play an important role in perception of coarticulated speech.
Relationship to Psychoacoustics and Auditory Neurophysiology
However alluring the broad concept of contrast may be, its ubiquity betrays the fact that it falls short as a rigorous explanation. Fortunately, there exist precedents in auditory perception that both lend greater precision to the construct and reveal potential explanations in underlying processes. At least one class of psychoacoustic findings, known collectively as 'auditory enhancement', bears note for its similarity to the present findings. Auditory enhancement refers, generally, to the observation that if one member of a set of equal-amplitude harmonics is omitted and then reintroduced, it stands out perceptually from the rest of the components. Most relevant to the present data, Summerfield et al. [1984, 1987] have related auditory enhancement to speech perception. When a uniform spectrum composed of equal-amplitude harmonics is preceded by a spectrum complementary to a particular vowel (with troughs replacing formants and vice versa), listeners report hearing a vowel during presentation of the uniform spectrum [Summerfield et al., 1984]. Moreover, the vowel percept is appropriate for a vowel with formants at frequencies where there were troughs in the preceding spectrum. In a similar vein, a uniform harmonic spectrum precursor intensifies perception of vowels defined by very modest formants of 2–5 dB [Summerfield et al., 1987]. For each of these cases of auditory enhancement, the results may be described as contrast between two complex sounds: the spectral composition of a preceding auditory stimulus shifts perception of a following stimulus such that frequencies absent in the precursor are enhanced relative to frequencies represented in the spectral makeup of the precursor. Cast in this way, these findings are strikingly similar to the effects of context upon vowel identification reported here. Speech and nonspeech-speech hybrids modeled after [dVd] (which have predominantly high-frequency F2 composition), for example, generate more low-F2 vowel identifications ([ʊ]). This perceptual shift is toward frequencies less well represented in adjacent consonants and their nonspeech analogues. Neural adaptation or adaptation of suppression may serve to enhance changes in spectral regions where previously there had been relatively little energy. As with so many other central issues for speech perception, Lindblom and Studdert-Kennedy [1967, p. 840] were ahead of their time in noting how some effects of acoustic context can be 'exemplified for instance by adaptation and fatigue'. Although adaptation has thus far been explored only in the service of explaining auditory enhancement effects in psychoacoustic studies, there is great potential for extension to speech, as Lindblom and Studdert-Kennedy [1967] conjectured. Delgutte [1996] and Delgutte et al. [1996] have established a case for a broad role of adaptation in perception of speech, noting that adaptation may enhance spectral contrast between sequential speech segments. This enhancement is proposed to arise because neurons adapted by stimulus components spectrally close to their preferred, or characteristic, frequency are relatively less responsive to subsequent energy at that frequency, whereas components not present in a prior stimulus are encoded by more responsive, unadapted neurons. Often, arguments that effects in speech perception arise from 'general auditory' processes are taken as suggesting that such mechanisms must be peripheral.
If, by peripheral, one means to imply that mechanisms exert their influence in the cochlea or at the level of the auditory nerve, then this is almost certainly false for the case of context effects in speech perception. Evidence in other stimulus paradigms suggests that
more central regions of the auditory system are likely involved. Effects of context persist under dichotic presentation, and the influence of preceding spectral energy on speech identification decays gradually over a time course that extends into hundreds of milliseconds [Lotto, 1996; Holt, 1999]. In addition, neurophysiological investigation has borne little evidence of neural encoding of speech context effects or auditory enhancement at the level of the auditory nerve [Palmer et al., 1995; Holt and Rhode, 2000]. The mechanisms appear to have a more central origin. It appears that, more than 30 years ago, Lindblom and Studdert-Kennedy [1967] were well ahead of their time. The recent findings described above are consistent with their experimental precedents, and their auditory hypothesis appears secure, supported better than ever by psychoacoustic and neurophysiological findings. Contrary to much theorizing in the intervening years, general perceptual mechanisms of spectral contrast play a substantial role in compensating for context-dependent patterns of CVC coarticulation.
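The adaptation account just described can be made concrete with a toy computation. The following sketch is purely illustrative and is not a model from Delgutte [1996] or any other work cited here; the channel normalization and the carry-over parameter alpha are assumptions introduced for the example. It shows only the qualitative behavior: frequency bands excited by a precursor respond less to a following sound, so the following sound's representation is pushed away from the precursor's spectrum, i.e., spectral contrast.

```python
import numpy as np

def contrast_enhance(precursor, target, alpha=0.6):
    """Attenuate the target spectrum in proportion to energy present in the
    precursor, mimicking adaptation of frequency-tuned channels.
    precursor, target: arrays of band energies on a common frequency axis.
    alpha: fraction of adaptation carried over (0 = none, 1 = complete)."""
    adaptation = alpha * precursor / (precursor.max() + 1e-12)
    return target * (1.0 - adaptation)

# Toy demonstration: a precursor with high-frequency emphasis (like [d])
# leaves the same neutral target relatively richer in low frequencies,
# favoring a lower-F2 vowel percept -- a contrastive shift.
freqs = np.linspace(200, 4000, 8)
precursor_d = np.where(freqs > 2000, 1.0, 0.2)  # energy concentrated high
target = np.ones_like(freqs)                    # spectrally neutral target
print(contrast_enhance(precursor_d, target).round(2))
```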
The Role of Learning
The implication of general auditory processes should not be taken to suggest that there is no more to perceptual accommodation of coarticulation. It is not being claimed that spectral contrast explains all perceptual accommodation of the acoustic consequences of coarticulation. The clearest counterexamples are instances in which perception of preceding sounds is affected by following sounds, e.g. [al] to [ar] preceding [ʃ] or [s] [Mann and Repp, 1981]; [u] to [i] preceding [d] or [g] [Ohala and Feder, 1994]; [ʃ] to [s] preceding [a] or [u] [Mann and Repp, 1980]. Although backward contrast effects abound in studies of visual perception, there is nothing in our present understanding of auditory neurophysiology that suggests that adaptation and suppression exert such effects. Aside from auditory processes, experience with speech is also likely to play a significant role in perception. Lindblom and Studdert-Kennedy's [1967] consideration of 'expectancies' may also wear well with time. In coarticulated speech, owing to articulatory constraints, there is a strong correlation between the acoustic properties of preceding and following sounds. When one considers perceptual learning, it is generally conceptualized in terms of perceptual systems coming to 'expect' this correlated structure through experience with it. By way of concrete example, with experience, listeners 'expect' [ɪ] to have lowered frequencies following [b] such that, when judging stimuli varying along a continuum from [ʊ] to [ɪ], listeners are more likely to report hearing [ɪ] following [b]. Coarticulation yields multiple covariances in the signal that are orderly inasmuch as they reflect dependable regularities in the physical constraints upon articulators. Learning is precisely about behavior coming to reflect covarying aspects of the environment. Holt et al. [1999] recently reported results from two studies using Japanese quail and chinchilla (Chinchilla laniger) to investigate the role of F0 in perception of voicing for stop consonants. Human listeners are more likely to perceive stimuli with higher F0 values as voiceless – a pattern that follows regularities in production [e.g. Chistovich, 1969; Haggard et al., 1970; Fujimura, 1971; Whalen et al., 1993]. Some investigators [Kingston and Diehl, 1994] have suggested that higher F0 values enhance the perception of voicelessness by exploiting auditory predispositions. A second hypothesis is that this perceptual pattern arises through experience with F0/VOT
covariation in production. The first hypothesis was tested using chinchillas trained to respond differentially to stops depending upon whether the stops were voiced or voiceless. Absent experience with covariation between F0 and VOT, there was no effect of F0 on responses to novel stimuli with intermediate VOTs following training. In a second experiment, three groups of Japanese quail were trained to respond differentially to voiced versus voiceless stops under three different patterns of F0/VOT covariation. In training, VOT and F0 varied in the natural pattern (longer VOT, higher F0), in an inverse pattern (longer VOT, lower F0), or in a random pattern (F0 and VOT uncorrelated). When tested on stimuli with intermediate values of VOT, the third group of quail (no correlation) replicated the chinchilla results, with no significant effect of F0. For the other groups, responses followed the experienced pattern of covariation. These data highlight the potential importance of experienced covariation in speech perception. Such findings do not diminish the importance of contrast processes in perception of coarticulated speech, but they serve as a reminder that there is likely more to the full explanation.
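The logic of the quail experiment can be illustrated with a small simulation. The sketch below is not a model of avian learning and uses invented values rather than the stimuli of Holt et al. [1999]; it simply shows that a generic statistical learner trained under natural, inverse, or random F0/VOT covariation acquires a positive, negative, or near-zero weight for F0, and would therefore show correspondingly different F0 effects on ambiguous-VOT test items.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_set(pattern, n=400):
    """Synthesize (VOT, F0) training pairs. All numbers are invented
    illustrative values, not the stimuli used with the quail."""
    labels = np.repeat([0.0, 1.0], n // 2)          # 0 = voiced, 1 = voiceless
    vot = np.where(labels == 0, rng.normal(20, 20, n), rng.normal(60, 20, n))
    if pattern == 'natural':                        # longer VOT, higher F0
        f0 = 120 + 30 * labels + rng.normal(0, 10, n)
    elif pattern == 'inverse':                      # longer VOT, lower F0
        f0 = 150 - 30 * labels + rng.normal(0, 10, n)
    else:                                           # F0 uncorrelated with VOT
        f0 = rng.normal(135, 15, n)
    return np.column_stack([vot, f0]), labels

def fit_logistic(X, y, lr=0.05, steps=4000):
    """Plain logistic regression fit by gradient ascent."""
    Z = np.column_stack([np.ones(len(X)), (X - X.mean(0)) / X.std(0)])
    w = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Z @ w))
        w += lr * Z.T @ (y - p) / len(y)
    return w

for pattern in ('natural', 'inverse', 'random'):
    X, y = make_training_set(pattern)
    w = fit_logistic(X, y)
    # Sign of the F0 weight mirrors the trained covariation pattern:
    # positive -> higher F0 pushes ambiguous-VOT items toward 'voiceless'.
    print(f'{pattern:8s} F0 weight: {w[2]: .2f}')
```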
Conclusion
In closing, it is appropriate that Lindblom and Studdert-Kennedy [1967, p. 842] get the last word: ‘It is worth reiterating… that mechanisms of perceptual analysis whose operations contribute to enhancing contrast in the above-mentioned sense are precisely the type of mechanisms that seem well suited to their purpose given the fact that the slurred and sluggish manner in which human speech sound stimuli are often generated tends to reduce rather than sharpen contrast.’
References

Bekesy, G. von: Sensory inhibition (Princeton University Press, Princeton 1967).
Cathcart, E.P.; Dawson, S.: Persistence (2). Br. J. Psychol. 19: 343–356 (1928/1929).
Chistovich, L.A.: Variations of the fundamental voice pitch as a discriminatory cue for consonants. Sov. Physics-Acoustics 14: 372–378 (1969).
Christman, R.J.: Shifts in pitch as a function of prolonged stimulation with pure tones. Am. J. Psychol. 67: 484–491 (1954).
Delgutte, B.: Auditory neural processing of speech; in Hardcastle, Laver, The handbook of phonetic sciences (Blackwell, Oxford 1996).
Delgutte, B.; Hammond, B.M.; Kalluri, S.; Litvak, L.M.; Cariani, P.A.: Neural encoding of temporal envelope and temporal interactions in speech; in Ainsworth, Greenberg, Proc. Auditory Basis of Speech Perception (European Speech Communication Association, 1996).
Fowler, C.A.: An event approach to the study of speech perception from a direct-realist perspective. J. Phonet. 14: 3–28 (1986).
Fowler, C.A.: Listeners do hear sounds, not tongues. J. acoust. Soc. Am. 99: 1730–1741 (1996).
Fowler, C.A.; Best, C.T.; McRoberts, G.W.: Young infants' perception of liquid coarticulatory influences on following stop consonants. Percept. Psychophys. 48: 559–570 (1990).
Fujimura, O.: Remarks on stop consonants: synthesis experiments and acoustic cues; in Hammerich, Jakobson, Zwirner, Form and substance: phonetic and linguistic papers presented to Eli Fischer-Jørgensen, pp. 221–232 (Akademisk Forlag, Copenhagen 1971).
Gibson, J.J.: Adaptation, after-effect and contrast in the perception of curved lines. J. exp. Psychol. 16: 1–31 (1933).
Guilford, J.P.; Park, D.G.: The effect of interpolated weights upon comparative judgments. Am. J. Psychol. 43: 589–599 (1931).
Haggard, M.; Ambler, S.; Callow, M.: Pitch as a voicing cue. J. acoust. Soc. Am. 47: 613–617 (1970).
Hartline, H.K.; Ratliff, F.: Inhibitory interaction of receptor units in the eye of Limulus. J. gen. Physiol. 40: 1357–1376 (1957).
Holt, L.L.: Auditory constraints on speech perception: an examination of spectral contrast; doct. diss. University of Wisconsin-Madison (unpublished, 1999).
Holt, L.L.; Lotto, A.J.; Kluender, K.R.: Perceptual compensation for vowel undershoot may be explained by general perceptual principles. 131st Meet. Acoust. Soc. Am., Indianapolis 1996.
Holt, L.L.; Lotto, A.J.; Kluender, K.R.: Influence of fundamental frequency on stop-consonant voicing perception: a case of learned covariation or auditory enhancement? 138th Meet. Acoust. Soc. Am., Columbus 1999.
Holt, L.L.; Rhode, W.S.: Examining context-dependent speech perception in the chinchilla cochlear nucleus. 23rd Midwinter Meet. Assoc. Res. Otolaryngol. (2000).
Kingston, J.; Diehl, R.L.: Phonetic knowledge. Language 70: 419–454 (1994).
Kluender, K.R.: Psychoacoustic complementarity and the dynamics of speech perception and production. Perilus 14: 131–135 (1991).
Koffka, K.: Principles of Gestalt psychology (Harcourt, Brace, New York 1935).
Liberman, A.M.; Cooper, F.S.; Shankweiler, D.P.; Studdert-Kennedy, M.: Perception of the speech code. Psychol. Rev. 74: 431–461 (1967).
Liberman, A.M.; Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21: 1–36 (1985).
Lindblom, B.E.F.: Spectrographic study of vowel reduction. J. acoust. Soc. Am. 35: 1773–1781 (1963).
Lindblom, B.; Studdert-Kennedy, M.: On the role of formant transitions in vowel recognition. J. acoust. Soc. Am. 42: 830–843 (1967).
Lotto, A.J.: General auditory constraints in speech perception; doct. diss. University of Wisconsin-Madison (unpublished, 1996).
Lotto, A.J.; Kluender, K.R.: General contrast effects in speech perception: effect of preceding liquid on stop consonant identification. Percept. Psychophys. 60: 602–619 (1998).
Lotto, A.J.; Kluender, K.R.; Holt, L.L.: Perceptual compensation for coarticulation by Japanese quail (Coturnix coturnix japonica). J. acoust. Soc. Am. 102: 1134–1140 (1997).
Mann, V.A.: Influence of preceding liquid on stop-consonant perception. Percept. Psychophys. 28: 407–412 (1980).
Mann, V.A.; Repp, B.H.: Influence of vocalic context on perception of the [sh]-[s] distinction. Percept. Psychophys. 28: 213–228 (1980).
Mann, V.A.; Repp, B.H.: Influence of preceding fricative on stop consonant perception. J. acoust. Soc. Am. 69: 548–558 (1981).
Nearey, T.: Static, dynamic, and relational properties in vowel perception. J. acoust. Soc. Am. 85: 2088–2113 (1989).
Ohala, J.J.; Feder, D.: Listeners' normalization of vowel quality is influenced by 'restored' consonantal context. Phonetica 51: 111–118 (1994).
Öhman, S.E.G.: Coarticulation in VCV utterances: spectrographic measurements. J. acoust. Soc. Am. 39: 151–168 (1966).
Palmer, A.R.; Summerfield, Q.; Fantini, D.A.: Responses of auditory-nerve fibers to stimuli producing psychophysical enhancement. J. acoust. Soc. Am. 97: 1786–1799 (1995).
Repp, B.H.: Phonetic trading relations and context effects: new experimental evidence for a speech mode of perception. Psychol. Bull. 92: 81–110 (1982).
Repp, B.H.: Bidirectional contrast effects in the perception of VC-CV sequences. Percept. Psychophys. 33: 147–155 (1983).
Shepard, R.N.: Ecological constraints on internal representation: resonant kinematics of perceiving, imaging, thinking and dreaming. Psychol. Rev. 91: 417–447 (1984).
Stevens, K.N.; Halle, M.: Remarks on analysis by synthesis and distinctive features; in Wathen-Dunn, Models for the perception of speech and visual form (MIT Press, Cambridge 1967).
Stevens, K.N.; House, A.S.: Perturbations of vowel articulations by consonantal context: an acoustical study. J. Speech Hear. Res. 6: 111–128 (1963).
Summerfield, Q.; Haggard, M.P.; Foster, J.; Gray, S.: Perceiving vowels from uniform spectra: phonetic exploration of an auditory aftereffect. Percept. Psychophys. 35: 203–213 (1984).
Summerfield, Q.; Sidwell, A.; Nelson, T.: Auditory enhancement of changes in spectral amplitude. J. acoust. Soc. Am. 81: 700–707 (1987).
Warren, R.M.: Criterion shift rule and perceptual homeostasis. Psychol. Rev. 92: 574–584 (1985).
Whalen, D.H.; Abramson, A.S.; Lisker, L.; Mody, M.: F0 gives voicing information even with unambiguous voice onset times. J. acoust. Soc. Am. 93: 2152–2159 (1993).
Williams, D.R.: Role of dynamic information in the perception of coarticulated vowels; doct. diss. University of Connecticut (unpublished, 1986).
Perceptual Processing Phonetica 2000;57:181–188
Received: October 30, 1999 Accepted: March 12, 2000
Adaptive Dispersion in Vowel Perception
Keith Johnson
Department of Linguistics, Ohio State University, Columbus, Ohio, USA
Abstract
The 'hyperspace effect' in vowel perception may be taken as evidence that adaptive dispersion is an active perceptual process. However, the earlier evidence for the effect came from isolated vowel stimuli spoken in a voice unfamiliar to the listeners. The experiment reported in this paper addressed both of these potential concerns and found that consonant context and talker familiarity do modulate the hyperspace effect. However, the reductions induced by context and familiarity were slight: listeners' preferred perceptual spaces remained hyperarticulated relative to the production vowel space. Copyright © 2000 S. Karger AG, Basel
Adaptive dispersion refers to the hypothesis that the distinctive sounds of a language tend to be positioned in phonetic space so as to maximize perceptual contrast [Liljencrants and Lindblom, 1972; Lindblom and Engstrand, 1989; Lindblom, 1990]. For example, vowel systems across languages tend to utilize the available acoustic vowel space so that maximal auditory contrast is maintained: three-vowel systems tend to be composed of /i/, /a/ and /u/; five-vowel systems tend to be composed of /i/, /e/, /a/, /o/ and /u/, and so on. Liljencrants and Lindblom [1972] modeled this tendency by treating each vowel category as a repeller in a dynamical system. They argued that the positions of vowels in the acoustic vowel space are influenced by this dynamical repelling force, which has come to be called 'adaptive dispersion'. The concept of adaptive dispersion has wide applicability in accounting for language sound systems. For example, Padgett [in press] discusses a case of adaptive dispersion (in his terms, 'contrast dispersion') in consonants. He observes that the traditionally plain and palatalized consonants of Russian are realized phonetically as a contrast between velarized and palatalized consonants [see also Halle, 1959]. Despite the appealing logic of adaptive dispersion as an explanatory principle, there is very little psycholinguistic evidence that adaptive dispersion is an active perceptual biasing pressure on linguistic systems. Such psycholinguistic evidence can be adduced, however, from the results of Johnson et al. [1993; see also Bradlow, 1995, 1996]. Johnson et al. [1993] asked listeners to choose the 'best' exemplar of different English vowel categories from an array of synthetic vowels that spanned a range of F1–F2 combinations.
Fig. 1. Perceptual hyperspace results redrawn from Johnson et al. [1993]. Results from the corner vowels /i/, /u/, /æ/ and /ɑ/ are shown. a Average formant values for male speakers are compared with the results of the pretest in Johnson et al. [1993]. b Results from three perception tests using different instructions ('best' versus 'as you say it') and different stimuli (controlling for intrinsic F0 and duration). The points within each condition are connected by a spline function for illustrative purposes.
The stimuli were natural-sounding five-formant steady-state vowels produced with a software formant synthesizer. The authors found that listeners chose vowel qualities that were more disperse in the acoustic vowel space than the vowel qualities produced in normal speech (fig. 1a). This perceptual 'hyperspace' effect was undiminished under instructions to choose the vowel 'as you would say it' or with stimuli that had more natural intrinsic F0 and duration (fig. 1b). From these results showing that listeners prefer an expanded acoustic vowel space, Johnson et al. [1993] suggested that the hyperspace effect reflects listeners' production targets, which are subject to undershoot in production [Lindblom, 1963]. Alternatively, the hyperspace effect could be taken as evidence of adaptive dispersion in vowel perception. It is this latter interpretation that forms the basis for the experiment reported in this paper.
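The dispersion principle invoked above can be made concrete with a small numerical sketch. The code below is not the Liljencrants and Lindblom [1972] model, which worked in a perceptually scaled formant space; here an abstract unit square stands in for the available auditory space, and the 'repelling force' is implemented by hill-climbing on a sum of inverse-squared pairwise distances. The space, the potential, and the step sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def crowding(v):
    """Sum of inverse-squared pairwise distances: the potential that the
    repelling-vowel idea seeks to minimize (simplified here)."""
    d2 = ((v[:, None, :] - v[None, :, :]) ** 2).sum(-1)
    i, j = np.triu_indices(len(v), k=1)
    return (1.0 / d2[i, j]).sum()

def disperse(n_vowels, steps=20000, step_size=0.03):
    """Random hill-climbing: nudge one vowel at a time, keep moves that
    reduce crowding, and clip positions to the unit-square 'vowel space'."""
    v = rng.random((n_vowels, 2))
    for _ in range(steps):
        trial = v.copy()
        k = rng.integers(n_vowels)
        trial[k] = np.clip(trial[k] + rng.normal(0, step_size, 2), 0, 1)
        if crowding(trial) < crowding(v):
            v = trial
    return v

# Three vowels drift toward the corners of the space, echoing the
# cross-linguistic preference for /i a u/ in three-vowel systems.
print(disperse(3).round(2))
```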
The Problem
Though the results of Johnson et al. [1993] suggest that adaptive dispersion is an active perceptual process, two aspects of the experiments limit their relevance for explaining cross-linguistic sound patterns. First, the stimuli used by Johnson et al. [1993] were isolated synthetic vowels. This is a methodological weakness because the words used as visual prompts (‘heed’, ‘had’, ‘who’d’, etc.) illustrated the vowels in consonantal context while the auditory stimuli were isolated vowels. Because consonants have a large influence on vowel formant values [Lindblom, 1963; Stevens and House, 1963] and listeners perceptually compensate for vowel target undershoot [Lindblom and Studdert-Kennedy, 1967], the hyperspace effect may have been artificially
enhanced by the use of isolated vowel stimuli. The second limitation is that the stimuli were produced by a computer ‘voice’ that was unfamiliar to the listeners. The impact of adaptive dispersion as a biasing effect on language sound systems may be overestimated if the effect is a talker-contingent process that is more likely to occur when listening to the speech of an unfamiliar talker. Recent evidence regarding the talker-contingent nature of speech perception suggests that the hyperspace effect may be reduced when the listener is familiar with the talker. We turn now to a brief review of this evidence.
Talker Familiarity Effects
Interactions between talker identity and speech processing have been the focus of some recent studies. Mullennix et al. [1989; Creelman, 1957] found that intelligibility in noise is lower in mixed-talker lists than it is in same-talker lists. This 'talker effect' seems to suggest that listeners adapt to talkers over time and that when this process is not interrupted, as it is in the mixed-talker condition, listeners are better able to recognize words. Mullennix and Pisoni [1990; see also Green et al., 1997] found that talker information cannot be treated as a separable dimension in phonetic perception: using the Garner paradigm, they found that irrelevant talker variation interferes with speeded phonetic classification. In addition to these studies indicating that the voice of the talker is not 'normalized out' during speech processing, there are some studies indicating that the listener's familiarity with the talker has an impact on processing. For example, Walker et al. [1995] found that face-voice incongruency reduces the McGurk effect, but only if listeners are familiar with the talker. This study is important because it shows that a supposedly low-level speech processing mechanism – the integration of visual and acoustic speech cues – is mediated by personal information. When listeners were familiar with the visual talker shown in the McGurk display, and the voice of the stimulus was not that of the visual talker but of someone else, the visual/auditory integration usually found with such stimuli was greatly reduced. Visual/auditory integration for these face-voice incongruent stimuli was not reduced (relative to congruent stimuli) for listeners who were not familiar with the talker. In addition to showing that talker identity interacts with speech perception [see also Johnson et al., 1999], this result suggests that speech processing is influenced by familiarity with the talker. Further evidence of the effect of talker familiarity in speech processing comes from a study reported by Nygaard et al. [1994], who found that speech was more intelligible when listeners were familiar with the talker. This result is directly relevant to the hyperspace effect in the following way: if speech produced by an unfamiliar talker is less intelligible than speech produced by a familiar talker, then it stands to reason that listeners would prefer an expanded vowel space for an unfamiliar voice. In this way, the unfamiliar synthetic talker used by Johnson et al. [1993] may have contributed to the hyperspace effect. If the hyperspace effect is purely an artifact of the use of isolated vowel stimuli produced by an unfamiliar voice, then their finding cannot reasonably be taken as evidence that adaptive dispersion, as an active perceptual process, has much practical impact in shaping language sound systems. These potentially invalidating factors were tested in the experiment reported below.
Methods

Production Data
One male native speaker of American English (K.J.) recorded five repetitions of a word list including 'heed', 'had', 'who'd', 'hod', and 'hud'. Each word was read in isolation in a careful list-reading style with a falling intonation contour on each word. Frequency values of the first three formants at vowel midpoint were measured from spectrograms of these tokens. The average F1 and F2 frequencies of /i/, /u/, /æ/ and /ɑ/ were compared with listeners' perceptual choices for F1 and F2. One production of 'hud' was selected for use in constructing the perceptual stimuli.

Stimuli
The stimuli for this experiment were modeled on those used by Johnson et al. [1993]: they were 330 synthetic stimuli that sampled the acoustic vowel space in equal Bark intervals on F1 and F2, producing a grid of F1 and F2 combinations. F4 and F5 were steady-state and had the same values for all of the stimuli. F3 was calculated by rule as in Johnson et al. [1993]. The stimuli in this experiment differed from those in Johnson et al. [1993] in that they were given a final /d/ consonant using final formant trajectories calculated by rule, and a naturally produced final closure interval and /d/ release from a production of 'hud'. The voice source for the stimuli was also different from the synthetic source used in Johnson et al. [1993]. In the current experiment the voice source was the LPC residual signal from one token of 'hud'. This source function, which included both the glottal frication of /h/ and voicing during the vowel, was filtered by a bank of time-varying band-pass filters having formant values as in experiment 1 of Johnson et al. [1993], with the addition of /d/-final formant trajectories. The resulting stimuli sounded like the talker (K.J.), who produced the voice source token.

Listeners
Twenty-two listeners served as volunteers in this study. They fell into three groups: 7 naive listeners, 7 students, and 8 colleagues. The naive listeners were unfamiliar with the voice of the talker who produced the stimuli. The students were currently enrolled in a course being taught by the talker and so had some familiarity with his voice, and the colleagues had worked with the talker for 1–5 years.

Procedure
A grid of boxes was presented to listeners on a computer monitor. Each box was associated with a synthetic vowel stimulus. When the listener clicked on one of the boxes, the associated stimulus was played over earphones at a comfortable listening level. Listeners could hear any of the 330 stimuli at any point in a trial by clicking on the associated visual box. They could repeat a stimulus by clicking on a box more than once. There were no constraints on the amount of time that a listener could take in a trial or on the order of stimuli presented during a trial. In each trial one of the test words, 'heed', 'who'd', 'had', or 'hod', was printed in ordinary orthography at the top of the computer screen. The listeners' task on each trial was to select the one stimulus from the grid of 330 that sounded most like the visually presented word. Each word was presented to each listener 5 times in random order, for a total of 20 trials per listener. The groups of listeners had slightly different instructions. Naive listeners were instructed to choose the best example of the word; this instruction corresponds to the 'best exemplar' condition in Johnson et al. [1993]. Students and colleagues, on the other hand, were asked to select a stimulus for each word keeping in mind the question: 'Is this what K.J.
would sound like saying this word?’ At the conclusion of the experiment, colleagues were also asked to rate their familiarity with the speech of the talker (1 = highly familiar, 7 = not familiar at all), and the success of the synthesis (1 = sounded very much like K.J., 7 = did not sound at all like K.J.).
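As a concrete illustration of the stimulus grid described above, the sketch below generates F1 × F2 combinations at equal Bark intervals. The paper does not state which Hz-to-Bark formula was used, so Traunmüller's (1990) approximation is assumed, and the frequency ranges and 10 × 33 grid shape are invented for illustration; only the total of 330 stimuli matches the description.

```python
import numpy as np

def hz_to_bark(f):
    # Traunmueller's approximation -- an assumption; the paper does not
    # specify which auditory scale formula was used.
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    return 1960.0 * (z + 0.53) / (26.28 - z)

def formant_grid(f1_range=(250, 850), f2_range=(900, 2700), n1=10, n2=33):
    """F1 x F2 pairs spaced at equal Bark intervals within each range.
    Ranges and grid dimensions (10 x 33 = 330) are illustrative guesses."""
    z1 = np.linspace(hz_to_bark(f1_range[0]), hz_to_bark(f1_range[1]), n1)
    z2 = np.linspace(hz_to_bark(f2_range[0]), hz_to_bark(f2_range[1]), n2)
    return [(bark_to_hz(a), bark_to_hz(b)) for a in z1 for b in z2]

grid = formant_grid()
print(len(grid))                              # 330 stimuli
print(tuple(round(f) for f in grid[0]))       # lowest-F1, lowest-F2 corner
```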
Fig. 2. Perceptual results from Johnson et al. [1993] compared with results of the present experiment. Average formant frequencies for the corner vowels /i/, /u/, /æ/ and /ɑ/ are shown. Data from Johnson et al. [1993] are labeled V: and are plotted with filled circles. Data from the naive listeners in the present experiment are labeled hVd and are plotted with open circles. The points in each condition are connected by a spline function for illustrative purposes.
Results
Figure 2 shows a comparison of the average median formant values chosen by naive listeners in this experiment and comparable data from Johnson et al. [1993]. (Each listener's formant choices for a word were taken as the median of his/her five trials for that word.) There were several differences between the earlier study and this one: listeners in Johnson et al. [1993] were from California while the naive listeners in this study were from Ohio; the stimuli in this experiment were synthesized with an LPC residual voice source while the stimuli in Johnson et al. [1993] used a purely synthetic voice source; and their stimuli were steady-state vowels with no consonant context while the stimuli for this experiment had /h/ frication at the onset, /d/ offset formant transitions, and a /d/ closure interval and release burst. In view of these differences, it is quite remarkable that the formant values for /i/ and /u/ chosen by listeners in the two studies are nearly identical. For these vowels, the differences between the earlier study and this one seem to have had no effect. The formant values chosen for /æ/ and /ɑ/ were less extreme in the current study than they were in Johnson et al. [1993]. The most sensible interpretation of this difference is that the lack of consonantal context in Johnson et al. [1993] led to perceptual overshoot [Lindblom and Studdert-Kennedy, 1967]: listeners seem to expect low vowels in hVd context to have less extreme formant frequencies than they do in isolation. This interpretation is motivated by the lack of an overall effect, as might be expected from the voice-source difference, and the lack of substantial dialectal differences between California and central Ohio for /æ/ and /ɑ/. Figure 3 shows average formant values for speaker K.J. compared with the average perceptual results for the naive listeners, colleagues, and students. As in Johnson et al. [1993], naive listeners chose F1 and F2 values that encompass a larger region of the acoustic vowel space than is found for carefully articulated list-reading production. That this vowel space expansion was obtained despite the effect of consonantal context (fig. 2) suggests that perceptual overshoot with the isolated vowel stimuli in Johnson et al. [1993] did not exaggerate the apparent magnitude of the hyperspace effect very much.
Fig. 3. Perceptual results for the three groups of listeners compared with average vowel formant frequencies for talker K.J. Results from the corner vowels /i/, /u/, /æ/ and /ɑ/ are shown. a Average formant frequencies for K.J. are compared with the average formant frequencies chosen by naive listeners and students. b Average formant frequencies for K.J. compared with average formant frequencies chosen by naive listeners and colleagues. The points within each condition are connected by a spline function for illustrative purposes.
The perceptual data were submitted to two analyses of variance with the factors stimulus word ('heed', 'had', 'hod', or 'who'd') and listener group (naive, colleague, student). In the analysis of the F1 data, the only reliable effect was that of stimulus word [F(3, 76) = 309.5, p < 0.01]. In the analysis of the F2 data there was also a main effect of word [F(3, 76) = 487, p < 0.01]. There was, in addition, a main effect of listener group [F(2, 76) = 5.0, p < 0.01], but the interaction between group and word did not reach significance [F(6, 76) = 1.36, p = 0.24]. Post-hoc comparisons of the groups are mentioned below. As with the naive listeners, the results shown in figure 3a suggest that the hyperspace effect also occurred in the responses of the students. However, for these listeners, familiarity seems to have reduced the effect for the front vowels /i/ and /æ/. The formant values chosen by students for /i/ and /æ/ were more similar to those produced by K.J. than were the formants chosen by naive listeners. Pairwise comparisons of F2 means, using Fisher's least significant difference test, found that naive listeners and students differed for both /i/ and /æ/. It is important to note, however, that the students knew the purpose of the experiment and the hypothesized role of talker familiarity in reducing the hyperspace effect, so their reduction of the hyperspace effect may have been caused by demand characteristics. Interestingly, even with such a bias against producing the hyperspace effect, these listeners chose relatively peripheral vowel formant frequencies. Figure 3b compares measured vowel formants in the speech of the talker with the average formant frequencies chosen by naive listeners and colleagues.
Table 1. Median ratings of colleagues' familiarity with the voice of the talker and their judgment of the success of the synthetic stimuli in mimicking the talker's voice

              Median   Range
Familiarity   2        1–3
Synthesis     3        2–7

1 = highest rating; 7 = lowest rating.
As with the students, the hyperspace effect was found in the responses of the colleagues, and there is a small reduction of the hyperspace effect relative to the values chosen by naive listeners, but only for /i/. Pairwise comparisons of F2 means, using Fisher's least significant difference test, found that naive listeners and colleagues differed for /i/ but not /æ/. This indicates that demand characteristics probably did play a role in the students' responses, but the general pattern of results for the two groups is similar. The colleagues were also asked to rate their familiarity with the talker and to evaluate the success of the synthetic tokens in mimicking his voice. These ratings are shown in table 1. The rating data show that this group of listeners felt that they were familiar with the voice of the talker, but that the synthetic tokens were not particularly good examples of his speech.
Discussion
The results of this experiment confirm that the hyperspace effect is very robust. Listeners chose vowels that defined a large acoustic vowel space even though the vowels appeared in a /hVd/ context and, for two groups of listeners, even though they were familiar with the voice of the talker. As predicted, however, both consonant context and talker familiarity affected listeners' choices. In /hVd/ context the hyperspace effect was reduced for low vowels.1 The effect of talker familiarity was to reduce the hyperspace effect for front vowels, where naive listeners deviated most from the talker's vowel space. One weakness of the experiment is that the synthetic tokens were not very convincing exemplars of the talker's voice for some of the listeners. The general result from this study is that vowels that sound 'right' to listeners tend to be more disperse in the acoustic vowel space than natural productions, but listeners' choices are modulated by consonant context and familiarity with the talker. The perceptual preference, found in Johnson et al. [1993] and in this study, for a large acoustic vowel space supports Lindblom's theory of adaptive dispersion.
1 One reviewer suggests that further reduction of the hyperspace effect might be found with variable consonant contexts, because then listeners cannot make a more or less constant adjustment for the context. This may well be true but needs further study.
Most evidence for adaptive dispersion is indirect, inferring that listeners must prefer maximal contrast because languages tend to obey such a rule more often than they violate it. The robustness of the hyperspace effect is important because it is direct psycholinguistic support for adaptive dispersion: listeners do prefer maximal contrast. The modulating effects produced by consonant context and talker familiarity present an interesting theoretical challenge. How are these modulating factors encoded in the perceptual system? Two possibilities suggest themselves for further research. First, the effects of consonant context and talker variation could be due to processes of cue interaction or cue weighting that operate to determine listener expectations and/or preferences in the vowel selection task. This process-based account could be called procedural encoding, because talker and consonant variation information is encoded or stored in the recognition processes. A second possibility is that speech perception is based on exemplar storage [Johnson, 1997]. In this representation-based account, the perceptual system stores consonant and talker variation in rich exemplar-based category representations. Consonant context might be procedurally encoded because it is largely rule-governed, but talker familiarity effects probably require rich exemplar-based representations to the extent that talker variation is arbitrary [Johnson et al., 1999].
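To make the representation-based alternative concrete, here is a toy exemplar classifier. It is only a sketch in the spirit of exemplar models, not Johnson's [1997] model itself: the exponential similarity function, the sensitivity parameter c, and the (F1, F2) feature coding are assumptions chosen for illustration.

```python
import numpy as np

class ExemplarModel:
    """Sketch of an exemplar-based account: every experienced token is
    stored with its category label, and a new sound is classified by its
    summed similarity to the stored exemplars."""

    def __init__(self, c=0.01):
        self.c = c                       # sensitivity of similarity decay
        self.exemplars, self.labels = [], []

    def store(self, features, label):
        self.exemplars.append(np.asarray(features, dtype=float))
        self.labels.append(label)

    def classify(self, features):
        x = np.asarray(features, dtype=float)
        sims = {}
        for ex, lab in zip(self.exemplars, self.labels):
            # Similarity decays exponentially with distance in feature space.
            sims[lab] = sims.get(lab, 0.0) + np.exp(-self.c * np.linalg.norm(x - ex))
        return max(sims, key=sims.get)

# Toy usage with (F1, F2) exemplars. Talker variation is just more stored
# tokens, so growing familiarity with a talker reshapes category structure.
m = ExemplarModel()
m.store((300, 2300), '/i/'); m.store((320, 2250), '/i/')
m.store((700, 1200), '/a/'); m.store((680, 1150), '/a/')
print(m.classify((350, 2100)))   # -> '/i/'
```

Because every experienced token is retained, hearing more tokens from a particular talker automatically reshapes the similarity landscape; on this view, familiarity effects fall out of the stored distribution rather than from an added mechanism.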
References

Bradlow, A.R.: A comparative acoustic study of English and Spanish vowels. J. acoust. Soc. Am. 97: 1916–1924 (1995).
Bradlow, A.R.: A perceptual comparison of the /i/-/e/ and /u/-/o/ contrasts in English and Spanish: universal and language-specific aspects. Phonetica 53: 55–85 (1996).
Creelman, C.D.: Case of the unknown talker. J. acoust. Soc. Am. 29: 655 (1957).
Green, K.P.; Tomiak, G.R.; Kuhl, P.K.: The encoding of rate and talker information during phonetic perception. Percept. Psychophys. 59: 675–692 (1997).
Halle, M.: The sound pattern of Russian (Mouton, The Hague 1959).
Johnson, K.: Speech perception without speaker normalization: an exemplar model; in Johnson, Mullennix, Talker variability in speech processing, pp. 145–166 (Academic Press, New York 1997).
Johnson, K.; Flemming, E.; Wright, R.: The hyperspace effect: phonetic targets are hyperarticulated. Language 69: 505–528 (1993).
Johnson, K.; Strand, E.A.; D'Imperio, M.: Auditory-visual integration of talker gender in vowel perception. J. Phonet. 27: 359–384 (1999).
Liljencrants, J.; Lindblom, B.: Numerical simulation of vowel quality systems: the role of perceptual contrast. Language 48: 839–862 (1972).
Lindblom, B.: Spectrographic study of vowel reduction. J. acoust. Soc. Am. 35: 1773–1781 (1963).
Lindblom, B.: Explaining phonetic variation: a sketch of the H&H theory; in Hardcastle, Marchal, Speech production and speech modeling, pp. 403–439 (Kluwer, Dordrecht 1990).
Lindblom, B.; Engstrand, O.: In what sense is speech quantal? J. Phonet. 17: 107–121 (1989).
Lindblom, B.; Studdert-Kennedy, M.: On the role of formant transitions in vowel recognition. J. acoust. Soc. Am. 42: 830–843 (1967).
Mullennix, J.W.; Pisoni, D.B.: Stimulus variability and processing dependencies in speech perception. Percept. Psychophys. 47: 379–390 (1990).
Mullennix, J.W.; Pisoni, D.B.; Martin, C.S.: Some effects of talker variability on spoken word recognition. J. acoust. Soc. Am. 85: 365–378 (1989).
Nygaard, L.C.; Sommers, M.S.; Pisoni, D.B.: Speech perception as a talker-contingent process. Psychol. Sci. 5: 42–45 (1994).
Padgett, J.: Contrast dispersion and Russian palatalization; in Hume, Johnson, The role of speech perception phenomena in phonology (Academic Press, New York, in press).
Stevens, K.N.; House, A.S.: Perturbation of vowel articulations by consonantal context: an acoustical study. J. Speech Hear. Res. 6: 111–128 (1963).
Walker, S.; Bruce, V.; O'Malley, C.: Facial identity and facial speech processing: familiar faces and voices in the McGurk effect. Percept. Psychophys. 59: 1124–1133 (1995).
Perceptual Processing Phonetica 2000;57:189–196
Received: November 2, 1999 Accepted: January 30, 2000
Language Acquisition as Complex Category Formation
Andrew J. Lotto
Loyola University Chicago, Ill., USA
Abstract
Purported units of speech, e.g. phonemes or features, are essentially categories. The assignment of phonemic (or phonetic) identity is a process of categorization: potentially discriminable speech sounds are treated in an equivalent manner. Unfortunately, the extensive literature on human categorization has typically focused on simple visual categories that are defined by the presence or absence of discrete features. Speech categories are much more complex: they are often defined by continuous values across a variety of imperfectly valid features. In this paper, several kinds of categories are distinguished, and studies using human subjects, animal subjects and computational models are presented that endeavor to describe the structure and development of the sort of complex categories underlying speech perception. Copyright © 2000 S. Karger AG, Basel
There appears to be a tendency in work on speech communication to presume that the form of speech (and language) is a given and that one must hypothesize elaborate mental processes that can accommodate the complexities of this divinely ordained communication system. For example, Chomsky [1957, 1965] endowed children with a specialized Language Acquisition Device to allow them to become competent in the complex recursive rules of language and to discover the underlying structure in the hopelessly impoverished speech of their parents. Liberman and Mattingly [1985] proposed a specialized speech-perception module to help the poor listener deal with the 'lack of invariance' problem with which they are unfortunately saddled because of the variability inherent in speech. As an alternative to this viewpoint, one can presume that the general perceptual and cognitive processes of humans are the givens and that the specific form of our communication system evolved to take advantage of the specific operating characteristics of our cognitive system. I refer to this view as the General Auditory and Learning Approach (GALA) [Lotto, 1996]. It is probably best exemplified in the work of Lindblom [e.g. Liljencrants and Lindblom, 1972; Lindblom et al., 1983; Lindblom, 1986] and Ohala [e.g. 1974, 1999]. GALA is founded on the notion that the development of particular linguistic systems and of speech as a communication system is constrained by our general inherited cognitive systems and by properties of speech production. In this
approach, the characteristics of speech and of particular linguistic systems offer opportunities to study mechanisms of perception and cognition. A good example of GALA is Björn Lindblom's [1986] attempt to predict typical vowel inventories by computing auditory distinctiveness. The accuracy of these predictions is enhanced by more detailed information about the operating characteristics of the peripheral auditory system. Thus, we see the fingerprints of the auditory system on the content of linguistic sound systems. From these results, it is certainly reasonable to suggest that the contents of vowel systems are constrained in part by a goal of sufficient auditory distinctiveness, where this metric is a function of the capabilities of the auditory system. This is one example of the many successes of explaining structure and function in speech communication by appealing to constraints of the auditory and articulatory systems. Because of this productive line of work, researchers who eschew the notion of specialized speech mechanisms have sometimes been called 'auditorists' [e.g. Nearey, 1997]. However, this term refers to only one half of GALA. There has always been an assumption in the work of the 'auditorists' that learning plays an important role in accounting for speech behavior. The auditory system provides a representation of the acoustics of a sound, but it is through general learning processes that a listener develops a representation that is useful for communication. Unfortunately, empirical work on this learning component has been lacking relative to the auditory component. In particular, it is not clear how the characteristics of our general learning processes constrain the form of speech [though, see Lindblom et al., 1983; Nearey, 1997]. What would the fingerprints of the learning system look like? To begin to formulate an answer, let's look at the task for the language learner.
Language Acquisition
The task for infants learning their first language or adults learning a second language appears daunting. Some of the variance in the acoustic input that they receive is directly relevant to the intended message of the speaker. Other variance in the input, however, is the result of extralinguistic influences such as the particular structure of the speaker's vocal tract. The language learner must parse the input variance to discriminate those contrasts that carry information and to ignore variation within a contrast that is due to speaker characteristics, coarticulation, articulatory undershoot, etc. Complicating this task is the fact that the language learner must do this in a language-appropriate manner. Languages utilize some subset of over 800 attested sounds as phonemes, and this subset can range from 11 to 141 phonemes [Maddieson, 1984]. As a result of this diversity, variance that is extralinguistic in one language community may be pivotal for discovering the intended message of a speaker in another language environment. Thus, the task for the language learner is to discriminate some of the acoustic variance and to treat the remaining, potentially discriminable, variance as functionally equivalent. In other words, the language learner must create auditory categories that map the linguistically relevant distinctions for the particular language he is attempting to learn. Categorization is a general perceptual process: much of our perceptual behavior entails treating potentially discriminable stimuli as equivalent. Thus, we may be able to predict the constraints of learning on speech by understanding the operating characteristics of our general categorization processes.
Unfortunately, the empirical work on perceptual categorization has focused primarily on visual categories that are defined by the presence or absence of discrete cues. In contrast, speech sound categories (phonemes) are auditory and are defined by values across a number of imperfectly valid continuous cues. In order to describe language acquisition as categorization, and to show the effect of categorization on the structure of speech, we need to understand the processes of complex auditory category formation. My colleagues and I have begun to define these constructs of learning and categorization empirically, in order to supplement our understanding of the roles of audition and learning in GALA.
Functionally Defined Categories
In order to study language acquisition in speech, we need to develop stimulus sets that contain the same degree of complexity as speech sound categories. Simple tones or clicks will not do. My students and I have tried to create a complex auditory stimulus set that is easily manipulated and that will not be identifiable as speech or as any other environmental sound. We believe that we have a set of stimuli that fulfills these requirements. The stimuli are sculpted from 300-ms bursts of white (Gaussian) noise. White noise, as opposed to single tones or collections of several tones, has energy across the range of frequencies; in this way, it is like speech. Using digital signal processing, three attributes are added to the white noise bursts. First, a linear ramp in amplitude defines the onset of the stimulus. This ramp varies in duration (from 10 to 100 ms), which gives the stimuli different degrees of 'attack'. The other two attributes are frequency 'notches' in the noise created by band-stop filters. One 300-Hz-wide notch (where energy is greatly attenuated) varies in low-frequency cutoff from 400 to 850 Hz. The second 300-Hz-wide notch varies from 2,200 to 3,100 Hz. These notches, or gaps, are referred to as NF1 and NF2, respectively (for negative first formant and negative second formant, to make obvious the analogy to vowel stimuli; the labels serve a mnemonic purpose despite being an abuse of the term 'formant'). After adding these attributes, the resulting noises sound nothing like speech, but they have a desirable amount of complexity. The three attributes can vary independently over distinct continuous ranges, and arbitrary categories can be created across any combination of the three attributes. We had 5 adults learn two categories constructed from this stimulus set across 10 one-hour sessions. Category A was described by onsets shorter than 50 ms, NF1s lower than 600 Hz, and NF2s greater than 2,800 Hz; category B was the complement of this set. A stimulus belonged to A if at least two of the three attributes fell within the ranges describing category A (e.g. onset = 20 ms, NF1 = 450 Hz, and NF2 = 2,500 Hz). As compared to typical procedures in categorization studies, this is an extremely complicated categorization task. The categories are defined on attributes that vary continuously, and no single attribute is either necessary or sufficient to define the category. Listeners must integrate across all attributes, and they must do it quickly because they hear the stimulus presented on each trial only once. These characteristics make the task very much like that of learning a speech sound category. Subjects were presented with each sound and asked to press a button labeled A or B. After responding, they received feedback in the form of a light appearing above the correct response button. The question was whether humans could learn to perform this complicated task. A sketch of the stimulus construction and the category rule is given below.
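The following sketch shows one way to realize these stimuli and the two-out-of-three category rule. It is an illustration, not the lab's actual synthesis code: the sampling rate, the Butterworth filter family, and the filter order are all assumptions, since the text specifies only 'band-stop filters'.

```python
import numpy as np
from scipy.signal import butter, sosfilt

SR = 22050  # sampling rate in Hz -- assumed; not stated in the text

def make_stimulus(onset_ms, nf1_low, nf2_low, dur_ms=300,
                  rng=np.random.default_rng(0)):
    """White-noise burst with a linear onset ramp and two 300-Hz-wide
    spectral notches (NF1, NF2), per the description above."""
    n = int(SR * dur_ms / 1000)
    x = rng.standard_normal(n)                    # white (Gaussian) noise
    n_on = int(SR * onset_ms / 1000)
    ramp = np.ones(n)
    ramp[:n_on] = np.linspace(0.0, 1.0, n_on)     # linear 'attack' ramp
    x *= ramp
    for lo in (nf1_low, nf2_low):                 # carve out the two notches
        sos = butter(4, [lo, lo + 300], btype='bandstop', fs=SR, output='sos')
        x = sosfilt(sos, x)
    return x

def category(onset_ms, nf1_low, nf2_low):
    """Category A if at least two of the three attributes fall in A's range."""
    votes = (onset_ms < 50) + (nf1_low < 600) + (nf2_low > 2800)
    return 'A' if votes >= 2 else 'B'

print(category(20, 450, 2500))   # the example in the text -> 'A'
```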
Fig. 1. Percent of category A responses on the 1st day and the 10th day of training. Filled bars are correct responses (i.e. the sound did come from category A). Unfilled bars are incorrect responses (i.e. the sound actually was a member of category B).
Despite subjects' concerns about the difficulty of the task, they were able to learn the categories to some extent after less than 10 h. Figure 1 shows the percent of A responses after the 1st day of training and after the 10th day of training. These results demonstrate that auditory categories with the complexity of speech can be learned by humans. These novel categories have many similarities to speech sound categories. In particular, these nonspeech categories 'suffer' from the problem of lack of invariance: none of the three attributes was either necessary or sufficient to define the category. Yet, subjects were able to learn the categories simply by using general categorization processes. This is similar to the demonstration that birds can learn to correctly identify (categorize) syllables starting with /d/ despite the lack of a single defining cue [Kluender et al., 1987]. In this experiment, the categories were defined completely by the response required for the stimuli (as indicated by feedback). That is, stimuli that required an equivalent response were considered members of the same category. I refer to these as functionally defined categories. Speech sound categories are certainly functionally defined to some degree. That is, sounds are grouped into a phonemic category because they are functionally equivalent when it comes to lexical access. This kind of functional-equivalence category is similar to traditional definitions of linguistic categories. For example, phonemes are often defined in terms such as: '… a family of uttered sounds… in a particular language which count for practical purposes as if they were one and the same…' [Jones, 1967, p. 258]. However, there is also systematic variance in the input distributions of speech sounds that can serve as information for a language learner about the definition of phonemic categories. For example, the distributions of voice onset time for voiced and voiceless consonants in various languages are Gaussian-like in shape, with exemplars in the middle of a category having higher frequencies of occurrence than exemplars near the boundaries between categories [Lisker and
This statistical information could be used by language learners to parse the space of voice onset time into phonemic categories. In addition, correlations between features can provide information about categories, as correlations tend to be higher within a natural category than between categories. Statistical information must be playing some role in language acquisition, because infants show evidence of native-language phonemic categorization at 6 months of age, before a lexicon is established to define functional equivalence [Kuhl et al., 1992]. The input distributions for the functionally defined nonspeech categories presented to our subjects in the experiment described above were rectangular, with no correlational structure. This may be why our subjects found these categories rather difficult to learn. What happens if we add statistical information? Does this change the type of learning involved?
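The idea that distributional shape alone could delimit voicing categories is easy to demonstrate with a mixture model. The sketch below uses invented VOT samples (the means and spreads are placeholders, not figures from Lisker and Abramson [1964]) and an off-the-shelf two-component Gaussian mixture.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Illustrative VOT samples in ms: short-lag (voiced) and long-lag (voiceless)
# clusters with made-up means and standard deviations.
vot = np.concatenate([rng.normal(10, 5, 500),
                      rng.normal(70, 15, 500)])

# An unsupervised learner can parse the VOT dimension from distributional
# shape alone: fit a two-component mixture and read off the category means.
gmm = GaussianMixture(n_components=2, random_state=0).fit(vot.reshape(-1, 1))
print(sorted(gmm.means_.ravel()))  # recovers modes near 10 and 70 ms
```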
Functional-Statistical Category
As mentioned above, speech-sound categories are good examples of categories that are functionally and statistically defined. Of course, it is difficult to study the formation of speech-sound categories in humans because it is difficult to ethically control input or to get a valid measurement of what input a child or second-language learner actually receives. However, one can control precisely the input to a nonhuman animal and, thus, study the response structures arising from functional-statistical category formation (with the presumption that animal and human perceptual categorization processes are fundamentally similar).

My colleagues at the University of Wisconsin and I have run such a study [Kluender et al., 1998]. Birds (European starlings, Sturnus vulgaris) were trained to peck in response to exemplars from one vowel category (e.g. /i/) and to refrain from pecking when presented with exemplars from a second category (e.g. /R/). The exemplars were chosen from stylized distributions and varied in first (F1) and second (F2) formant frequencies. The birds were reinforced for pecking to exemplars from one of these distributions (the positive response category was randomly assigned to each bird). That is, the categories were defined functionally; all sounds in the /i/ category were to be responded to equivalently (by pecking the key). In addition, there was statistical information about category boundaries in the input distributions during training. The vowel distributions were nonoverlapping. The area around the centroid of each distribution (in F1 × F2 space) was more densely sampled (though the true centroid of the distribution was not presented during training), and fewer exemplars were sampled from the boundaries of each distribution. Thus, one could detect the categories by differentiating input density across F1 × F2 space. Could this information affect avian categorization of vowel distributions?

After less than 100 h of exposure to this task, the birds were tested on the categories. They demonstrated excellent categorical behavior. Peck rates to positive vowels (e.g. /i/) were substantially higher than pecks to negative vowels (e.g. /R/). The birds easily generalized their responses to novel exemplars from the categories (e.g. pecking to an /i/ exemplar with F1 and F2 values that were never presented before). Because of control conditions within the experimental design, we could determine that birds’ responses were due solely to their experience with the sounds and reinforcement conditions. Certain pairs of stimuli were part of a single category for some birds, but straddled two categories for other birds. Starlings for whom the pair members fell within
distinct categories pecked differentially to the pair members, indicating that they were able to discriminate the two. Birds that had been trained to equate the exact same stimuli responded to them equivalently.

The birds also showed a gradient in their response structures. They pecked far more vigorously to positive stimuli that were maximally separated in the F1 × F2 space from negative stimuli (e.g. stimuli with high F2 and low F1 if the positive vowel was /i/) than to stimuli that were near the boundary with the other category. This gradient in response may be a general consequence of learning a functional equivalence class. Classic work in discrimination learning describes results in which responses to a positive stimulus tend to strengthen as one moves away from the negative stimulus [e.g. Spence, 1936, 1937, 1952, 1960; Hanson, 1959; Mackintosh, 1995]. Thus, a response gradient may be a fingerprint of the processes involved in forming functionally defined perceptual categories. Do we see this gradient in human speech categories? Additional data from this project demonstrate that a similar structure is apparent in responses of humans judging the representativeness or ‘goodness’ of vowels [Kluender et al., 1998]. We presented human adults with the same vowel distributions that were presented to the birds. Human listeners judged the stimuli in terms of ‘goodness’ as exemplars of the English vowels /i/ and /R/. The responses of the adults showed a gradient similar to that exhibited by the birds. The best /i/ exemplars were judged to be those furthest from the /R/ distribution (i.e. low F1 and high F2). Interestingly, these exemplars of /i/ would be very rare in natural speech, because vowels are often reduced (moved away from the extremes of the F1 × F2 space) in normal speaking contexts [Lindblom, 1963; Johnson et al., 1993]. Other researchers have found similar gradients in human perceptual responses to vowels [Johnson et al., 1993; Aaltonen et al., 1997; Lively and Pisoni, 1997]. In all cases, listeners appear to prefer vowels that are maximally distinguished from competing vowel categories. Thus, this gradient may be a nice example of the fingerprint of general mechanisms underlying learning of functionally defined categories.

Besides this gradient, there was a second salient feature present in the structure of birds’ pecking responses. Birds tended to peck more to the exemplar at the centroid of the positive distribution than to more remote exemplars. (This resulted when averaging across all stimulus tokens of the distribution. On a token-by-token basis, the highest peck rates were not at the centroid, but at the extremes of the space; this latter finding defines the gradient as described above.) This is despite the fact that the birds had never experienced the centroid stimulus during training. This pattern of responses is similar to patterns that have been used as justification for prototype models of classification in the general categorization literature [Posner and Keele, 1968; Rosch, 1973, 1988]. This prototype structure is interesting because it demonstrates that the birds learned something about the structure of the input distributions. The centroid stimulus came from the most densely sampled region (in F1 × F2 space) of the input distribution. Birds’ responses reflected this statistical fact of the distributions, although it was not necessary for the birds to learn about the statistical structure of the input to perform the task correctly.
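The design of such training sets can be sketched as follows. Every numerical value here is an illustrative stand-in, not a parameter from Kluender et al. [1998]; only the qualitative structure (denser sampling near each category centroid, sparser sampling toward the edges, the true centroid withheld) comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_category(centroid, n=200, spread=60.0, exclusion=10.0):
    """Draw training exemplars in F1 x F2 space around a category centroid.
    A Gaussian cloud gives the dense middle and sparse edges described in the
    text; tokens within `exclusion` Hz of the centroid are withheld, so the
    centroid itself never appears during training."""
    pts = rng.normal(centroid, spread, size=(n, 2))
    dist = np.linalg.norm(pts - centroid, axis=1)
    return pts[dist > exclusion]

# An /i/-like region (F1, F2 in Hz; values chosen only for illustration).
ii = sample_category(np.array([300.0, 2300.0]))
print(len(ii), ii.mean(axis=0))  # mean lands near the withheld centroid
```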
The functional equivalence classes that the birds were to learn could be defined by a simple (linear) boundary in stimulus space. Nonetheless, there was evidence that birds’ responses were affected by statistics of the input structure; the birds seemed to pick up statistical information about the task incidentally. This is consonant with many recent findings demonstrating an amazing ability for adult and infant humans to learn statistical information (e.g. transitional probabilities) in auditory
streams, even when the information is unrelated to any current task [Saffran et al., 1996, 1997, 1999]. Thus, it may be that the prototype in the response structure is a fingerprint of the general processes underlying the formation of statistically defined categories. Do we see anything like this in human speech sound categories? Yes. Similar prototype structures were also present in the ‘goodness’ ratings of our human adult subjects for the /R/ vowel distributions used in the avian learning study. In fact, the agreement between the birds’ responses and the humans’ judgments was quite amazing. The correlation coefficient across vowels was r = 0.99, and the average r within any particular vowel distribution was about 0.7. Both humans and birds demonstrated gradients (indicative of functionally defined categories) and prototypes (indicative of statistically defined categories). We were also able to model these response structures fairly successfully with a simple linear associator (conceptualized as a neural network). Together these data suggest that general categorization processes may play a role in the development and maintenance of phonetic categories.
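As a rough illustration of what such a linear associator involves, the sketch below fits one-layer least-squares weights to two invented exemplar clouds in a normalized F1 × F2 space. The graded output it produces for probes approaching the category boundary mirrors the response gradient described above; none of the numbers are from the actual model.

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented exemplars: category +1 centered at one corner of the normalized
# space, category -1 at the other (stand-ins for the vowel distributions).
pos = rng.normal([0.25, 0.75], 0.08, size=(100, 2))
neg = rng.normal([0.75, 0.25], 0.08, size=(100, 2))
X = np.vstack([pos, neg])
X = np.column_stack([X, np.ones(len(X))])      # append a bias term
y = np.concatenate([np.ones(100), -np.ones(100)])

# Least-squares weights for a one-layer linear associator.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Output grows with distance from the category boundary, echoing the
# graded peck rates and 'goodness' judgments described in the text.
for probe in ([0.1, 0.9], [0.3, 0.7], [0.45, 0.55]):
    print(probe, float(np.array([*probe, 1.0]) @ w))
```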
Conclusions
These experiments serve as initial attempts to discover the kinds of effects that general learning processes may have on the form of speech categories. They demonstrate that prototype effects and solutions to the lack-of-invariance problem may be delivered by general processes of categorization. Additionally, recent work on computational models of learning suggests that decreased intracategory discrimination, or categorical perception, is an expected result of any categorization process [Damper and Harnad, in press]. One may note that lack of invariance, prototypes, and categorical perception are all concepts that at one time or another have been proposed as evidence for specialized mechanisms for speech perception. Research on general categorization processes offers hope of a parsimonious explanation for all these phenomena. With continuing work on the systems of learning and audition, GALA now holds promise as a coherent and integrated framework for understanding structure and function in speech.
References
Aaltonen, O.; Eerola, O.; Hellström, Å.; Uusipaikka, E.; Heikki Lang, A.: Perceptual magnet effect in the light of behavioral and psychophysiological data. J. acoust. Soc. Am. 101: 1090–1105 (1997).
Chomsky, N.A.: Syntactic structures (Mouton, The Hague 1957).
Chomsky, N.A.: Aspects of the theory of syntax (MIT Press, Cambridge 1965).
Damper, R.I.; Harnad, S.R.: Neural network models of categorical perception. Percept. Psychophys. (in press).
Hanson, H.M.: Effects of discrimination training on stimulus generalization. J. exp. Psychol. 58: 321–372 (1959).
Johnson, K.; Flemming, E.; Wright, R.: The hyperspace effect: phonetic targets are hyperarticulated. Language 69: 505–528 (1993).
Jones, D.J.: The phoneme (Cambridge University Press, Cambridge 1967).
Kluender, K.R.; Diehl, R.L.; Killeen, P.R.: Japanese quail can learn phonetic categories. Science 237: 1195–1197 (1987).
Kluender, K.R.; Lotto, A.J.; Holt, L.L.; Bloedel, S.B.: Role of experience in language-specific functional mappings for vowel sounds as inferred from human, nonhuman, and computational models. J. acoust. Soc. Am. 104: 3568–3582 (1998).
Kuhl, P.K.; Williams, K.A.; Lacerda, F.; Stevens, K.N.; Lindblom, B.: Linguistic experience alters phonetic perception in infants by 6 months of age. Science 255: 606–608 (1992).
Liberman, A.M.; Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21: 1–36 (1985).
Liljencrants, J.; Lindblom, B.: Numerical simulation of vowel quality systems: the role of perceptual contrast. Language 48: 839–862 (1972).
Lindblom, B.: Spectrographic study of vowel reduction. J. acoust. Soc. Am. 35: 1773–1781 (1963).
Lindblom, B.: Phonetic universals in vowel systems; in Ohala, Jaeger, Experimental phonology (Academic Press, Orlando 1986).
Lindblom, B.; MacNeilage, P.; Studdert-Kennedy, M.: Self-organizing processing and the explanation of phonological universals; in Butterworth, Comrie, Dahl, Explanations of the phonetic universals (Mouton, The Hague 1983).
Lisker, L.; Abramson, A.S.: A cross-language study of voicing in initial stops: acoustical measurements. Word 20: 384–422 (1964).
Lively, S.E.; Pisoni, D.B.: On prototypes and phonetic categories: a critical assessment of the perceptual magnet effect in speech perception. J. exp. Psychol. hum. Perception Performance 23: 1665–1679 (1997).
Lotto, A.J.: General auditory constraints in speech perception: the case of perceptual contrast; PhD diss. University of Wisconsin-Madison (1996).
Mackintosh, N.J.: Categorization by people and pigeons: the twenty-second Bartlett Memorial Lecture. Q. Jl. exp. Psychol. 48: 193–214 (1995).
Maddieson, I.: Patterns of sound (Cambridge University Press, Cambridge 1984).
Nearey, T.M.: Speech perception as pattern recognition. J. acoust. Soc. Am. 101: 3241–3254 (1997).
Ohala, J.J.: Experimental historical phonology; in Anderson, Jones, Historical linguistics II: theory and description in phonology (North Holland, Amsterdam 1974).
Ohala, J.J.: Acoustic-auditory aspects of speech. 35th Regional Meet. Chicago Ling. Soc., Univ. Chicago, April 1999.
Posner, M.I.; Keele, S.W.: On the genesis of abstract ideas. J. exp. Psychol. 77: 353–363 (1968).
Rosch, E.: Principles of categorization; in Collins, Smith, Readings in cognitive science: a perspective from psychology and artificial intelligence (Morgan Kaufmann, San Mateo 1988).
Rosch, E.H.: Natural categories. Cognitive Psychol. 4: 328–350 (1973).
Saffran, J.R.; Aslin, R.N.; Newport, E.L.: Statistical learning by 8-month-olds. Science 274: 1926–1928 (1996).
Saffran, J.R.; Johnson, E.K.; Aslin, R.N.; Newport, E.L.: Statistical learning of tone sequences by human infants and adults. Cognition 70: 27–52 (1999).
Saffran, J.R.; Newport, E.L.; Aslin, R.N.; Tunick, R.A.; Barrueco, S.: Incidental language learning: listening (and learning) out of the corner of your ear. Psychol. Sci. 8: 101–105 (1997).
Spence, K.W.: The nature of discrimination learning in animals. Psychol. Rev. 43: 427–449 (1936).
Spence, K.W.: The differential response in animals to stimuli varying within a single dimension. Psychol. Rev. 44: 430–444 (1937).
Spence, K.W.: The nature of the response in discrimination learning. Psychol. Rev. 59: 89–93 (1952).
Spence, K.W.: Behavior theory and learning (Prentice-Hall, Englewood Cliffs 1960).
Biology of Communication and Motor Processes Phonetica 2000;57:197–204
Received: October 31, 1999 Accepted: February 3, 2000
Singing Birds, Playing Cats, and Babbling Babies: Why Do They Do It? Sverre Sjölander Department of Biology, University of Linköping, Sweden
Abstract
Animals rarely do what they do because they are aware of the function of the behaviour or its outcome. Instead, they will very often perform behaviour out of context, spontaneously, as play. The impression (strengthened by introspection in the human species) is that they do it because they get some kind of internal reward. Nevertheless, such seemingly meaningless behaviour may have an ultimate function: to adjust behavioural programs to the body, to practice, to perfect the execution of the behaviour. Even if the proximate reason for doing what the animal does is to attain a pleasurable state, the ultimate, evolutionary reason may still be that increased practice will give some gain in fitness. If one presupposes internal rewarding and punishing systems as intervening factors, it becomes much simpler to explain why birds sing, kittens play or babies babble without any outer reward and out of any functional context, more than needed from a strictly functional view, spontaneously and just for the fun of it. Copyright © 2000 S. Karger AG, Basel
Niko Tinbergen [1951] was one of the first to call attention to the many different ways to answer the question why a bird sings: physiological, mental, evolutionary, etc. His main purpose was to elucidate the difference between ultimate and proximate explanations of biological phenomena. Whereas the singing of a particular blackbird here and now may be accounted for by referring to physiological, neural, hormonal, or other causal factors working at the individual level (i.e. by providing a proximate explanation), the reason why the blackbird as a species sings is better sought in ultimate evolutionary explanations, in terms of the greater inclusive reproductive success of birds endowed with the capacity to sing, in comparison with non-singing ones.

The early science of ethology did not shy away from the explanation that birds may sing simply ‘for the fun of it’, or because not singing makes them feel ‘uncomfortable’. After all, pioneers like Konrad Lorenz, Niko Tinbergen, and Oskar Heinroth worked in an intellectual atmosphere that was strongly influenced by Freud’s ideas, and where the notion of drives, Lust and Unlust, permeated the discussion in many different fields. The adherence to such concepts is evident, for instance, in the following
apposite remark by Heinroth, often quoted by Lorenz [see Koenig, 1988]: ‘Animals are emotional people with very limited understanding’ (the original German text is more to the point: ‘Tiere sind Gefühlsmenschen mit sehr wenig Verstand’). What they wished to emphasize was that when animals are closely observed for longer periods of time, at least vertebrates give the impression of being strongly governed in their behaviour by their emotions and affects [Lorenz, 1973; Scherer and Ekman, 1984; Bischof-Köhler, 1989]. Animals do not seem to do things merely because of the external results which may be achieved; they very often seem to do the things they do because the activity itself affords them some kind of pleasure, or in order to alleviate or eliminate some kind of discomfort. Furthermore, it is often quite clear that they are entirely unaware of the function of their behaviour.

The following example may illuminate the matter at hand. When warming her eggs in the nest, a female bird breeding for the first time cannot have any idea whatsoever of the function of brooding. She has never seen eggs before in her life, and so it is impossible for her to have any notion of what eggs are, and of the necessity of keeping a certain temperature in order for them to develop. She may, of course, warm the eggs purely automatically, by way of some kind of response which is triggered in the nervous system by the key stimulus ‘eggs’ and then releases the behaviour in an entirely reflexive manner. Behaviourists like Pavlov or Skinner would probably have maintained this. However, the fact that the female will display stress symptoms when prevented from brooding, and symptoms of relaxation when allowed to brood, makes it hard for a human observer not to assume that she experiences some kind of negative emotion when prevented from brooding, that she is in some kind of positive mental state when allowed to brood unimpeded, and that these mental states decide what the bird does – not an awareness of the reason for, or the function of, the behaviour [see e.g. Griffin, 1976, 1984; Roitblat et al., 1984; Dennett, 1990; Sjölander, 1999].

The common textbook explanation for the singing of the blackbird is that the bird announces that it is a blackbird, that it belongs to the male sex, that it claims a certain territory, and also its individual identity. However, there are no indications of this being the most plausible proximate explanation for its singing; on the contrary, there is evidence suggesting that it is a highly implausible explanation. A blackbird captured in spring will continue to sing in captivity, loudly and intensely, despite the fact that all the ulterior functions of singing have been obliterated [Dabelsteen, 1985]. Like the brooding female, it will display symptoms of nervousness and restlessness if kept from singing. Here, too, it is difficult to refrain from making the inference, on the basis of the posture and the general attitude of the bird, that it not only has a ‘need’ to sing, but also ‘enjoys’ singing when permitted to do so.

The present-day trend in ethology towards ultimate explanations for behaviour has lessened the interest in proximate mechanisms, apart from those operating on a neural or hormonal level, with the result that discussions of emotion and affect in animals are often regarded as fruitless.
An explanation to the effect that a blackbird sings because it is unpleasant to refrain from singing, and pleasant to sing, is often regarded as either trivial or unscientific, with few if any interesting repercussions. Like Skinner, many biologists think that those aspects of behaviour that are impossible to investigate by way of direct observation, experiments, and quantitative methods should be paid no regard. As we can never acquire any scientifically warranted beliefs about the subjective experiences of animals, we had better leave these experiences aside.
The problem of other minds – of whether and how anyone really knows that any other organism is sentient – represents an age-old discussion in philosophy, but it gained wide attention in modern times with the rise of evolutionary epistemology [e.g. Campbell, 1974; Riedl, 1980, 1987; Vollmer, 1983; von Glasersfeld, 1987; Oeser and Seitelberger, 1988; Riedl and Wuketits, 1987; Sjölander, 1999, and others]. Epistemological solipsism, or the theory that I can only acquire real knowledge about my own mental states, leads us nowhere, scientifically speaking, and so is of very limited interest outside of philosophy, even though it cannot be denied that I have a certain ‘privileged access’ to my own mental state. (Thus, epistemological solipsism should not be confused with metaphysical solipsism, i.e. the theory that I am the sole existent.) To be sure, there is no logical contradiction in suggesting that the behaviour of others can in every single respect resemble mine, but that they still – for all I know – might be unconscious automata. Logical possibilities aside, many thinkers have still found the so-called ‘argument from analogy’ very persuasive: I know that my own mental states may be accompanied by certain behaviours, and when I observe that other bodies similar to mine display similar behaviours, I infer that these behaviours, just like my own, accompany certain mental states. By analogy, it seems reasonable to draw the conclusion that other people have the same capacity for emotions, affects, and experiences that I have, since they are human beings like myself.

The case for making the same inference as regards members of species other than Homo sapiens – primarily other mammals, but also birds and lower vertebrates – may perhaps seem somewhat weaker, pending further scientific investigation. However, it is interesting to note that if I affirm the proposition that other people have the same kind of negative emotions that I experience when harmed, but deny that other animals, closely related to man, may have those experiences under similar conditions, then I am on a ‘slippery slope’, epistemically speaking. As soon as I take the first step of recognizing that other humans may experience certain kinds of emotion, because they resemble me in various ways, I will start sliding down the slope towards a recognition that other animals may experience the same things, because they in part have a construction similar to man’s and share the same general evolutionary background. The progress towards this concession can only be stopped by providing scientific evidence to the effect that there are important differences between humans and other mammals, differences that make the belief that other mammals lack the capacity for psychological experiences a scientifically warranted conclusion. Thus, the burden of proof lies on those who take sides with thinkers like Descartes, denying that animals have this mental capacity and claiming instead that, for instance, the howling of a dog in pain is a purely automated response, not accompanied by any subjective experience of pain. Obviously, we have already assumed this not to be the case, since we have legislation against cruelty to animals.

Applying Occam’s razor would, in this case, not so much mean that the number of capacities acknowledged in a species should be minimized, as being economical as regards the number of general principles postulated for the behaviour of various species across the board.
It is uneconomic to appeal to one principle when explaining behaviour in, e.g., chimpanzees, and quite another one in the explanation for the same behaviour in humans. If we postulate a model of animal behaviour (at least for vertebrates) on the assumption that there are internal reward and punishment systems, that animals do a lot of what they do in order to get rid of the action of the internal punishment system, or in
order to attain a desirable, pleasant internal state, we would of course do little more than apply an old model of human behaviour [particularly well articulated by Freud, 1941] to animal behaviour. Many biologists might regard such a model with distaste, since it introduces into an otherwise hard natural science an element of non-quantifiable, operationally undefinable, and even metaphysical reasoning. They could claim that even though behaviourism has proven to be an incomplete empirical theory of behaviour, this should not prevent us from making a commitment to methodological behaviourism, by sticking to observable, quantifiable factors that go into the black box, and observable, quantifiable items that come out of it. Animals copulate in order to have offspring, and we can measure both sides of the matter. Who cares if animals actually copulate because of the pleasure this gives them, and not because it is an automated instinct or because they know what the function of copulation is? What is the point of a model that presupposes an intermediate factor, i.e. the attainment of pleasure, the gratification of desire, and/or the alleviation of displeasure?

Functional explanations of animal behaviour, e.g. that birds build nests in order to have a place to lay their eggs, are inadequate insofar as they do not explain at all why individual birds behave as they do, here and now. Ultimate explanations, e.g. to the effect that birds that developed a nest-building behaviour improved their reproductive success and were selected for through many generations, are fruitful in that they explain the evolution of a genetic program, but fall short of saying anything about the motivational force behind the actions of individual birds. Both these explanatory models are especially inapt when it comes to accounting for all the instances where animals do something out of context, spontaneously, or ‘just for the fun of it’. Well aware of the function of copulation and its evolutionary background, humans will often go to great lengths in avoiding its biological outcome. Since in our case the activating of an internal reward system (and the satisfaction of a specific need) is so blatantly the reason for copulation, it seems sensible to make the same inference regarding other vertebrates, pending evidence to the contrary.

Animal behaviour abounds with instances where the individual animal does things just for the sake of doing it, as far as we can see. The cat that plays with a mouse, catching it, releasing it, hitting it to make it move, catching it again, throwing it in the air, and finally eating it, is a well-known example. It is hard, perhaps impossible, to find an ultimate or functional explanation for this behaviour in a grown cat. However, the situation becomes different when we consider a kitten. The kitten clearly has an innate predisposition for catching small prey, as anybody raising a kitten will find out. But this program, or software, runs into difficulties when it is applied to the hardware, that is, the body, which in a growing kitten is subjected to constant change. Consequently, if the highly complicated, rapid movements involved in hitting a fleeing object, jumping at it, biting it, etc., are to be performed with sufficient accuracy, then almost daily adjustment is called for. The kitten needs repeated updating of its bodily self-assessment and self-adjustment; the software must be changed to fit a changing hardware.
It is not just the fine-tuning of a complex behaviour that is involved; it is perhaps of equal importance to adjust the behaviour so that energy expenditure is minimized. Clearly, this importance differs relative to different behaviours. Locomotion, used daily and often for long periods, must necessarily be highly optimized as far as energy expenditure goes, whereas behaviours rarely performed may be in lesser need of such optimization. But a sound assumption is that all behaviours are, to a greater or lesser degree, adapted to minimize energy expenditure [Lindblom, 1983, this vol.].
It clearly has a survival value if, once it is weaned, the kitten is already well adjusted in its hunting behaviour. (Despite claims to the contrary, the mother cat teaches her kittens nothing, and they do not mimic her behaviour. By bringing injured prey she rather gives them the opportunity to practice and learn by themselves.) What would be a suitable predisposition in the kitten for it to be successful in this daily adjustment of the software to the hardware, for making the execution of it smooth, efficient and energy-saving? The answer is obvious: a disposition for playing. The more the kitten plays, the better the adjustment will be. [For a further discussion on play behaviour, see e.g. Burghardt, 1999.] As for the motivation of the kitten to perform the play behaviour, why look any further than to a mammal we all know first hand? If you play because it is fun, that is, because you get emotionally rewarded inside your brain, you need neither enforcement nor reward from the external world. And the adjustment of hardware and software will be achieved well ahead of the time when an actual need arises. Evolution will favour playing that results in competences you need in the future, and this by the same rules that are responsible for bringing other adaptations into existence.

Whereas a cat should predominantly play in the context of prey catching (but also fighting), a dog should primarily play in such a way that it practices fighting (but also chasing and bringing down prey). The prey catching of a cat is a more complex behavioural pattern than prey catching in wolves, and the cat only fights during the breeding season. Wolves display a less complicated prey-catching behaviour, but spend their days in a state of perpetual danger of fights in the pack.

If we explain prey-catching play in the kitten as an adjustment of software to hardware, what about the behaviour of the grown cat, when it releases the mouse, only to catch and release it again, and again? On the assumption that cats catch mice because it is highly enjoyable to do so, this behaviour is easily explained in a house cat that normally does not have to resort to hunting in order to survive. If, however, cats catch mice with the sole purpose of eating them, this behaviour is incomprehensible. That the behaviour remains observable in the domestic cat may be explained by the breeding towards infantile behaviour and attitudes in the domestic cat, as well as the lack of opportunities to hunt [Lorenz, 1973].

Why does a foal suddenly, for seemingly no reason at all, break into a gallop, jump, and prance? Because foals that ‘enjoyed’ doing this adjusted their software to the hardware, concerning fleeing, and were well prepared the day predators attacked. It would seem unnecessary to seek any other explanation for similar motoric behaviour in our own species. It is, of course, possible that our children learn to walk, jump, and gallop because we encourage them to do so and reward them when they succeed, and that they would remain on all fours in the absence of our encouragement and rewards. But anybody who has seen a baby’s face radiating happiness after the first successful attempt at walking will have difficulties believing that walking does not in itself release some kind of very gratifying state in the baby’s mind, that walking is its own reward, no matter how many candies the baby is offered. From a biologist’s point of view, we have every reason to believe that a similar explanation holds true for the babbling of human babies [e.g.
Oksaar, 1979]. We clearly have the ability to produce a great many different sounds, using a highly complex apparatus. If other animals need to play in order to adjust their behavioural programming to their individual body, in its present and changing state, it seems sensible that children who enjoyed the production of sounds for its own sake were also the ones well
prepared to acquire a language, and that such a predisposition was favoured by evolution. And since the brain and its programming cannot anticipate which particular local language the child will learn later on, it must be open to all alternatives, and the baby should play with sounds within the full potential. There are, of course, many indications suggesting that later language acquisition, as well as babbling, is not primarily governed by a behaviouristic reward system [e.g. Studdert-Kennedy, 1982; Lieberman, 1984]. After all, children learn many words without being immediately rewarded for doing so, and it is probably possible to maintain that acquisition may take place in the complete absence of rewards. That learning words may lead to an immediate reward does not preclude the possibility of learning for its own sake – one would rather expect the two factors to interact and mutually reinforce one another. No doubt an important point in this respect is that children enjoy learning nonsensical rhymes and other meaningless but nice-sounding (sic!) sound sequences. Performing series of sounds – meaningful or not – is clearly internally rewarding, as is evidenced by observing children as well as by introspection.

The willingness to learn how to sing, or the tendency towards music, may be taken as a particularly apt example of behaviour performed ‘just for the fun of it’. Music is in many respects very close to language: in the production of sounds, in the building of lengthy, complicated structures, and in the utilization of timbre as an indicator/releaser of moods. It is done for the immediate pleasure of performing it, being an activity which is its own justification, since it is intrinsically pleasant. We can discern no functional reason for doing it, and ultimate explanations of why the performing of music should increase your fitness are, to say the least, speculative (although the behaviour of young ladies at rock concerts might give indications that there could be some connection). Music done for its own sake is ubiquitous; talking for its own sake, not only in order to exchange information, is likewise easy to observe, and clearly is of great social importance [e.g. Dunbar, 1997].

So, why do birds sing, cats play, and babies babble? The answer could well be: for the fun of it. If anybody would like to retort that this is trivially true, all the better. It is. But in a number of widespread models of both human and animal behaviour, it is not seen as a trivial truth; in fact, it is not taken as a truth at all. But the fact that we cannot provide an operational definition of what ‘for the fun of it’ really means does not entitle us to neglect it, by behaving or thinking as if such subjective experiences did not exist. Our present inability to imagine what an internal, mental rewarding or punishing system really might be, and what neural correlates such experiences may have, should not lead to the simple solution of pretending that such experiences do not exist, that they do not play an important, probably decisive role in animal behaviour. The fact that a phenomenon is inaccessible to our present-day science, because it cannot be properly observed or measured, is no ground for excluding it from our theories or models.

If we assume a more or less direct connection between, e.g., the cat’s mouse-play and its function (to get food), it will be quite easy to devise experiments questioning such a proposition.
We might find that cats perform such play much more frequently than necessitated by the function, that they do it out of any functional context, that deprivation of playing has little effect. In short, it will be hard to prove that playing with a mouse is directly related to hunting prowess, and comparatively easy to falsify that proposition. The same may hold true for babbling in babies: the assumption that the sole function of babbling is to practice sound production can easily be questioned, and evidence to the contrary can be found without much trouble.
However, if we assume an intervening internal rewarding system – that such behaviours are done not primarily or directly to practice, to adjust software to hardware, but for the fun of it – the matter will look different. If the connection is not a primary, functional one, but a secondary one, where the adjustment of software to hardware is an indirect outcome (but nevertheless the one selected for), it is far easier to understand why we often see the behaviour performed out of context, much more than needed, and why deprivation may have little effect.

In the latter case an admonition might be justified: we have to have rather sophisticated measuring devices to be able to say that a behaviour, despite absence of practice, develops in a ‘perfectly normal’ way. Most people can perform a high jump, but there are differences between an Olympic athlete and an aged zoology professor as regards the outcome, even if both jump in a perfectly normal way. If a child deprived of babbling develops to be ‘perfectly normal’, this cannot be taken as proof that babbling does not act in an indirect way to economize and smooth the production of sound, any more than the fact that a cat that has never played with mice may catch one on the first attempt proves that playing does not have an improving effect on the behaviour. In particular, the indirect connection between play and practice cannot be used as an argument for the thesis that evolution cannot have favoured a genetic constitution where complex behaviours are performed for their own sake, for the fun of it. On the contrary, it is easy to see the advantages if the adjustment of the behaviour is done before it is actually put to the test. And we do not have to suppose that it always and in all individuals has a positive outcome. Like other adaptations, an internal rewarding system will be selected for if it sufficiently often means a fitness improvement, however small, for some individuals, compared to the cost. And playing, doing things for the fun of it, is typically done in situations involving low cost: when there is little danger, and when the individual can afford the energy expenditure [e.g. Smith, 1984].

Biology cannot tell us what the internal reward, the pleasing experience, the ‘fun’ is. That subject belongs to another discipline, i.e. the philosophy of mind. However, it can be shown that, in order to account for certain common, if not universal, aspects of animal and human behaviour, an inner, subjective reward or punishment of some kind has to be invoked; otherwise, these aspects of behaviour will remain unexplained and incomprehensible. Moreover, biology can explain in a convincing way why an individual who enjoyed doing certain things for their own sake might be better prepared for life and more successful. A blackbird who sings for the fun of it will achieve all the functional goals – announcing his species, sex, and territorial claims, as well as his individual identity. But he does not have to know any of this; he just has to enjoy singing. After all, that is the reason why I myself sing. So why should I exclude this from my explanations as to why the bird sings, the cat plays, and the baby babbles?
Acknowledgements I wish to thank Jeanette Emt for commentaries and substantial improvements of the reasoning.
References
Bischof-Köhler, D.: Spiegelbild und Empathie (Huber, Bern 1989).
Burghardt, G.: Conceptions of play and the evolution of animal minds. Evol. Cogn. 5: 114–122 (1999).
Campbell, D.: Evolutionary epistemology; in Schilpp, The library of living philosophers, vol. 14 (Open Court, Lasalle 1974).
Dabelsteen, T.: Messages and meanings of bird song with special reference to the blackbird (Turdus merula) and some methodology problems. Biol. Skr. Dan. Vid. selsk. 25: 173–208 (1985).
Dennett, D.C.: The intentional stance (MIT Press, Cambridge 1990).
Dunbar, R.: Grooming, gossip and the evolution of language (Harvard University Press, Cambridge 1997).
Freud, S.: Formulierungen über zwei Prinzipien des psychischen Geschehens (1941). Gesammelte Werke (Imago, London 1950).
Glasersfeld, E. von: Wissen, Sprache und Wirklichkeit: Arbeiten zum radikalen Konstruktivismus (Vieweg, Braunschweig 1987).
Griffin, D.R.: The question of animal awareness (Rockefeller University Press, New York 1976).
Griffin, D.R.: Animal thinking (Harvard University Press, Cambridge 1984).
Koenig, O.: Wozu hat aber das Vieh diesen Schnabel? (Piper, München 1988).
Lieberman, P.: The biology and evolution of language (Harvard University Press, Cambridge 1984).
Lindblom, B.: Economy of speech gestures; in MacNeilage, The production of speech (Springer, New York 1983).
Lindblom, B.: Emergent phonology. Phonetica (this vol.).
Lorenz, K.: Die Rückseite des Spiegels: Versuch einer Naturgeschichte menschlichen Erkennens (Piper, München 1973).
Oeser, E.; Seitelberger, F.: Gehirn, Bewußtsein und Erkenntnis (Wissenschaftliche Buchgesellschaft, Darmstadt 1988).
Oksaar, E.: Spracherwerb und Kindersprache in evolutiver Sicht; in Peisl, Mohlner, Der Mensch und seine Sprache (Propyläen Verlag, Berlin 1979).
Riedl, R.: Biologie der Erkenntnis: Die stammesgeschichtlichen Grundlagen der Vernunft (Parey, Berlin 1980).
Riedl, R.: Begriff und Welt: Biologische Grundlagen des Erkennens und Begreifens (Parey, Berlin 1987).
Riedl, R.; Wuketits, F.: Die evolutionäre Erkenntnistheorie: Bedingungen, Lösungen, Konsequenzen (Parey, Berlin 1987).
Roitblat, H.L.; Bever, T.G.; Terrace, H.S.: Animal cognition (Erlbaum, Hillsdale 1984).
Scherer, K.R.; Ekman, P.: Approaches to emotion (Erlbaum, Hillsdale 1984).
Sjölander, S.: On the evolution of reality – some biological prerequisites and evolutionary stages. J. theoret. Biol. 187: 595–600 (1997).
Sjölander, S.: How animals handle reality – the adaptive aspect of representation; in Riegler, von Stein, Peschl, Does representation need reality? pp. 277–281 (Kluwer Academic/Plenum Publishers, New York 1999).
Smith, P.K.: Play in animals and humans (Blackwell, London 1984).
Studdert-Kennedy, M.: The beginnings of speech; in Immelmann, Barlow, Petrinovich, Main, Behavioral development in animals and man (Parey, Berlin 1982).
Tinbergen, N.: The study of instinct (Oxford University Press, Oxford 1951).
Vollmer, G.: Evolutionäre Erkenntnistheorie (Hirzel, Stuttgart 1983).
Biology of Communication and Motor Processes Phonetica 2000;57:205–218
Received: November 1, 1999 Accepted: February 14, 2000
The Phonetic Potential of Nonhuman Vocal Tracts: Comparative Cineradiographic Observations of Vocalizing Animals W. Tecumseh Fitch Department of Organismic and Evolutionary Biology, Harvard University and Program in Speech and Hearing Science (Harvard/MIT), Cambridge, Mass., USA
Abstract
For more than a century it has been noted that the adult human vocal tract differs from that of other mammals, in that the resting position of the larynx is much lower in humans. While animals habitually breathe with the larynx inserted into the nasal cavity, adult humans are unable to do this. This anatomical difference has been cited as an important factor limiting the vocal potential of nonhuman animals, because the low larynx of humans allows a wider range of vocal tract shapes, and thus formant patterns, than is available to other species. However, it is not clear that the static anatomy of dead animals provides an accurate guide to the phonetic potential of the living animal’s vocal tract. Here I present X-ray video observations of four mammal species (dogs, Canis familiaris; goats, Capra hircus; pigs, Sus scrofa, and cotton-top tamarins, Saguinus oedipus). In all four species, the larynx was lowered from the nasopharynx, and the velum was closed, during loud calls. In dogs this temporary lowering was particularly pronounced. Although preliminary, these results suggest that the nonhuman vocal tract is more flexible than previously supposed, and that static postmortem anatomy provides an incomplete guide to the phonetic potential of nonhuman animals. The implications of these findings for theories of speech evolution are discussed. Copyright © 2000 S. Karger AG, Basel
Introduction
Movements of the human vocal tract during speech production are of paramount importance in the production of spoken language, and have been subjected to intense scrutiny by speech researchers for decades. It is thus surprising that, except for a few ground-breaking studies [Lieberman, 1968; Lieberman et al., 1969; Andrew, 1976], research on vocal production in nonhuman mammals has focused almost entirely on the anatomy and physiology of the larynx. Little is known about the anatomy of the other portions of the vocal tract or their dynamics during vocalization. This is not because the role of the vocal tract in mammalian vocalizations is negligible. The static
anatomy of the vocal tract plays a crucial role in determining formant frequencies in animal calls [Fitch, 1997; Riede and Fitch, 1999]. Recent work indicates that animals dynamically manipulate their supralaryngeal vocal tracts while vocalizing [Bauer, 1987; Hauser et al., 1993; Hauser and Schön-Ybarra, 1994]. Finally, perceptual studies indicate that conspecific listeners could perceive the resulting changes in formant frequencies with an accuracy rivaling that of humans [Owren, 1990; Sommers et al., 1992]. Thus, after decades of neglect, the role of vocal tract movements and formant frequencies in animal communication is becoming a focus for renewed research efforts.

Formant-like spectral features are present in the vocalizations of many different nonhuman animals, including alligators, some birds, and many mammals including nonhuman primates. In a few species (dogs and macaques) these spectral features have been shown to be formants by combining anatomical measurements of vocal tract length with acoustic analysis of the same individuals’ vocalizations [Fitch, 1997; Riede and Fitch, 1999]. The length of an air tube plays a critical role in determining the spacing of its resonant frequencies, along with other factors such as the location of constrictions in the tube, or end effects. For a relatively uniform tube, formant spacing should be accurately predicted by vocal tract length, as was found in both of the above studies. Thus, both dog and monkey vocalizations possess formants, and their formant frequencies are largely determined by static vocal tract anatomy.

However, little is currently known about vocal tract dynamics during nonhuman vocalization. Human speech is characterized by rapid, precise movements of vocal tract articulators (lips, tongue, jaw, velum, larynx). The resulting changes in the shape of the supralaryngeal vocal tract (specifically, its cross-sectional area function) lead to the dynamic pattern of formant variation which typifies human vocal communication. Spectrographic observations and several more direct techniques suggest that some animal vocalizations also involve such movements. For example, Bauer [1987] and Hauser et al. [1993] used video analysis to demonstrate that changes in lip configuration were associated with acoustic changes in chimpanzee and macaque vocalizations, and Hauser and Schön-Ybarra [1994] experimentally induced significant acoustic changes in macaque vocalizations by immobilizing the lips with xylocaine injections. Unfortunately, such analyses can offer only a glimpse of the full range of articulatory possibilities open to the animal, since most of the important articulators (tongue, larynx, velum) are typically invisible. X-ray video, or cineradiography, offers an ideal window into such vocal tract movements. Because the anatomical structures comprising the vocal tract overlap nearly completely with those involved in swallowing and feeding, techniques developed and tested for swallowing research can be readily adapted for studies of articulation during vocalization. Unfortunately, the only published cineradiographic observations of nonhuman animals vocalizing are extremely schematic and present no detailed data [chickens, White, 1968, and guinea pigs, Arvola, 1974]. In this paper, I report cineradiographic observations of four mammal species (dogs, goats, pigs and tamarins, which are New World monkeys).
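As a back-of-the-envelope illustration of the tube-length relation invoked here: for an idealized uniform tube closed at the glottis and open at the lips, the resonances fall at odd multiples of c/4L, so neighbouring formants are spaced c/2L apart. The sketch below assumes this idealized geometry; the 17-cm tract length and the speed of sound are illustrative values, not measurements from any species in this study.

```python
def formants_uniform_tube(length_m, n_formants=4, c=350.0):
    """Resonances of a uniform tube closed at one end (glottis) and open at
    the other (lips): F_n = (2n - 1) * c / (4 * L), where c is the speed of
    sound in m/s. Neighbouring formants are spaced c / (2 * L) apart."""
    return [(2 * n - 1) * c / (4 * length_m) for n in range(1, n_formants + 1)]

# An illustrative 17-cm tract: formants near 515, 1544, 2574, 3603 Hz,
# i.e. an even spacing of about 1029 Hz (c / 2L).
print(formants_uniform_tube(0.17))
```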
This work is part of a larger study of the vocal tract dynamics underlying vocalization in mammals, which will be reported more fully elsewhere. A number of extremely basic questions about animal vocalization are easily answerable using cineradiography. Is mammal vocal tract morphology static or dynamic during vocalization? Are laryngeal calls emitted through the oral or nasal cavities (or both)? To date, despite the common assumption that most mammals vocalize orally, there are no empirical data addressing this question.
Another set of questions of interest to speech scientists relates to the evolution of human speech capabilities. Although animals possess formants, the variety of formant frequency patterns observed in animal vocalizations seems limited relative to those observed in human speech. In contrast to human speech, where a wide variability in the lower formants (especially F1 and F2) is exploited to create a large set of discriminable speech sounds, animal vocalizations appear to have relatively evenly spaced formants that do not vary greatly from those predicted for a constant-diameter tube of the appropriate length [e.g. Fitch, 1997; Riede and Fitch, 1999, but see Richman, 1976, for a dissenting opinion]. In a classic study, Lieberman et al. [1969] combined anatomical investigations of a rhesus macaque with computer modeling techniques to show that the range of vocal tract shapes, and thus formant patterns, that could be produced by this monkey species was quite limited relative to the wide human vowel space. This was also the conclusion of a more recent study [Owren et al., 1997], which found that chacma baboon grunts utilize a very limited range of formant frequencies relative to humans, despite their similar vocal tract lengths.

What is the explanation for the limited variability of formant patterns that typifies most nonhuman mammal species? The traditional explanation derives from the anatomical observations of Negus [1929, 1949], who found that the resting position of the larynx in adult humans differed from that of the other mammals he examined. In particular, while most mammals can insert the larynx into the nasopharynx to form a sealed nasal respiratory path, humans past the age of a few years cannot, due to the much lower resting position of the larynx in our species. Negus [1929, 1949] also observed that the laryngeal position of human newborns resembled that of other mammals more than that of human adults, and described a gradual descent of the larynx in human ontogeny, subsequently verified by later researchers [e.g. Laitman and Crelin, 1976; Sasaki et al., 1977]. This anatomical difference led Lieberman et al. [1969, 1972] to suggest that the high laryngeal position of human infants and nonhuman mammals eliminates the large vertical pharynx typical of humans, and thus physically blocks anterior-posterior movements of the tongue body. Such movements are essential for the ‘two-tube’ vocal anatomy that is necessary to produce certain formant patterns, in particular those that characterize the point vowels /i/, /a/ and /u/ found in almost all human languages [see Lieberman, 1984, 1998, for reviews]. Thus, these researchers hypothesized that the inability of animals to produce a wide range of formant patterns stemmed directly from the anatomy of their peripheral vocal tract, and in particular from the high resting position of the larynx.

This hypothesis was based on anatomical observations of dead, formalin-fixed specimens. The technology necessary to observe the dynamic anatomy of the larynx and vocal tract was not available during the years when Negus was active. Thus Negus [1949] was forced to assume that the static anatomy of the vocal tract, as observed in dead animals, provides an accurate guide to its dynamic capabilities in life. With regard to the position of the larynx in animals, the cineradiographic results presented below call this assumption into question.
In particular, these data show that the larynx is lowered out of the nasopharynx during loud vocalizations in all of the species that have been investigated thus far. In some species, such as dogs, this ‘dynamic descent’ of the larynx is extensive. These data suggest that the differences between human vocal anatomy and that of other species, while indubitably important for speech production, may have been overemphasized.
Methods

Cineradiographic observations were made of 3 dogs (Canis familiaris), 2 goats (Capra hircus), 2 cotton-top tamarin monkeys (Saguinus oedipus) and 1 pig (Sus scrofa). Movements of the tongue, velum, hyolaryngeal apparatus and jaw were digitally videotaped (Sony DCR-VX1000) at 60 frames/s using a Siemens Tridoros 150 G-3 cineradiography system. No animals were harmed in this study: the levels of radiation produced by this system are harmless for the periods for which our subjects were filmed. All protocols were approved by the Harvard University Animal Care and Use Committee. Goats were restrained using a thoracic harness, while the pig and monkeys were kept within the field of view in small transport cages two or three times their body length. Dogs were induced to sit in front of the camera by use of treats. Animals were filmed during chewing, swallowing, lapping water (goats and dogs only), sucking a nipple (pigs only) and vocalization. The piglet (a young animal) vocalized spontaneously and extensively. Goats and monkeys were induced to vocalize by playing recordings of conspecific vocalizations. One dog vocalized (howls and whines) upon hearing his master howling, while another was trained to bark upon verbal command.
Results
The Mammalian Vocal Gesture

We observed the following vocalization types in our experimental animals: dog barks, whines, and panting; pig grunts and squeals; goat quiet and loud bleats, and tamarin long calls, chirps and chatters. In the vast majority of calls we observed, the larynx was lowered from the nasopharynx during vocalizations and the velum was raised, apparently closing off the nasal passage. Thus most of these calls were emitted solely from the mouth. The exceptions were quiet calls (dog whines, pig grunts, and goat quiet bleats): in these calls the larynx remained inserted in the nasopharynx, yielding nasally emitted calls. The degree of laryngeal lowering varied between species, with goats showing the least and dogs the greatest descent (the larynx was too small to be clearly visualized in most frames for monkeys, preventing an accurate estimate of laryngeal lowering in this species). Our observations suggest a relatively stereotyped ‘vocal gesture’ for orally emitted calls, which applies to all four mammal species and is illustrated in figure 1 for dog barking:
(1) Prevocal breathing: larynx engaged in nasopharynx (velar-epiglottal contact, fig. 1a).
(2) Arytenoid tensing: the arytenoid cartilages rotate the vocal folds into phonatory position.
(3) Laryngeal lowering: the larynx is retracted from the nasopharynx (loss of contact between epiglottis and velum, fig. 1b).
(4) Jaw opening: the mandible lowers.
(5) Velar tensing: the velum rises, closing off the nasal passage (fig. 1c; vocalization occurs in this example in fig. 1d).
(6) Cessation: the articulators return to their prephonatory position, except that the epiglottis does not return to the intranarial position until swallowing.
(7) Swallowing: returns the epiglottis to the retrovelar position of (1).
Vocalization can occur at any of the stages between (2) and (7). Some degree of laryngeal lowering occurred in all species. A second clear finding is that the velum is extremely mobile in all of these species and appears to completely close off the nasal passages during loud vocalizations in all four species. This is illustrated in figure 2 for monkeys, and figure 3 for dogs, goats and pigs. Besides this basic finding of laryngeal lowering and velar closure during most vocalizations, the most surprising finding was that the postvocal return of the larynx to the intranarial position requires swallowing. In other words, after being pulled out of contact with the velum by laryngeal descent, the
Fig. 1. Four frames illustrating the vocal gesture associated with dog barking. a Resting breathing. b Laryngeal lowering. c Velar closing. d Bark.
Fig. 2. Cotton-top tamarin chatter call illustrating velar closure.
In other words, after being pulled out of contact with the velum by laryngeal descent, the epiglottis appears to remain subvelar until a swallow occurs. In the case of both dogs and goats, this period of both oral and nasal respiration can last for longer than 10 s. This delayed postvocal return is independent of the oral breathing seen during panting, which was also observed in both of these species. Certain vocalizations (dog whines and pig grunts) appear to be nasally emitted. In the case of a panting dog (which was breathing orally), we observed a raising of the larynx to touch the velum during the whine, after which the epiglottis lowered back to rest against the tongue root. Thus, this call (a high-pitched, relatively tonal vocalization) appears to be an obligate nasal vocalization (at least for that individual). This gesture is illustrated in figure 4. All pig grunts appeared to be nasally emitted. Pig grunts thus require little muscular activity beyond vocal fold tensing to be produced.
Mammalian Phonetic Capabilities
These data indicate a minimum of four binary phonetic distinctions that would be available to these species.
Fig. 3. Velar closure in three species: dogs (a), goats (b), pigs (c). Top frame illustrates resting breathing in each case, while bottom shows velar closure during vocalization. Note that the dog is panting, and thus breathing both orally and nasally.
These are illustrated by differences in the naturally produced vocalizations of the dog: (1) jaw position: open or closed (± open) – bark vs. growl; (2) velum position: open or closed (± nasal) – whine vs. bark; (3) excitation: voiced or turbulent (± voiced) – growl vs. pant, and (4) segment temporal duration: short vs. long (± long) – whine vs. growl. These distinctions are not meant to correspond exactly to particular distinctive features in human speech, or to particular phonetic classes such as consonants or vowels, but simply to indicate the ability to modulate an ongoing vocalization stream in various ways. They represent an extremely conservative estimate of the phonetic capabilities of the dog’s vocal tract, since they are based on what dogs were actually observed to do during vocalization.
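Viewed purely combinatorially, four independent binary distinctions already define 2^4 = 16 distinguishable call types. The short sketch below (purely illustrative, not part of the original study; the feature names merely echo the list above) enumerates them:

```python
# Illustrative sketch (not from the original study): four independent binary
# phonetic distinctions define 2**4 = 16 distinguishable call types.
from itertools import product

features = ["open", "nasal", "voiced", "long"]   # the distinctions (1)-(4) above

call_types = list(product("+-", repeat=len(features)))
for values in call_types:
    print(" ".join(sign + feature for sign, feature in zip(values, features)))

print(len(call_types), "call types")             # 16
```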
Fig. 4. Dog whine during panting. During panting breathing (a), air is inspired through the nose and then exhaled through the mouth, which is accomplished by rapid pulsation of the velum. During whining (c) (inspired by the dog’s owner singing), the epiglottis is pulled backwards into contact with the velum to produce a nasally emitted vocalization. Note, prior to phonation (b), the darkening of the vocal folds due to tensing.
Fig. 5. Retroflex tongue position in panting dog.
Fig. 6. Tongue body retraction in dog. Note slight concomitant laryngeal lowering.
A more realistic appraisal of the phonetic capabilities of the canine vocal tract, supposing that a human nervous system were in control, would include additional distinctions that appear to be available based on observations of nonvocal behaviors (panting, chewing, swallowing, and lapping). Examples include: (5) place of articulation (labial, dental, ‘velar’); (6) tongue blade flat or recurved (± retroflex, fig. 5); (7) taps and trills (by velum or tongue blade), and (8) other vowels (some front-back capability, fig. 6). Two of these possibilities are illustrated by the X-ray frames in figures 5 and 6. ‘Velar’ place of articulation is placed in quotes because the dog’s vocal tract does not have the strong angulation at the oral/pharyngeal junction that is typical of humans. Thus, contact between the velum and the back of the tongue, while clearly possible for the dog, would not have an acoustic effect equivalent to a velar place of articulation in humans. As noted above, it is the loud calls that are invariably associated with open mouths, a closed velum and a retracted larynx, a vocal configuration that renders these calls purely oral. One plausible explanation for this nonnasal configuration is that the nasal cavities absorb more acoustic energy than the oral cavities, due to the large and compliant surface area created by the nasal side branches and the nasal turbinates. To test this prediction, I broadcast sound through the oral and nasal cavities of fresh cadaver heads (sheep and rhesus macaque) obtained from a veterinary morgue. A small piezoelectric speaker generating pulse trains at various fundamental frequencies was inserted in the nasopharynx or oropharynx, and the transmitted sound was recorded 10 cm from the tip of the snout. The purely oral sounds had peak amplitude levels five times those of the nasally emitted sounds. RMS amplitude values of the oral sounds over a 400-ms window were 14–15 dB above those of the nasal sounds. Thus, the nasal cavities absorb a considerable portion of the acoustic energy generated at the larynx, and to produce maximally loud vocalizations an animal should produce purely oral calls.
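The two measurements are mutually consistent: a 5:1 amplitude ratio corresponds to 20·log10(5) ≈ 14 dB. A minimal sketch of such a level comparison, assuming the oral and nasal recordings are available as arrays (the synthetic waveforms and the sampling rate below are stand-ins):

```python
# Sketch of the oral/nasal level comparison reported above; the waveforms here
# are synthetic stand-ins for the actual recordings.
import numpy as np

def rms(x):
    """Root-mean-square amplitude of a signal segment."""
    return np.sqrt(np.mean(np.square(x)))

fs = 44100                               # assumed sampling rate (Hz)
n = int(0.400 * fs)                      # the 400-ms analysis window used above

rng = np.random.default_rng(0)
oral = rng.normal(scale=5.0, size=n)     # stand-in: ~5x the nasal amplitude
nasal = rng.normal(scale=1.0, size=n)    # stand-in nasal recording

print(f"{20 * np.log10(rms(oral) / rms(nasal)):.1f} dB")  # ~14 dB for a 5:1 ratio
```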
Discussion
While the current results are preliminary, involving only a few individuals of a few species, they do permit us to draw some important conclusions about the relationship between peripheral anatomy and vocalization in nonhuman animals. In particular, the cineradiographic observations clearly indicate that animal vocal tracts are surprisingly elastic and mobile, and that dead (and typically formalin-fixed) specimens provide a poor guide to the range of vocal movements available to the living animal. These data indicate that the vocal tract configuration of vocalizing animals, at least in dogs, pigs, goats and monkeys, is more similar to that of human talkers than was previously inferred on the basis of dissections of dead animals. All four species appear to raise the velum, closing off the nasal airway, during loud vocalizations. Finally, the current results show that all four nonhuman species examined can and do lower their larynges during loud vocalizations, either to a relatively minor degree (goats) or to a surprisingly extensive degree (dogs). These data suggest that the differences between the vocal tracts of humans and nonhuman mammals, particularly regarding the resting position of the larynx, have been overemphasized. In particular, the notion that the larynx physically blocks anterior-posterior movements of the tongue body in nonhuman animals must be reevaluated. Our data suggest that by tensing the strap muscles connecting the hyolaryngeal apparatus to the sternum (sternothyroid and sternohyoid), most mammals can pull the larynx and tongue downward out of the nasopharyngeal region and thus provide space for at least some anterior-posterior movement of the tongue body.
All of our subject species do so during loud vocalizations. In the case of species such as the dog, this laryngeal lowering is extreme enough to provide a roomy pharyngeal cavity, which could potentially allow dogs to produce a much wider variety of formant frequencies than they do. In short, these observations indicate that the high resting position of the normal mammalian larynx cannot by itself explain the reduced variability in the formant patterns produced by nonhuman animals. Observations of tongue movements during chewing and swallowing also indicate considerable flexibility and control of the tongue (especially in dogs and goats). This contrasts with the essentially static tongue position observed during species-specific vocalizations. Taken together, the cineradiographic results suggest that the nonhuman vocal tract is more versatile than previously imagined, and could support a much wider variety of vocalizations than any of the species studied actually produce. It thus seems possible that the shortcomings of these species with regard to producing a diversity of formant patterns result less from their peripheral anatomy than from the neural control mechanisms engaged during vocalization. Of course, much more work needs to be done before we have a solid grasp of the possibilities provided by, and the constraints imposed by, the vocal anatomy of other mammals. Hopefully, these observations will lead to a reinvigoration of research on the mechanisms underlying mammalian vocal production and a better understanding of the role of vocal tract dynamics in mammals. In addition to providing animal models appropriate for studying certain aspects of human speech production, such research may help clarify the similarities and differences between human vocal tracts and those of other mammals, and thus elucidate the evolutionary path that led to our own unique vocal anatomy.
The Evolution of Speech and Human Vocal Anatomy
Although the data above show that animals lower their larynx during vocalization, the fact remains that humans differ from other species in having a larynx that is permanently lowered. How can we explain this difference? I suggest that the hypothesis of Lieberman et al. [1969, 1972] still provides the best explanation for the current adaptive value of the descended larynx: it is an adaptation for producing articulate speech. For an organism that spends a lot of time talking, the need to tense the strap muscles (sternothyroid and sternohyoid) before each vocalization might prove significantly more energetically costly than simply leaving the larynx low. The relatively small size of the strap muscles in humans relative to (for example) rhesus macaques is consistent with this hypothesis. Also, a stable low position of the tongue root may afford humans greater control in producing speech sounds that require fine control over vocal tract shape (e.g. sibilants, fricatives and perhaps point vowels). Thus, having the skeletal support for the tongue body in a permanently low, stabilized position may provide significant advantages for producing the rapid, precisely controlled vocal tract movements that characterize modern human speech. The standard mammalian vocal gesture described above involves laryngeal lowering during vocalization. It seems plausible that this aspect of mammalian vocal behavior provided a preadaptation for human-style vocalization, where a lowered larynx confers two degrees of freedom for tongue body movement.
Although it would be extremely premature to suggest that humans are the only mammals that have exploited [‘exapted’, Gould and Vrba, 1982] this possibility, the four species of mammals we have studied do not appear to make use of it. In fact, these animals seem to make little use of the dynamic possibilities of their vocal tracts during vocalization, compared to the complex gymnastics of the tongue body and blade, and the lips, that are observed when they feed.
In general, the vocal tract appears to be placed in a static position, ideally suited for maximal amplitude, during the entire call. Yet it is clear that movements of the tongue body made while the tongue root was lowered could produce a wide range of formant patterns. This suggests a phylogenetic sequence for the evolution of human speech abilities (especially formant variation). In stage one, some early ancestor used the standard mammalian vocal gesture to produce calls, but introduced tongue body perturbations during larynx lowering to produce a wider range of formant patterns (and hence a greater diversity of discriminable ‘calls’ or phones). The rhesus macaque ‘girney’ call, which shows clear formant movement across a single utterance, may be representative of such a stage. In stage two, the use of dual degrees of freedom of the tongue body was consolidated into the communication system, with a variety of vowel-like sounds and formant transitions being produced. However, these sounds would be made with a temporarily lowered larynx, and the larynx would be returned to the nasopharynx during resting breathing. Finally, in stage three, the larynx would have assumed a permanent low resting position during ontogeny, as it does today, giving these hominids less effortful speech and perhaps more vocal control, as suggested above. Although this proposed sequence implies that the selective force responsible for the transition to stage three was improved speech ability, there are other possibilities as well. For instance, Michael Owren [pers. commun.] has suggested that vocal tract length plays a crucial role in individual recognition among primates, and that permanent laryngeal lowering was a way of preserving ancestral vocal tract lengths as the snout shortened over the course of hominid evolution. Similarly, it has been suggested [Ohala, 1983, 1984; Fitch, 1994, 1997] that descent of the larynx may have originally functioned as an adaptation to exaggerate the impression of body size conveyed by vocalizations. In all of the mammals that have been examined thus far, there is an overall correlation between an individual’s body size (either height or weight) and its vocal tract length, which in turn determines the formant frequencies of its vocalizations. This vocal tract length/body size correlation has been observed in monkeys [Fitch, 1997], humans [Fitch and Giedd, 1999] and dogs [Riede and Fitch, 1999], suggesting that formant frequencies could provide an accurate cue to body size in a variety of mammals. However, once perceivers use formants to estimate body size, an individual able to elongate its vocal tract and thus lower its formant frequencies could duplicate the formant patterns of a larger individual that lacked this ability, and thus exaggerate the impression of size conveyed by its vocalizations. Perceptual experiments using computer-synthesized vowels indicate that humans do in fact use formants to estimate the body size of speakers [Fitch, 1994]. Consistent with this ‘size exaggeration’ hypothesis, the larynx of human males shows a second descent at puberty [Goldstein, 1980; Fitch and Giedd, 1999], thus elongating the vocal tract relative to that of prepubescent males and females. More extreme examples of vocal tract elongation are seen in many birds [Fitch, 1999] and in fallow and red deer males.
In the latter case, male deer use their powerful strap muscles to retract the larynx nearly to the sternum when producing roar vocalizations during the rutting period [Fitch and Reby, unpubl. data]. These observations suggest that the dynamic laryngeal lowering that precedes vocalization in the mammals studied here also represents a preadaptation to size exaggeration via vocal tract elongation. It is currently unknown to what degree this is typical of other mammal species, but anatomical observations of lions and other
‘roaring cats’ [Pocock, 1916; Hast, 1989], which have a very loose and elastic stylohyoid ligament, suggest that they might also lower the larynx during roaring. Thus, a second adaptive function of a lowered larynx is to allow the production of more impressive vocalizations, with lower formants and a more ‘baritone’ timbre, via vocal tract elongation. It seems possible that this function played some role in the evolution of the human vocal tract, either prior to or simultaneously with selection for enhanced speech abilities.
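Because the formant frequencies of a roughly uniform tract scale inversely with its length, the acoustic payoff of elongation is easy to illustrate with the textbook uniform-tube approximation (a tube closed at the glottis and open at the lips). This is a standard idealization, not a model from the present study; the tract lengths and speed of sound below are merely illustrative.

```python
# Uniform-tube approximation: formant n of a tube closed at one end sits near
# (2n - 1) * c / (4L).  A textbook idealization, not the paper's own model.
def formants(tract_length_m, n_formants=3, c=350.0):
    """Approximate formant frequencies (Hz) for a uniform tube of length L."""
    return [(2 * n - 1) * c / (4 * tract_length_m) for n in range(1, n_formants + 1)]

print(formants(0.17))   # ~17 cm tract: about 515, 1544, 2574 Hz
print(formants(0.25))   # the same tube elongated to 25 cm: every formant drops
```

Uniform elongation lowers all formants by the same factor, which is exactly the larger-sounding, more ‘baritone’ shift described above.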
Acknowledgments
This work could not have been performed without the extensive guidance and cooperation of Prof. A. W. Crompton, to whom I am greatly indebted. Technical assistance of Rebecca German, Katherine Musinsky, Tomas Ostercowicz and Carol Tomlins was invaluable. The cadaver specimens were made available through the kind assistance of Andrew Hoffman at Tufts Medical School and Jonathan Fritz at the NIH. Comments on this work from Randy Diehl, Klaus Kohler, Philip Lieberman, Peter MacNeilage, John Ohala, Joseph Perkell, Kenneth Stevens, and Michael Studdert-Kennedy are gratefully acknowledged. Most importantly, I thank Bjorn Lindblom for his inspiring research, hospitality in Stockholm and incisive comments on the current work. This research was supported by an NIH postdoctoral fellowship (NIDCD T32 DC00038) to W.T.F.
References
Andrew, R.J.: Use of formants in the grunts of baboons and other nonhuman primates. Ann. N.Y. Acad. Sci. 280: 673–693 (1976).
Arvola, A.: Vocalization in the guinea-pig, C. porcellus L. Annls. zool. fenn. 11: 1–96 (1974).
Bauer, H.R.: Frequency code: orofacial correlates of fundamental frequency. Phonetica 44: 173–191 (1987).
Fitch, W.T.: Vocal tract length perception and the evolution of language; PhD thesis, Brown University (1994).
Fitch, W.T.: Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. J. acoust. Soc. Am. 102: 1213–1222 (1997).
Fitch, W.T.: Acoustic exaggeration of size in birds by tracheal elongation: comparative and theoretical analyses. J. Zool., Lond. 248: 31–49 (1999).
Fitch, W.T.; Giedd, J.: Morphology and development of the human vocal tract: a study using magnetic resonance imaging. J. acoust. Soc. Am. 106: 1511–1522 (1999).
Goldstein, U.G.: An articulatory model for the vocal tracts of growing children; DSc thesis, Massachusetts Institute of Technology (1980).
Gould, S.J.; Vrba, E.S.: Exaptation – a missing term in the science of form. Paleobiology 8: 4–15 (1982).
Hast, M.: The larynx of roaring and non-roaring cats. J. Anat. 163: 117–121 (1989).
Hauser, M.D.; Evans, C.S.; Marler, P.: The role of articulation in the production of rhesus monkey (Macaca mulatta) vocalizations. Anim. Behav. 45: 423–433 (1993).
Hauser, M.D.; Schön-Ybarra, M.: The role of lip configuration in monkey vocalizations: experiments using xylocaine as a nerve block. Brain Lang. 46: 232–244 (1994).
Laitman, J.T.; Crelin, E.S.: Postnatal development of the basicranium and vocal tract region in man; in Bosma, Symp. on Dev. Basicranium (US Government Printing Office, Washington 1976).
Lieberman, P.: Primate vocalization and human linguistic ability. J. acoust. Soc. Am. 44: 1574–1584 (1968).
Lieberman, P.: The biology and evolution of language (Harvard University Press, Cambridge 1984).
Lieberman, P.: Eve spoke: human language and human evolution (Norton, New York 1998).
Lieberman, P.; Crelin, E.S.; Klatt, D.H.: Phonetic ability and related anatomy of the newborn and adult human, Neanderthal man, and the chimpanzee. Am. Anthrop. 74: 287–307 (1972).
Lieberman, P.; Klatt, D.H.; Wilson, W.H.: Vocal tract limitations on the vowel repertoires of rhesus monkeys and other nonhuman primates. Science 164: 1185–1187 (1969).
Negus, V.E.: The mechanism of the larynx (Mosby, St. Louis 1929).
Negus, V.E.: The comparative anatomy and physiology of the larynx (Hafner, New York 1949).
Ohala, J.J.: Cross-language use of pitch: an ethological view. Phonetica 40: 1–18 (1983).
Ohala, J.J.: An ethological perspective on common cross-language utilization of F0 of voice. Phonetica 41: 1–16 (1984).
Owren, M.J.: Acoustic classification of alarm calls by vervet monkeys (Cercopithecus aethiops) and humans. II. Synthetic calls. J. comp. Psychol. 104: 29–40 (1990).
Owren, M.J.; Seyfarth, R.M.; Cheney, D.L.: The acoustic features of vowel-like grunt calls in chacma baboons (Papio cynocephalus ursinus): implications for production processes and functions. J. acoust. Soc. Am. 101: 2951–2963 (1997).
Pocock, R.I.: On the hyoidean apparatus of the lion (F. leo) and related species of Felidae. Ann. Mag. nat. Hist. 8: 222–229 (1916).
Richman, B.: Some vocal distinctive features used by gelada monkeys. J. acoust. Soc. Am. 60: 718–724 (1976).
Riede, T.; Fitch, W.T.: Vocal tract length and acoustics of vocalization in the domestic dog (Canis familiaris). J. exp. Biol. 202: 2859–2867 (1999).
Sasaki, C.T.; Levine, P.A.; Laitman, J.T.; Crelin, E.S.: Postnatal descent of the epiglottis in man. Archs. Otolar. 103: 169–171 (1977).
Sommers, M.S.; Moody, D.B.; Prosen, C.A.; Stebbins, W.C.: Formant frequency discrimination by Japanese macaques (Macaca fuscata). J. acoust. Soc. Am. 91: 3499–3510 (1992).
White, S.S.: Movements of the larynx during crowing in the domestic cock. J. Anat. 103: 390–392 (1968).
Biology of Communication and Motor Processes Phonetica 2000;57:219–228
Received: December 14, 1999 Accepted: February 14, 2000
Dynamic Simulation of Human Movement Using Large-Scale Models of the Body
Marcus G. Pandy, Frank C. Anderson
Department of Kinesiology and Biomedical Engineering Program, University of Texas at Austin, Austin, Tex., USA
Abstract
A three-dimensional model of the body was used to simulate two different motor tasks: vertical jumping and normal walking on level ground. The patterns of muscle excitations, body motions, and ground-reaction forces for each task were calculated using dynamic optimization theory. For jumping, the performance criterion was to maximize the height reached by the center of mass of the body; for walking, the measure of performance was the metabolic energy consumed per meter walked. Quantitative comparisons of the simulation results with experimental data obtained from people indicate that the model reproduces the salient features of maximum-height jumping and normal walking on the level. Analyses of the model solutions will allow detailed explanations to be given of the actions of specific muscles during each of these tasks. Copyright © 2000 S. Karger AG, Basel
Introduction
Many studies have used computer models to simulate the kinematic and kinetic patterns observed during human movement, but very few have included the actions of the muscles in a three-dimensional model of the body. The reason is that detailed computer simulations involving dynamic models of the musculoskeletal system incur great computational expense [Anderson et al., 1995, 1996]. With the emergence of high-speed, parallel supercomputers, it is now possible to use very detailed theoretical models of the body to produce realistic simulations of movement. An overall goal of our ongoing work is to develop a model of the body that could be used to simulate a wide range of locomotor tasks, including walking and running at various speeds. A specific aim of this study was to use such a model, together with dynamic optimization theory, to simulate one full cycle of normal gait. Dynamic optimization was chosen because it provides a very powerful approach for simulating movement. Firstly, the dynamic optimization problem may be formulated independent of experimental data, which means that the motor patterns can be predicted rather than assumed. Secondly, this theoretical approach allows a model of the system dynamics (i.e. the body) and a model of the goal of the motor task to be included in the formulation of the simulation problem.
Fig. 1. Frontal-plane view of the model skeleton. The inertial reference frame is fixed to the ground at the level of the floor. The axes of the inertial frame form a right-handed coordinate system: the X axis is directed forward, the Y axis is directed upward, and the Z axis is directed laterally. Twenty-three generalized coordinates are used to describe the position and orientation of all the body segments in the model.
Since the performance criterion for walking is somewhat ambiguous, we first solved a dynamic optimization problem for vertical jumping. Jumping was chosen not only because it presents a well-defined goal (i.e. to jump as high as possible), but also because it involves the coordinated motion of all the body segments. This task therefore provides an excellent paradigm for evaluating the dynamic response of any model. Once the dynamic optimization problem for jumping was solved and the response of the model validated against experimental data, the same model could then be used with greater confidence to simulate normal walking over level ground.
Methods
Musculoskeletal Model of the Body
The skeleton was represented as a ten-segment, 23-degree-of-freedom (dof) mechanical linkage. The pelvis was modeled as a single rigid body with 6 dof; the remaining nine segments branched in an open chain from the pelvis (fig. 1). The head, arms, and torso (HAT) were lumped into a single rigid body, and this segment articulated with the pelvis via a 3-dof ball-and-socket joint located at the 3rd lumbar vertebra. Each hip was modeled as a 3-dof ball-and-socket joint, and each knee was modeled as a 1-dof hinge.
Fig. 2. Schematic diagram showing some of the muscles included in the model. Fifty-four back, abdomen, and leg muscles were used to actuate the model skeleton.
Two segments were used to model each foot: a hindfoot segment and a toes segment. The hindfoot articulated with the tibia via a 2-dof universal joint comprising two axes of rotation: one for the ankle and the other for the subtalar joint. The toes articulated with the hindfoot via a 1-dof hinge joint. The positions and orientations of the axes of rotation of each joint were based on experimental data reported in the literature [e.g. Inman, 1976]. The interaction of the feet with the ground was simulated using a series of spring-damper units distributed under the sole of each foot. Four ground springs were located at the corners of the hindfoot segment and one was positioned at the distal end of the toes. Each ground spring applied forces in the vertical, fore-aft, and transverse directions simultaneously. Details of the force-displacement-velocity relations assumed for the ground springs are given by Anderson and Pandy [1999]. The model was actuated by 54 musculotendinous units. Each leg was actuated by 24 muscles, and relative movements of the pelvis and HAT were controlled by 6 abdominal and back muscles (fig. 2). The path of each actuator was based on geometric data (musculotendon origin and insertion sites) reported in the literature [e.g. Friederich and Brand, 1990]. Straight lines and combinations of straight lines and space curves were used to represent the three-dimensional path of each muscle in the model. Each actuator was modeled as a three-element, Hill-type muscle in series with an elastic tendon [Zajac, 1989; Pandy et al., 1990].
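To make the foot-ground interface described above concrete, the sketch below implements a generic linear spring-damper contact element. The actual force-displacement-velocity relations are those of Anderson and Pandy [1999]; the linear form and the stiffness and damping constants here are invented for illustration only.

```python
# Generic spring-damper ground-contact sketch (illustrative only; the paper's
# actual relations are given by Anderson and Pandy [1999]).
def ground_spring_force(penetration, penetration_rate, k=2.0e5, c=1.0e3):
    """Vertical ground-reaction force (N) from one spring-damper unit."""
    if penetration <= 0.0:                     # foot is airborne: no contact force
        return 0.0
    force = k * penetration + c * penetration_rate
    return max(force, 0.0)                     # the ground can push but never pull

print(ground_spring_force(0.005, 0.1))         # 5 mm penetration: 1,100 N
print(ground_spring_force(-0.010, 0.0))        # foot above the ground: 0 N
```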
Parameters defining the nominal properties of each actuator (i.e. peak isometric force and the corresponding fiber length and pennation angle of muscle, plus tendon slack length) are reported by Anderson and Pandy [1999]. Muscle excitation-contraction dynamics was modeled as a first-order process [Zajac, 1989]. Details of our neuromusculoskeletal model of the body are given by Anderson and Pandy [1999] and Anderson [1999].
Optimization Problems
The model was used to simulate two tasks: vertical jumping and normal walking on the level. Dynamic optimization theory was used to find the pattern of muscle excitations and the corresponding muscle forces and limb motions subject to a performance criterion which models the goal of each motor task. For jumping, the problem was to find the pattern of muscle excitations needed to maximize jump height (i.e. the height reached by the center of mass of the whole body). No kinematic constraints were imposed on the dynamic optimization solution for jumping; however, to reduce the size of the problem, the muscle excitation histories for the two sides of the body were assumed to be identical. For walking, the performance criterion was to minimize the total metabolic energy consumed by the muscles over one cycle of gait. A large number of experimental studies have shown that walking speed is selected in order to minimize metabolic cost per unit distance traveled [Ralston, 1976]. Thus, the dynamic optimization problem for walking was to find the pattern of muscle excitations needed to minimize metabolic energy consumption per meter walked. Muscle energy production was calculated by summing five terms: basal heat, activation heat, maintenance heat, shortening heat, and the mechanical work done by all the muscles [Woledge et al., 1985]. Some heat is liberated by muscle merely as a consequence of being alive; this is called the resting or basal heat. When muscle contracts, heat is liberated as a result of the movement of calcium in and out of the sarcoplasmic reticulum; this is known as activation heat. Heat is also liberated as a result of the interaction between actin and myosin, as the myosin heads attach to and detach from the actin filaments (cross-bridge cycling) during an isometric contraction; the heat produced during an isometric contraction is called maintenance heat. When muscle contracts and shortens, extra heat is liberated over and above that produced during an isometric contraction; this amount of heat is called shortening heat, or the Fenn effect. If during a contraction a muscle also changes its length, then mechanical work is done by the muscle as it moves the bones about the body joints. From the First Law of Thermodynamics, the total energy produced during a shortening or lengthening contraction is equal to the total heat liberated plus the mechanical work done. Two kinematic constraints were used to simplify the dynamic optimization problem for walking. First, the gait cycle was assumed to be bilaterally symmetric; that is, the left-side stance and swing phases were assumed to be identical with the right-side stance and swing phases, respectively. Introducing this constraint simplifies the problem because it means that only one half of the gait cycle needs to be simulated. Second, the simulated gait pattern was made repeatable by constraining the states of the model to be equal at the beginning and end of the gait cycle.
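The five-term energy accounting described above reduces to simple bookkeeping once the individual heat rates are known. In the sketch below every per-muscle value is a placeholder; the real rate equations follow Woledge et al. [1985] and Anderson and Pandy [1999].

```python
# Bookkeeping sketch of the five-term muscle-energy model described above.
# All numerical values are placeholders, not data from the study.
def metabolic_energy(basal, activation, maintenance, shortening, work):
    """Total energy (J) liberated by one muscle: heat terms plus mechanical work."""
    return basal + activation + maintenance + shortening + work

# Placeholder per-muscle values (J) for a simulated half gait cycle.
muscles = [
    dict(basal=0.4, activation=1.1, maintenance=2.3, shortening=0.8, work=1.5),
    dict(basal=0.4, activation=0.9, maintenance=1.7, shortening=0.5, work=1.0),
]
distance_walked = 0.7    # metres covered in the simulated half cycle (assumed)

total = sum(metabolic_energy(**m) for m in muscles)
print(total / distance_walked, "J/m (illustrative)")   # the walking cost criterion
```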
Computational Method
The dynamic optimization problems for jumping and walking were solved using parameter optimization [Pandy et al., 1992]. In this method each muscle excitation history is discretized into a set of independent variables called control nodes. The problem then is to find the values of all the control nodes which minimize the value of the performance criterion. Once the values of the control nodes have been found, the excitation history for each muscle in the model is reconstructed by linearly interpolating between the control nodes. Sixteen control nodes were used to represent the excitation history for each muscle in the model; for the 54 muscles included in the model, the total number of unknown variables was therefore 864. Computational solutions to the dynamic optimization problems for jumping and walking were found using two parallel supercomputers: an IBM SP-2 and a Cray T3E.
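A minimal sketch of this control parameterization follows (node values and timings are illustrative only): sixteen nodes per muscle for 54 muscles yields the 864 unknowns, and each continuous excitation history is recovered by linear interpolation between nodes.

```python
# Sketch of the control-node parameterization described above; random node
# values and the movement duration are illustrative assumptions.
import numpy as np

n_muscles, n_nodes = 54, 16
controls = np.random.default_rng(1).uniform(0.0, 1.0, (n_muscles, n_nodes))
print(controls.size)                          # 864 unknowns for the optimizer

duration = 1.0                                # movement duration (s), assumed
node_times = np.linspace(0.0, duration, n_nodes)
t = np.linspace(0.0, duration, 1000)          # fine time grid for the simulation

# Reconstruct each muscle's excitation history by linear interpolation.
excitations = np.vstack([np.interp(t, node_times, row) for row in controls])
print(excitations.shape)                      # (54, 1000)
```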
Fig. 3. Vertical ground-reaction force generated by the model (black line) and each of the 5 subjects (gray lines) during the ground-contact phase of a maximum-height, squat jump. For the model, the resultant force was found by summing the forces developed by the ground springs located under the sole of each foot. Time t = 0 s marks the instant that the model and the subjects leave the ground.
Human Experiments
To evaluate the model predictions, kinematic, kinetic, and muscle EMG data were recorded from 5 adult males. Each subject first performed a series of maximal vertical jumps. Each jump began from a static squatting position and was executed with the arms crossed over the chest. The subject then walked at his freely selected cadence and stride length on a level walkway in the laboratory. During each activity the three-dimensional positions of the body segments were recorded using a four-camera, passive-marker, video-based system. Vertical, fore-aft, and transverse components of the ground-reaction force were measured using a six-component, strain-gauge force plate. Muscle EMG data were recorded using surface electrodes mounted over the following muscles on the right side of the body: tibialis anterior, soleus, lateral gastrocnemius, vastus lateralis, rectus femoris, hamstrings, adductor magnus, gluteus maximus, gluteus medius, erector spinae, and the external abdominal obliques.
Results
Jumping
There was quantitative agreement between the response of the model and the way each of the subjects executed a maximal, vertical, squat jump. EMG data indicated a stereotypic pattern of muscle coordination: the back, hip, and knee extensors were activated at the beginning of the jump, followed by the more distal ankle plantar flexors. Peak vertical forces measured for the subjects ranged from 1,500 to 2,100 N, compared with a peak force for the model of 2,000 N (fig. 3). Much smaller forces were exerted on the ground in the fore-aft direction; peak fore-aft forces ranged from 120 to 170 N for the subjects compared with 270 N for the model. The joint-angular displacements of the HAT, pelvis, hips, knees, and ankles were also similar for the model and the subjects during the propulsion phase of the jump.
Fig. 4. Vertical ground-reaction force generated by the model (black line) and each of the 5 subjects (gray lines) during walking. For the model, the resultant force was found by summing the forces developed by the ground springs located under the sole of each foot. OTO denotes opposite toe off, OHS is opposite heel strike, TO is toe off, and HS is heel strike. 0% and 100% indicate heel strike of the same leg (one gait cycle) for the model and the subjects.
The peak vertical acceleration of the center of mass of the model was 19 m/s², well within the range of values measured for the subjects. The model left the ground with a vertical velocity of 2.3 m/s and jumped to a height of 36.9 cm. Subjects left the ground with vertical velocities ranging from 2.0 to 2.5 m/s, and their mean jump height was 37 cm.
Walking
Good agreement was also obtained between the simulation results and the experimental data recorded for gait. The model and the subjects walked at an average speed of 81 m/min, which is very close to the optimal speed estimated by Ralston [1976]. Compared with the muscles of the leg, the abdominal and back muscles were activated at relatively low levels. The back muscles were quiet throughout the gait cycle, except in the neighborhood of opposite heel strike. The abductors were excited in a double burst: the first during double support and the second during the middle portion of single support. Consistent with experiment, vasti and biceps femoris short head in the model were excited simultaneously between heel strike and opposite toe off. It is significant that the dynamic optimization solution predicts cocontraction of the uniarticular quadriceps and hamstring muscles, for this result lends support to our minimum-metabolic-energy hypothesis for walking. Translations of the model pelvis in the sagittal, frontal, and transverse planes were consistent with the prototypical patterns measured for the subjects.
For example, in the transverse plane the pelvis traced a sinusoidal path as it moved forward and shifted laterally from over one foot to over the other. The amplitudes of lateral oscillation for the model and the subjects were 2.6 and 4.0 cm, respectively. The joint-angle trajectories for the back, hips, knees, and ankles in the model were also consistent with kinematic data obtained from the gait experiments. For example, following heel strike the knee flexed 20° until opposite toe-off, extended to near full extension prior to opposite heel strike, flexed to 70° shortly after toe-off, and then extended once again to near full extension at heel strike. The vertical force exerted by the ground showed the familiar double-hump pattern, with the first hump occurring near opposite toe-off and the second occurring just prior to opposite heel strike (fig. 4). The fore-aft component was directed posteriorly from heel strike to 30% of the gait cycle, and anteriorly thereafter. The variation in the mediolateral component was more complicated, but the result predicted by the model was very similar to that measured for the subjects, at least until 40% of cycle time. The model expended metabolic energy at a rate of 6.6 J/(kg·s), which is about 50% higher than the value obtained from oxygen consumption measurements made in people [Burdett et al., 1983]. The model calculations also suggest that the mechanical efficiency of muscle for walking at normal speeds is about 30%, which is a little lower than the value (approximately 40%) obtained from heat measurements made on isolated muscle [Hill, 1964].
Discussion
Several aspects of our work should be compared and contrasted with previous attempts to simulate whole-body movement. First, by changing only the performance criterion and the initial conditions of the task, we have demonstrated that it is possible to use the same model of the body to simulate two very different movements: vertical jumping and walking. Since the performance criterion for vertical jumping is clearly defined, dynamic optimization solutions for jumping are ideally suited to evaluating and refining a model of the system dynamics (i.e. the body). When the model is used to simulate other tasks such as walking and running at various speeds, emphasis may then be placed on evaluating and refining a model of the performance criterion. Second, the simulations of jumping and walking performed in this study are much more elaborate than those published previously. The model used in this study has 23 dof and is actuated by 54 muscles. The number of degrees of freedom is 2–3 times greater and the number of muscles is 2–5 times greater than those included in previous dynamic optimization models of movement [Hatze, 1976; Davy and Audu, 1987; Pandy et al., 1990; Yamaguchi and Zajac, 1990; Tashman et al., 1995]. Finally, our approach to simulating movement is predictive rather than descriptive; that is, the body-segmental motions, ground-reaction forces, and muscle excitation patterns are all calculated rather than assumed. Previous studies have simulated movement by forcing the model to track measurements of the time histories of body-segmental displacements and velocities [Davy and Audu, 1987; Yamaguchi and Zajac, 1990]. In this study, the body-segmental motions, ground-reaction forces, and muscle excitation patterns were all predicted given only the states of the model at the beginning and/or end of the simulated movement. This is an important aspect of our approach, because it guarantees that the muscle-force histories in the model are determined almost entirely by the performance criterion used to model the goal of the motor task.
For walking, in particular, the fact that the predicted kinematics, ground-reaction forces, and muscle coordination patterns all agree closely with those obtained from experiment supports the use of minimum metabolic energy as a measure of performance. In summary, the feasibility of performing realistic simulations of movement depends on a number of factors. First, a robust computational algorithm is needed to handle nonlinear problems characterized by a large number of input controls [Pandy et al., 1992]. Second, high-performance, parallel supercomputers are needed to converge to the optimal solution in a reasonable amount of time [Anderson et al., 1995]. Third, a detailed and computationally efficient model of the foot is needed to adequately simulate repetitive contact of the feet with the ground. Lastly, very fast computer-graphics workstations (e.g. SGI Onyx) must be used, so that the simulation results may be visualized in real time while a good initial guess to the solution is being sought. The work reported here lays the foundation for us to describe and explain the relationships between the mechanics and energetics of locomotion at a much deeper level than has been possible to date. A large number of experiments have described the variation in kinematics, kinetics, and metabolic cost with walking speed [Murray et al., 1966; Grieve, 1968; Margaria, 1976; Ralston, 1976; Chen et al., 1997]. The results of these studies can be used to evaluate the response of the model when it is used to simulate humans walking on level ground at speeds in the range 3–9 km/h. Records of changes in kinematics, kinetics, and metabolic cost as humans walk up and down inclines [Margaria, 1976; Inman et al., 1981] can also be used to evaluate the response of the model under conditions of varying external load. Because the model gives detailed information about the forces developed by the leg muscles during movement, analyses of the simulation results will allow (1) more detailed descriptions to be given of the relationships between metabolic energy consumption and muscle force [Taylor et al., 1980], and (2) rigorous testing of the hypothesis that the metabolic cost of locomotion is determined mainly by the cost of generating muscle force [Taylor, 1994].
Relevance of Modeling and Optimization to Speech
‘Why’ is perhaps the most difficult question in biology to answer. Why do we move our limbs in a characteristic manner during walking? And why do our vocal tracts move in a certain way during speech? Experiments alone cannot produce unequivocal answers to such questions. To answer the question for walking, the forces developed by the leg muscles must be measured, and these data must then be correlated with independent measures of muscle energy consumption, joint-reaction forces, etc. Invasive methods for quantifying muscle force, such as instrumented buckle transducers, cannot be used on living people, and, unfortunately, noninvasive methods such as electromyography neither provide the quantitative accuracy needed nor permit access to all the muscles of interest. Computational modeling combined with optimization theory offers a powerful alternative for determining muscle forces in the body, and there is every reason to believe that this approach can also be used to study the mechanics and energetics of speech. Physiological models of speech production have been developed and reported in the literature
[e.g. Laboissiere et al., 1995; Willhelms-Tricarico and Perkell, 1995], but, to our knowledge, these models have not yet been placed in the context of dynamic optimization so that interactions between vocal tract movements and speech production may be explained.
It is tempting to suggest that all cyclic body movements, including those for speech, might be produced in the interests of minimizing metabolic energy consumption [Lindblom, 1999]. Although, as explained above, this hypothesis cannot be tested rigorously in the context of a physiological experiment, one may proceed for speech in the same way as we have done for walking. A dynamic optimization problem may be formulated based on the hypothesis that muscle metabolic energy is minimized during normal, steady-state speech. Muscle metabolic energy may be estimated by summing the heat produced and the work done by all the muscles that move the vocal tract. Should the optimization solution predict jaw movements at the same frequency as is observed during normal, steady-state speech (roughly 5 Hz), then the minimum-energy hypothesis would be supported, but not proven. A more compelling case could be made only if the model predictions of energy cost compared favorably with oxygen consumption measurements made on people [Lindblom et al., 1999]. Increasing or decreasing the frequency at which we speak presumably scales energy cost in proportion, so frequency is a parameter that can be altered to test the model further. Quantitative agreement between model and experiment over a wide range of speaking frequencies would demonstrate the model’s ability to simulate the energetics of speech production for unconstrained movements of the vocal tract. Because the model simulations also provide information about muscle force, explanations could then be given for differences between hypo- and hyperspeech not only in terms of energy cost, but also in terms of vocal tract mechanics.
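As a toy illustration of how such a test might look (this is not the authors' model; the cost structure and every parameter below are invented), one can assign each candidate syllable rate a cost per syllable consisting of a movement-related term that grows with rate plus a share of basal expenditure, and ask where the minimum falls:

```python
# Toy illustration only: invented cost structure and coefficients, chosen so
# that the minimum-cost syllable rate falls near 5 Hz.
import numpy as np

a = 0.02       # J*s: movement-related cost per syllable per Hz (invented)
basal = 0.5    # W: basal metabolic power (invented)

f = np.linspace(1.0, 12.0, 1000)              # candidate syllable rates (Hz)
cost_per_syllable = a * f + basal / f         # dissipation + share of basal cost

print(f"minimum near {f[np.argmin(cost_per_syllable)]:.1f} Hz")
```

With these invented coefficients the minimum falls at 5 Hz; the substantive question, as the text notes, is whether a physiologically grounded version of this calculation, using real muscle heat rates and work terms, would predict the same rate.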
Acknowledgments
Supported by the Whitaker Foundation, NASA Grant NAG5-2217, NASA-Ames Research Center, and the University of Texas Center for High Performance Computing.
References
Anderson, F.C.: A dynamic optimization solution for one cycle of normal gait; PhD diss., University of Texas at Austin, Austin (1999).
Anderson, F.C.; Pandy, M.G.: A dynamic optimization solution for vertical jumping in three dimensions. Comput. Meth. Biomech. Biomed. Engng 2: 201–231 (1999).
Anderson, F.C.; Ziegler, J.M.; Pandy, M.G.; Whalen, R.T.: Application of high-performance computing to numerical simulation of human movement. J. Biomech. Engng 117: 155–157 (1995).
Anderson, F.C.; Ziegler, J.M.; Pandy, M.G.; Whalen, R.T.: Solving large-scale optimal control problems for human movement using supercomputers; in Witten, Vincent, Building a man in the machine: computational medicine, public health, and biotechnology. Part II, pp. 1088–1118 (World Scientific, 1996).
Burdett, R.G.; Skrinar, G.S.; Simon, S.R.: Comparison of mechanical work and metabolic energy consumption during normal gait. J. Orthop. Res. 1: 63–72 (1983).
Chen, I.H.; Kuo, K.N.; Andriacchi, T.P.: The influence of walking speed on mechanical joint power during gait. Gait and Posture 6: 171–176 (1997).
Davy, D.T.; Audu, M.L.: A dynamic optimization technique for predicting muscle forces in the swing phase of gait. J. Biomech. 20: 187–201 (1987).
Friederich, J.A.; Brand, R.A.: Muscle fiber architecture in the human lower limb. J. Biomech. 23: 91–95 (1990).
Grieve, D.W.: Gait patterns and the speed of walking. Biomed. Engng 3: 119–122 (1968).
Hatze, H.: The complete optimization of a human motion. Math. Biosci. 28: 99–135 (1976).
Hill, A.V.: The efficiency of mechanical power development during muscular shortening and its relation to load. Proc. R. Soc. (Lond.), Ser. B 159: 297–318 (1964).
Inman, V.T.: The joints of the ankle (Williams and Wilkins, Baltimore 1976).
Inman, V.T.; Ralston, H.J.; Todd, F.: Human walking (Williams and Wilkins, Baltimore 1981).
Laboissiere, R.; Ostry, D.J.; Perrier, P.: A model of human jaw and hyoid motion and its implications for speech production; in Elenius, Branderud, Proc. 13th Int. Congr. Phonet. Sci., vol. 2, pp. 60–67 (Stockholm University Press, Stockholm 1995).
Lindblom, B.: How do children find the ‘hidden structure’ of speech? (Abstract) Speech Commun. Lang. Dev. Symp. (Stockholm University Press, Stockholm 1999).
Lindblom, B.; Davis, J.; Brownlee, S.; Moon, S.-J.; Simpson, Z.: Energetics in phonetics and phonology; in Fujimura et al., Linguistics and phonetics (Ohio State University Press, Columbus 1999).
Margaria, R.: Biomechanics and energetics of muscular exercise (Clarendon Press, Oxford 1976).
Murray, M.P.; Kory, R.C.; Clarkson, B.H.; Sepic, S.B.: Comparison of free and fast walking patterns of normal men. Am. J. Phys. Med. 45: 8–24 (1966).
Pandy, M.G.; Anderson, F.C.; Hull, D.G.: A parameter optimization approach for the optimal control of large-scale musculoskeletal systems. J. Biomech. Engng 114: 450–460 (1992).
Pandy, M.G.; Zajac, F.E.; Sim, E.; Levine, W.S.: An optimal control model for maximum-height human jumping. J. Biomech. 23: 1185–1198 (1990).
Ralston, H.J.: Energetics of human walking; in Herman, Grillner, Stein, Stuart, Neural control of locomotion, pp. 77–98 (Plenum Press, New York 1976).
Tashman, S.; Zajac, F.E.; Perkash, I.: Modeling and simulation of paraplegic ambulation in a reciprocating gait orthosis. J. Biomech. Engng 117: 300–308 (1995).
Taylor, C.R.: Relating mechanics and energetics during exercise. Adv. Vet. Sci. Comp. Med. 38A: 181–215 (1994).
Taylor, C.R.; Heglund, N.C.; McMahon, T.A.; Looney, T.R.: Energetic cost of generating muscular force during running: a comparison of large and small animals. J. Exp. Biol. 86: 9–18 (1980).
Willhelms-Tricarico, R.; Perkell, J.S.: Towards a physiological model of speech production; in Elenius, Branderud, Proc. 13th Int. Congr. Phonet. Sci., vol. 2, pp. 68–75 (Stockholm University Press, Stockholm 1995).
Woledge, R.C.; Curtin, N.A.; Homsher, E.L.: Energetic aspects of muscle contraction (Academic Press, 1985).
Yamaguchi, G.T.; Zajac, F.E.: Restoring unassisted natural gait to paraplegics via functional neuromuscular stimulation: a computer simulation study. IEEE Trans. Biomed. Engng 37: 886–902 (1990).
Zajac, F.E.: Muscle and tendon: properties, models, scaling, and application to biomechanics and motor control. CRC Crit. Rev. Biomed. Engng 19: 359–411 (1989).
En Route to Adult Spoken Language. Language Development Phonetica 2000;57:229–241
Received: November 26, 1999 Accepted: May 9, 2000
An Embodiment Perspective on the Acquisition of Speech Perception
Barbara L. Davis a, Peter F. MacNeilage b
a Department of Communication Sciences and Disorders and b Department of Psychology, University of Texas at Austin, Tex., USA
Abstract
Understanding the potential relationships between perception and production is crucial to explanation of the nature of early speech acquisition. The ‘embodiment’ perspective suggests that mental activity in general cannot be understood outside of the context of body activities. Indeed, universal motor factors seem to be more responsible for the distribution of early production preferences regarding consonant place and manner, and use of the vowel space, than the often considerable cross-language differences in input available to the perceptual system. However, there is evidence for a perceptual basis to the establishment of a language-appropriate balance of oral-to-nasal output by the beginning of babbling, illustrating the necessary contribution of ‘extrinsic’ perceptual information to acquisition. In terms of representations, the assumption that segmental units underlie either perception or production in early phases of acquisition may be inappropriate. Our work on production has shown that the dominant early organizational structure is a relatively unitary open-close ‘frame’ produced by mandibular oscillation. Consideration of the role of ‘intrinsic’ (self-produced) perceptual information suggests that this frame may be an important basis for perceptual as well as production organization. Copyright © 2000 S. Karger AG, Basel
Bjorn Lindblom has been a continuing source of inspiration for the quality of his scholarly work on fundamental issues in the field of phonetics, including generation of crucial questions for the understanding of speech acquisition. One of his basic themes is the necessity for consideration of the relationship between perception and production for full understanding of speech [e.g. Lindblom, 1992]. In contrast to Lindblom’s comprehensive focus, investigators of early speech acquisition have studied either production patterns or perceptual abilities, rarely integrating information from both perspectives. Following the lead of Lindblom, we will attempt to consider some implications of current knowledge of babbling and early speech production for conceptions of early perceptual organization.
The Embodiment Perspective
In considering the relationship between early speech perception and production, we adopt a perspective rapidly growing in importance in the human sciences, the perspective of ‘embodiment’. The general thesis of embodiment approaches is that mental activity and underlying brain activity cannot be understood outside of the context of bodily activities. Prominent representatives of the embodiment perspective are found in philosophy [Johnson, 1987], linguistics [Lakoff, 1987; see also Lakoff and Johnson, 1980, 1999], cognitive science [Varela et al., 1991], dynamic systems theory [Thelen and Smith, 1994], neuroscience [Damasio, 1994; Edelman, 1992], and philosophy and artificial intelligence [Clark, 1997]. Clark [1997, p. 1] suggests that ‘minds evolved to make things happen’ and ‘minds make motions’. In his view, ‘The immediate products of much of perception are not neutral descriptions of the world so much as activity-bound specifications of potential modes of action’ [Clark, 1997, p. 50]. Extrinsic feedback (i.e. feedback from the environment), however, forms an additional supportive scaffolding, whereby the infant is seen as ‘piggybacking’ on reliable environmental properties of speech production present in adult speakers in the ambient speech community [Clark, 1997]. This embodiment perspective has arisen largely in reaction to the mind-body dichotomy of Cartesian philosophy. Within psychology, one ramification of the Cartesian zeitgeist has been an emphasis on knowledge of the world independent of an organism’s actions. The subareas of sensation, perception, learning and cognition have all focussed on input and mental operations on input. In the study of infant speech perception, the initial question was basically Cartesian, even though not explicitly stated as such: do infants have a priori (innate) mental categories for sounds [e.g. Eimas et al., 1971]? Subsequent work has continued to focus on development of mental categories based on adult models of discrete categories, typically the categories proposed by linguists [e.g. see Jusczyk, 1997, for a review].
Perception-Production Links
The concept of categorical perception, on which the original Eimas et al. [1971] study as well as subsequent studies were based, was defined in terms of a finite set of necessary and sufficient properties, as were the classical Platonic essences. Two key properties were operationalized by these experiments: (a) Identification. A category existed if a number of different stimulus variants could be given the same category label. Conceptually, a stimulus was considered either to belong to a category or not. (b) Discrimination. The ability to discriminate between unlike stimuli was predicted from identification. The idealization was that stimuli could be discriminated only to the degree that they could be identified as belonging to different categories. These paradigms showed that speech sounds in the world’s languages differ in ways that infants, who have to eventually use these differences, can detect. However, the link of these results from perception studies to what infants can do in terms of language-related action has rarely been considered [though see Studdert-Kennedy, 1987; Vihman and DePaolis, in press; Vihman and Velleman, this vol.]. The motor theory of speech perception is also an apparent exception to the lack of attention to the relation between perception and action. However, in practice, the motor theory was also an example of Platonic essences.
The proposed motor categories mediating perception were deemed innate (i.e. given a priori), but never satisfactorily defined [Liberman and Mattingly, 1985]. From another perspective, the studies by Kluender et al. [1987] of animal categorization suggest an alternative view of categorization and discrimination based on basic auditory system characteristics rather than a priori linguistic categories. Animal studies, however, do not address the issue of the ways in which perceptual categorization in human infants is related to the development of speech production skill. The classical approach to speech perception in human infants does not acknowledge that, in the most general sense, speech perception is in the service of speech production. The goal of the normal hearer is universally to become a speaker. Infants must perceive the language of those who surround them in order to be able to model their speech on the ambient environment. Consequently, their perceptual organization must be in a form that is accessible to the output mechanism. In other words, the demands of speech production must place constraints on perceptual organization, simply because there is a necessary interface between them. As Karl Lashley [1951, p. 22] noted: ‘speaking and listening have too much in common to depend upon entirely different processes.’ In this regard, work by Rizzolatti et al. [1997] and Rizzolatti and Arbib [1998] proposing an observation/execution matching system, based on studies of mirror neurons in the premotor cortex of monkeys, provides an interesting platform for studying the range of relationships between perception and production in infant speech development. In the experiments of Rizzolatti et al. [1997], identical or ‘mirror’ neurons fire both when the monkey acts to grasp food and when the monkey observes an experimenter grasping food in the same manner, suggesting a tight linkage between action and perception whereby motor perception and representation emerge from the actions of the individual. Experimental research on the acquisition of speech perception has been largely concerned with perception of external input from others. However, there is a second requirement that production places on perception, which is consistent with the embodiment perspective. Infants must be able to hear and evaluate their own vocalizations in relation to the movements made to produce them, to begin adjusting initially incorrect production attempts to better simulate the speech forms of the external environment. We will label the auditory input from others ‘extrinsic’ and the auditory input from oneself ‘intrinsic’. The matching operation between extrinsic and intrinsic inputs also places constraints on the organization of speech perception. From this perspective, the development of speech production can be thought of as involving a progression towards matches between extrinsic and intrinsic perceptual inputs. The main goal of this paper is to consider what output patterns in babbling and early speech tell us about the acquisition of perceptual capabilities. With respect to extrinsic input, the question is the extent to which perceptual factors might explain how favored infant production patterns in babbling and early speech are related to input characteristics of the ambient language. With respect to intrinsic input, we ask what perceptual organization would be like if it were strongly influenced by the characteristics of the child’s own behavioral repertoire.
However, before we address these questions we will briefly review some aspects of our current knowledge of infant speech perception germane to our later discussion of production-perception relationships.
Aspects of the Acquisition of Speech Perception
The basic experimental paradigm for speech perception studies is to ask whether an infant is capable of successfully performing a narrowly defined task in a highly controlled and restricted experimental situation. The identification and discrimination results of Eimas et al. [1971] led Kuhl [1987, p. 311] to assert that ‘the number and diversity of contrasts that have been successfully discriminated by infants has led to the conclusion that infants under the age of 4 months can discriminate many, perhaps all, of the phonetic distinctions relevant in English’. Discrimination is only one indicant of an infant’s ability to form categories. Identification studies showed that very young infants demonstrate equivalence for vowels across talkers of different sex and age, for consonants across vocalic contexts, for features occurring in different segments, and for prosodic features [see Jusczyk, 1997, for a review]. These results led Kuhl [1987, p. 351] to further conclude that ‘by 6 months of age, infants appear to be natural categorizers’. Language-specific effects whereby infants are more efficient in discrimination tasks involving native language than nonnative phonetic categories by 9–10 months have also been established [Jusczyk et al., 1993; Werker and Tees, 1984]. Some role for learning [Gerken, 1994] has been proposed in studies showing later acquisition of fricative contrasts [e.g. Eilers and Minifie, 1975; Velleman, 1988] as well as sharpening of phoneme boundaries for voice onset time in Spanish- compared to English-learning infants [e.g. Macken and Barton, 1980]. Investigators have also focussed on the role of prosodic characteristics [e.g. Jusczyk et al., 1993], pauses at syllable, phrase, and utterance boundaries [e.g. Gleitman and Wanner, 1982; Hirsh-Pasek et al., 1987; Morgan and Saffran, 1995], and phonotactics [e.g. Friederici and Wessels, 1993].

In another line of perception research, older infants, from 10 to 35 months of age, have been taught to designate referents of newly taught or known words [e.g. Barton, 1980; Edwards, 1974; Shvachkin, 1973]. These studies reveal that some perceptual contrasts, such as vowels and consonant manner dimensions, are much more accurately discriminated than others. It is important in the present context that the child’s ability to produce the sounds involved in these discrimination tasks appears to influence accuracy [Gerken, 1994]. The decontextualized categorical perception abilities demonstrated in the laboratory in the first 6 months do not clearly carry over to infants’ responses to experimental learning tasks in the early word period. Relative to early categorical perception studies, studies utilizing words add the additional burden of referential access to the child’s discrimination task.

Current research on distributional properties of language [e.g. Aslin et al., 1998] focuses on the child’s computational modeling of segmental language probabilities in the input after brief exposure to corpora of artificial languages. These studies are concerned with the effect of input on the nature of children’s response to learning tasks involving sound sequences. Infants as young as 8 months were found to utilize probabilistic information about input sequences in segmentation of the ambient input following brief exposure [Aslin et al., 1998]. No reference is made to the infant’s action capacities or to potential input-output links.
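To make the computational mechanism concrete, a minimal sketch of transitional-probability segmentation follows. It illustrates the general idea only: the toy corpus, syllable inventory, and boundary threshold are hypothetical, and are not materials or procedures from Aslin et al. [1998].

from collections import Counter

def transitional_probabilities(syllables):
    """Estimate P(next syllable | current syllable) from bigram counts."""
    firsts = Counter(syllables[:-1])  # count only syllables that have a successor
    bigrams = Counter(zip(syllables, syllables[1:]))
    return {pair: n / firsts[pair[0]] for pair, n in bigrams.items()}

def segment(syllables, threshold=0.75):
    """Posit a word boundary wherever transitional probability dips below threshold."""
    tp = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for prev, nxt in zip(syllables, syllables[1:]):
        if tp[(prev, nxt)] < threshold:
            words.append("".join(current))
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words

# A toy 'language': the words badoti, gola and bupa concatenated without pauses.
stream = "ba do ti go la bu pa go la ba do ti bu pa ba do ti go la".split()
print(segment(stream))  # within-word TPs are high; TPs dip at word boundaries

On this toy stream the within-word transitional probabilities are 1.0 while the boundary probabilities fall to 0.67 or lower, so the segmenter recovers the three ‘words’ from distributional statistics alone, with no cue from the infant’s action capacities.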
Some observations about this research on infant speech perception need to be made before considering the implications of speech production for the nature of perception. The earliest studies focussed on segmental and subsegmental categories and on the speech-specific innateness of those categories.
Subsequent studies showing that many of these effects can be obtained with comparable nonspeech stimuli and in experiments with other animals have cast doubt on the assumption of innateness [e.g. Kluender, 1998]. Later studies have been concerned more with holistic aspects of infant perception [e.g. Stager and Werker, 1997], aspects that may be more relevant to the formation of early sound-meaning relationships, and with the infant’s capacity to perceive properties of a particular ambient language. One problem with this work is that the characteristic paradigms are capable of showing what an infant can do in a highly restricted laboratory context that is not necessarily germane to what infants actually do when listening to speech in natural situations. More importantly, this body of research tells us very little about how perceptual abilities relate to output in babbling and the early speech period.
Babbling and Early Speech Production
Turning to the question of the relationship of babbling and early speech production to early speech perception, we focus on four well-established preferences in early speech: (1) consonant manner: complete occlusion; (2) consonant place: anterior (labial and coronal); (3) vowel space used: lower left quadrant, and (4) oral production mode.

Occlusive Manner of Articulation
The oral stop mode of production, requiring complete vocal tract occlusion, is the most frequent mode of consonant production in babbling and early speech [e.g. Vihman et al., 1986]. Salient perceptual signals emerging from this occlusive movement, whether produced by the infant or others, are silence during closure or a low amplitude signal if voiced, followed by a burst release and optional aspiration, with formant transitions associated with surrounding vowels [Kent, 1997]. In contrast to the stop consonant manner of articulation, fricative, liquid and affricate manners of articulation are uniformly rare in babbling and first words [e.g. Davis and MacNeilage, 1995; Roug et al., 1989; Stoel-Gammon, 1985; Vihman et al., 1985]. A recent study [Gildersleeve-Neumann et al., in press] found a 7.6% frequency of occurrence in babbling (7–12 months) for these sound types in 4 English-speaking infants. Subsequent analysis of first words in 10 infants, including the 4 in the Gildersleeve-Neumann et al. [in press] study, showed only an 8.3% frequency during the single-word period at roughly 12–20 months. More generally, Locke [1983], in a survey of studies of first words from 15 language environments, found that no English fricatives, liquids or affricates occurred in the top 10 most frequent consonants in any of the first-word corpora analyzed. These estimates changed very little when Locke [1983] corrected for presence of the sound in the ambient language environment.

Although fricatives and liquids are rare in babbling and in early speech, they occur frequently in languages [Maddieson, 1984]. Together with affricates, they constitute 41% of English consonants [Kent, 1994]. Broadband noise forms the acoustic signature for fricatives, with a difference among subclasses in the spectrum of the noise component. Affricates combine the properties of an oral stop consonant with a following fricative noise component. Fricatives, affricates and liquids are all signaled perceptually by direction and frequency of the second formant transition from the place of the constriction to the configuration for the contiguous segment. Available evidence on speech
perception does not suggest obvious reasons why infants could not access perceptual representation of these nonfavored sounds just as successfully as the favored occlusives. However, with regard to the favored sounds, Stevens and Keyser [1989] suggest the possibility that there is an auditory-based preference for the voiceless unaspirated stops of the kind most commonly encountered in babbling. They propose a synergy between intrinsic production system effects and extrinsic perceptual effects. They note ‘an abrupt onset of energy over a range of frequencies preceded by an interval of silence or of low amplitude. This acoustic property leads to a distinctive response in the auditory system’ [Stevens and Keyser, 1989, p. 85].

Examination of production requirements gives ample evidence as to why fricatives, affricates, and liquids might be hard to produce. For fricatives and the fricative component of affricates, a small, carefully calibrated aperture sufficient to create a differential pressure and resultant friction is required. For affricates, a total occlusion must be followed rapidly by fricative production within the time constraints for a single phoneme. The tongue is especially involved in liquid production. For /r/ the tongue tip curls backward or bunches within the vocal tract to produce the necessary acoustic specifications. In producing /l/, the tongue must both occlude at the midline and simultaneously produce one or two apertures in the lateral dimension to allow airflow. While production requirements appear complex, no objective index of articulatory difficulty is available [although see Willerman, 1994, for an attempt to index articulatory difficulty]. However, production mastery for these manners of articulation typically comes later than for corresponding stop consonants at the same place of articulation, sometimes even years later [e.g. Sander, 1972; Wellman et al., 1931]. In addition, both normally developing children and children with articulation delay often produce substitutions of oral stops at the same place of articulation as the requisite fricative, liquid or affricate [described as ‘stopping’, Ingram, 1976], indicating some perceptual awareness of the nature of the sound required. Thus, while the acoustic perceptual signature for fricatives, affricates, and liquids in the ambient input does not seem uniquely difficult to access compared to stops, production requirements seem to be greater. In addition, sound substitutions in both normal and delayed development appear to indicate some awareness of the perceptual requirements for fricative, affricate, and liquid productions even in the presence of scarcity of these sound types in the production repertoire.

Anterior Place of Articulation
Relatively anterior lip (labial) and tongue tip (coronal) closures are consistently described as more frequent than posterior tongue back closures (dorsals) for both nasal and oral stops in babbling and early speech [see Vihman, 1996, for a review]. This more frequent use of labials and coronals is confirmed in normative age-of-acquisition studies of large groups of young children learning English generally [e.g. Smit et al., 1990] as well as across languages [Locke, 1983]. However, in languages, dorsals are also strongly favored [Maddieson, 1984]. In English, 24% of stops are dorsals compared to 22% for labials and 54% for coronals [Mines et al., 1978]. Perceptually, place of articulation (i.e.
labial, coronal or dorsal in English) for oral or nasal stops is signaled primarily by second formant transition from the preceding vowel to the place of occlusion of the stop, with another transition to the following vowel [Kent, 1997]. There are no obvious perceptual reasons why labials and coronals are favored over dorsals in babbling and early speech. However, closed lips and an anterior position of the tongue in the mouth – the articulatory configurations for labials
and coronals – are characteristic configurations of the oral apparatus for vegetative purposes. Anterior tongue positioning provides a platform for sucking, licking, and chewing operations. Lip closure is a passive accompaniment of the mandibular oscillation of babbling and early speech. In swallowing, a vegetative operation, dorsal elevation is embedded in a serial reflexive operation involving many simultaneous oral configurations; in contrast, tongue tip elevation may occur as an individual movement, and lip closure may occur as a passive accompaniment to jaw closure.

Vowels in the Lower Left Quadrant
In studies of babbling, vowels are described as being produced in the lower left quadrant of the vowel space [e.g. Bickley, 1983; Buhr, 1980; Cruttenden, 1970; Lieberman, 1980; see MacNeilage and Davis, 1990, for a summary]. In the first-word period, lower left quadrant vowels are most frequent as well. Use of high vowels in absolute final word position is one of the first signs of systematic diversification in first words, even though vowels in the lower left quadrant are still favored. Acoustic studies of children’s productions in this period [e.g. Bickley, 1983; Hodge, 1989] indicate earlier control over jaw opening than tongue anterior-posterior positioning for vowel production, consistent with height expansion being earlier than tongue front-back expansion for diversification in first words. In transcription-based studies of 6 infants during babbling [Davis and MacNeilage, 1995] and 10 infants in the first-word period [Davis et al., in revision], significantly more height than front-back changes were found for vowels, suggesting predominance of jaw movement changes over tongue changes in the front-back axis within sequences.

There is no speech perception research that suggests infants should have a greater ability to produce certain specific vowel qualities. Kuhl [e.g. 1983] has provided evidence that by 6 months of age infants have vowel categories sufficiently well structured to allow them to distinguish between vowels within categories in terms of goodness of fit (a conceptualization that goes beyond the classical essentialist notion of categories). There is presently no evidence that infants should have more difficulty forming such categories for some vowels than for others.

Infant preferences for consonant manner and place of articulation and use of the vowel space in babbling and early speech seem to be a result of output factors rather than differences in the ability to form perceptual representations of favored versus nonfavored sounds. However, it is important to note that the consonants and vowels concerned are not being produced as separable segments even though our review might, by default, give that impression. Even though perceptual studies have often addressed the question of infant segmental and subsegmental categories, it is not clear that these categories are used in a discrete way in listening under natural circumstances, or as perceptual representations underlying aspects of production. What we currently know about early perceptual organization is not adequate to allow explanation of the way that early sound production preferences deviate from those expected on the basis of frequency of extrinsic input of the sounds in the ambient language.

Oral versus Nasal Production Modes
Predominant use of an oral rather than a nasal production mode is characteristic of infant vocalization by the onset of canonical babbling (at 7 months of age).
Davis and MacNeilage [1995], in a study of canonical babbling in 6 infants in an English-speaking environment, found a 3:1 ratio of oral:nasal consonants, and this ratio was maintained
in the first-word period. Locke’s [1983] survey of data on babbling and first words in 15 languages revealed a 7:1 ratio of oral:nasal consonants overall. In babbling, and likely in early speech, nasal consonants co-occur with nasal vowels. In an acoustic study of canonical babbling, Matyear et al. [1997] showed that vowels in nasal consonantal environments are heavily nasalized. Thus, the amount of oral versus nasal consonant use in babbling infants may be taken as an indicant of general oral or nasal quality. Oral and nasal consonants tend to be produced across whole utterances as well. Redford et al. [1997], in an acoustical study, showed that in canonical babbling the consonant preceding and/or following a consonant of one type (oral or nasal) tends strongly to be of the same type.

An oral production mode is typical of languages also. In English, the proportion of oral:nasal consonants is approximately 6:1 [Mines et al., 1978]. More generally, the proportion of oral to nasal consonants is uniformly high across languages [Maddieson, 1984]. Nasal vowels are also relatively infrequent, and within a language there are usually fewer nasal vowels than oral vowels [Hawkins and Stevens, 1985]. The relative avoidance of nasals is generally considered to be due to their relative lack of perceptual distinctiveness [e.g. Lieberman, 1984]. Hura et al. [1992] have shown that nasal consonants in intervocalic clusters are more perceptually confusable than fricatives. Nasal vowels result in a concentration of energy in the middle of the first formant range, thus reducing the capacity of the first formant to signal vowel contrasts [Hawkins and Stevens, 1985].

In contrast to hearing infants, severe to profoundly hearing-impaired infants produce a far higher proportion of nasal consonants and vowels in vocal output at all ages [e.g. Lach et al., 1970; Stoel-Gammon, 1988]. A recent case study of a profoundly hearing-impaired infant who received a multichannel cochlear implant [MacCaffrey et al., in press] showed that 80% of all consonants were nasal in the preimplant period at 20–24 months of age, a proportion that decreased dramatically by 7 months postimplant. These results suggest that the developing infant needs at least minimal access to auditory information in order to simulate the oral production mode of adult speakers.

Maintenance of an oral production mode across utterances requires departure from the resting position of the soft palate, which must be elevated by the levator palatini muscle in order to close the velopharyngeal port. Judging by the low frequencies of nasal consonants in babbling, hearing infants can effect this active closure of the velopharyngeal port by the onset of canonical babbling rather than maintaining the resting position, which would result in pervasive nasality. Lacking both extrinsic and intrinsic information, hearing-impaired infants do not consistently make this transition. Thus, in the case of the oral-nasal distinction, perceptual access to the incoming auditory signal appears necessary to allow discovery and simulation of this very basic contrast in adult speech. If normal access to extrinsic and intrinsic information is available, an infant seems capable of achieving the low frequency of nasality in the ambient language by 7 months. These results suggest that, in contrast to a secondary role in consonant manner and place and vowel space use, perception plays a necessary and primary role in establishing a predominately oral production mode.
This appears to be an important function of perception not previously emphasized. These results have potential implications for speech evolution as well. They suggest that the bias towards an oral speech mode is not innate, in the sense of developing independently of experience. Evolving languages may have placed a constraint against heavy use of the nasal parameter because of its negative effects on perceptual distinctiveness. In contrast to use of the manner and place
dimensions of consonants and the use of the vowel space, which are apparently subject to relatively severe early motor constraints, a hearing infant evidently has the ability to translate perceptual representation into action in restricting nasality.
Intrinsic (Self-Generated) Information and Early Perceptual Organization
These results raise an important question from the embodiment perspective. At the onset of word use, what might be the nature of the infant’s perceptual representation in cases where he/she can correctly produce the sound(s) of a word (labial and coronal stops, lower left quadrant vowels, the nasal–oral distinction), and where he/she cannot (dorsal stops, fricatives and liquids, high or back vowels)? The question involves the relation between extrinsic and intrinsic input. Consider the situation when an infant can correctly produce a sound or a word. We assume that the infant has developed some capacity to monitor his/her attempts as well as ability to correct the production of previously incorrect attempts. The confirmatory perceptual response from intrinsic input, linked with extrinsic confirmation from the child’s perception of environmental input, may contribute to a more well-developed perceptual representation than either type of input alone. But it could also be argued that in the situation where the infant cannot correctly produce the sound(s) of a word, the contrast perceived by the self-monitoring mechanism between extrinsic models and the infant’s intrinsic monitoring of his/her attempt might serve to improve both the production and the perceptual representation. Thus the relations between perception and production can be viewed as mutually supportive, based on the infant’s action capacities, rather than ‘either-or’.

However, this discussion is oversimplified, as it does not address a problem for both infant perception and production: the a priori assumption of units. Infants at the first-word stage have demonstrated problems in performing minimal pair discriminations between words. In contrast, younger infants are capable of making the same discriminations. Gerken [1994] asserts that early words may be both recognized and produced relatively holistically. Our main finding regarding acquisition of speech production is the generality of intrasyllabic co-occurrence constraints on consonants and vowels. We have interpreted this trend towards relatively holistic production in terms of the dominance of the ‘frame’ contributed by mandibular oscillation in babbling and early speech [MacNeilage and Davis, this vol.]. Mandibular oscillation produces the close-open alternation responsible for consonant-vowel (CV) alternations. Three CV co-occurrence patterns found in studies of babbling and early speech, i.e. coronal consonants – front vowels, dorsal consonants – back vowels, and labial consonants – central vowels, are evidence of a limited role of articulators other than the mandible in this period. Although we considered stops and vowels separately in the preceding section, our results suggest that they are not separately controlled in production. To the degree that embodiment is important in speech perception, they may not have a separate perceptual representation either.

What assumptions can be made about intrinsic contributions to the perceptual organization associated with these CV co-occurrence patterns? Their high frequency of occurrence may not be primarily a result of perception of ambient language patterns, even though adult languages tend to have the same patterns [MacNeilage et al., 2000]. Importantly, in this regard, they occur consistently from the onset of babbling [Davis
and MacNeilage, 1995]. Thus, there is intrinsic input for these patterns, since every time the infant produces them he/she hears them. As a consequence, a perceptual task for the infant in correctly producing a word with a favored CV co-occurrence pattern is not to generate a correct perceptual representation, but to match an already existing intrinsic CV pattern with extrinsic CV patterns from the environment.

Consider now, in more detail, the possible intrinsic contribution to perception in the infant. Consistent with the embodiment perspective, it is necessary to consider input arising from the speaker’s own movements. The production of frames is seen as a predominant characteristic of serial organization of babbling and early words. While ‘pure’ frames (labial-central CVs) involve jaw movement alone, ‘fronted’ or ‘backed’ frames (coronal-front and dorsal-back CVs) may involve placing the tongue in the front or the back of the mouth before the utterance begins [MacNeilage and Davis, 1990]. The intrinsic perceptual information related to the basic close-open alternation arises directly from the mandibular movement cycles themselves, and is time-locked to them. This association might suggest a logical perception-production relation based on linking produced forms with the nature of the emerging perceptual representation. For pure frames, the only active movement may be the mandibular cycle itself, as the lips and tongue are both passive in their resting positions. Consequently the acoustic alternation between consonants and vowels is directly related to the mandibular movements. For fronted and backed frames, the perception of the close-open alternation will, as in the case of pure frames, be time-locked to the mandibular oscillation. No additional movement-related information will be specifically associated with the time-varying cues for the coronal consonant-front vowel or dorsal consonant-back vowel associations. The only motor event is the static isometric contraction holding the tongue in the fronted or backed position, respectively, as there is no change within the utterance. How, then, does the infant learn not to position the tongue before the syllable begins? In subsequent developmental phases, where the infant begins to develop mastery of autonomous tongue front-back movements as well as perception of a growing number of lexical types, both the increase in production capabilities and the increase in perception of environmental events could be responsible for the eventual dissociation of the CV co-occurrence patterns which dominate in babbling and in the first-word period.

At the intersyllabic level, another situation exists for the tongue when an active movement is associated with the second syllable of an utterance (e.g. [bado] for ‘bottle’). As the utterance is already begun when the movement towards the alveolar ridge for [d] is made, the infant will hear acoustic changes concurrent with and closely related to the active tongue movement, although mandibular movement is also contributing to the perceived pattern. This changing information may contribute valuable intrinsic input for perceptual mastery of tongue movement consequences necessary to produce complex speech sequences. This information may be infrequently available during the babbling and early speech period, because most tongue movements may be made before phonation begins in the case of reduplicated utterances, the predominant utterance type in both periods.
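The strength of such CV co-occurrence preferences can be quantified as a ratio of observed to expected co-occurrence frequencies, the expected frequency assuming that consonant place and vowel place combine independently [cf. Davis and MacNeilage, 1995]. A minimal sketch of the calculation follows; the counts are hypothetical illustrations, not data from the studies cited.

from collections import Counter

# Hypothetical transcribed CV syllables, coded as (consonant place, vowel place).
syllables = ([("coronal", "front")] * 30 + [("coronal", "central")] * 10
             + [("labial", "central")] * 25 + [("labial", "front")] * 8
             + [("dorsal", "back")] * 12 + [("dorsal", "central")] * 5)

n = len(syllables)
c_freq = Counter(c for c, v in syllables)   # marginal consonant-place counts
v_freq = Counter(v for c, v in syllables)   # marginal vowel-place counts
cv_freq = Counter(syllables)                # joint CV counts

for (c, v), observed in sorted(cv_freq.items()):
    expected = c_freq[c] * v_freq[v] / n    # chance level under independence
    print(f"{c}-{v}: observed/expected = {observed / expected:.2f}")

A ratio above 1.0 indicates that a consonant place and a vowel place co-occur more often than their separate frequencies predict; in this toy example, mirroring the three patterns reviewed above, the coronal-front, labial-central and dorsal-back combinations all exceed chance.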
Conclusion
Links between production and perception have not been widely explored in the available literature on the earliest periods of speech acquisition. In addition, the role of perceptual
input is not typically defined to include both extrinsic ambient effects and intrinsic effects of the infant’s own movements and acoustic output in production. This review suggests a more prominent role for influences from infant actions on both the quality of output and the growth of perceptual representation during this period when the infant is producing speech-like actions. Three of the four prominent production patterns examined showed little evidence of extrinsic perceptual effects on non-preferred production patterns compared to potential intrinsic movement-based constraints. In the case of the oral-nasal distinction, a role for extrinsic perception seems apparent, especially in light of data from hearing-impaired infants. In addition, an important aspect of early speech perception not previously remarked on might involve the various contingencies associated with frames based on mandibular oscillation, just as in production. The close temporal relation between the acoustics of closing and opening movements resulting from oscillation cycles may provide an intrinsic, movement-driven, perceptual reference pattern against which the complex relations between other articulatory movements and their perceptual consequences might eventually be understood. Considering intrinsic embodiment in actions as well as extrinsic perceptual inputs may produce a more comprehensive picture of speech acquisition, one that fully incorporates the interactions of perception with production.
Acknowledgment

This work was supported in part by NICHD R-01 HD27733-07.
References

Aslin, R.N.; Saffran, J.R.; Newport, E.L.: Computation of conditional probability statistics by 8-month-old infants. Psychol. Sci. 9: 321–324 (1998).
Barton, D.: Phonemic perception in children; in Yeni-Komshian, Kavanagh, Ferguson, Child phonology, vol. 2, pp. 97–114 (Academic Press, New York 1980).
Bickley, C.: Acoustic evidence for phonological development of vowels in young children. Proc. 10th Int. Congr. Phonet. Sci., vol. 4, pp. 233–235 (Utrecht 1983).
Buhr, R.D.: The emergence of vowels in an infant. J. Speech Hear. Res. 23: 73–94 (1980).
Clark, A.: Being there: putting brain, body, and world together again (Cambridge University Press, Cambridge 1997).
Cruttenden, A.: A phonetic study of babbling. Br. J. Disord. Commun. 5: 110–117 (1970).
Damasio, A.: Descartes’ error (Grosset/Putnam, New York 1994).
Davis, B.L.; MacNeilage, P.F.: The articulatory basis of babbling. J. Speech Hear. Res. 38: 1199–1211 (1995).
Davis, B.L.; MacNeilage, P.F.; Matyear, C.L.: Acquisition of serial complexity in speech production: phonetic patterns in first words (in revision).
Edelman, G.: Bright air, brilliant fire: on the matter of the mind (Basic Books, New York 1992).
Edwards, M.L.: Perception and production in child phonology: the testing of four hypotheses. J. Child Lang. 1: 205–219 (1974).
Eilers, R.; Minifie, F.: Fricative discrimination in early infancy. J. Speech Hear. Res. 18: 158–167 (1975).
Eimas, P.; Siqueland, E.; Jusczyk, P.; Vigorito, J.: Speech perception in infants. Science 171: 303–306 (1971).
Friederici, A.D.; Wessels, J.M.I.: Phonotactic knowledge and its use in infant speech perception. Percept. Psychophys. 54: 287–295 (1993).
Gerken, L.A.: Child phonology: past research, present questions, future directions; in Gernsbacher, Handbook of psycholinguistics, pp. 781–820 (Academic Press, New York 1994).
Gildersleeve-Neumann, C.E.; Davis, B.L.; MacNeilage, P.F.: Frame dominance in babbling: implications for fricatives, affricates and liquids. Appl. Psycholing. (in press).
Gleitman, L.R.; Wanner, E.: Language acquisition: the state of the art; in Wanner, Gleitman, Language acquisition: the state of the art, pp. 44–59 (Cambridge University Press, Cambridge 1982).
Hawkins, S.; Stevens, K.N.: Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels. J. acoust. Soc. Am. 77: 1560–1575 (1985).
Hirsh-Pasek, K.; Kemler-Nelson, D.G.; Jusczyk, P.W.; Wright-Cassidy, K.; Druss, B.; Kennedy, L.: Clauses are perceptual units for young infants. Cognition 26: 269–286 (1987).
Hodge, M.M.: A comparison of spectral temporal measures across speaker age: implications for an acoustical characterization of speech acquisition; PhD diss. University of Wisconsin-Madison (unpublished, 1989).
Hura, S.L.; Lindblom, B.; Diehl, R.L.: On the role of perception in shaping phonological assimilation rules. Lang. Speech 35: 59–72 (1992).
Ingram, D.: Phonological disability in children (Arnold, London 1976).
Johnson, M.: The body in the mind: the bodily basis of imagination, reason, and meaning (University of Chicago Press, Chicago 1987).
Jusczyk, P.; Friederici, A.D.; Wessels, J.; Svenkerud, V.Y.; Jusczyk, A.M.: Infants’ sensitivity to segmental and prosodic characteristics of words in their native language. J. Mem. Lang. 32: 402–420 (1993).
Jusczyk, P.W.: The discovery of spoken language (MIT Press, London 1997).
Jusczyk, P.W.; Cutler, A.; Redanz, N.: Preferences for the predominant stress patterns of English words. Child Dev. 64: 675–687 (1993).
Jusczyk, P.W.; Pisoni, D.B.; Mullennix, J.: Some consequences of stimulus variability on speech processing by 2-month-old infants. Cognition 43: 253–291 (1992).
Kent, R.D.: Reference manual for communicative science and disorders: speech and language (Pro-Ed, Austin 1994).
Kent, R.D.: The speech sciences (Singular Publishing, San Diego 1997).
Kluender, K.R.: Lessons from the study of speech perception. Beh. Br. Sci. 13: 739–740 (1998).
Kluender, K.R.; Diehl, R.L.; Killeen, P.R.: Japanese quail can learn phonetic categories. Science 237: 1195–1197 (1987).
Kuhl, P.: Perception of speech and sound in early infancy; in Salapatek, Cohen, Handbook of infant perception: from perception to cognition, vol. 2, pp. 224–238 (Academic Press, New York 1987).
Kuhl, P.K.: Perception of auditory equivalence classes for speech in early infancy. Infant Behav. Dev. 6: 263–285 (1983).
Lach, R.; Ling, D.; Ling, A.; Ship, H.: Early speech development in deaf infants. Am. Ann. Deaf 115: 522–526 (1970).
Ladefoged, P.; Maddieson, I.: The sounds of the world’s languages (Blackwell, Oxford 1997).
Lakoff, G.: Women, fire, and dangerous things: what categories reveal about the mind (University of Chicago Press, Chicago 1987).
Lakoff, G.; Johnson, M.: Metaphors we live by (University of Chicago Press, Chicago 1980).
Lakoff, G.; Johnson, M.: Philosophy in the flesh: the embodied mind and its challenge to Western thought (Basic Books, New York 1999).
Lashley, K.: The problem of serial order in behavior; in Jeffress, Cerebral mechanisms in behavior (Wiley, New York 1951).
Liberman, A.M.; Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21: 1–36 (1985).
Lieberman, P.: On the development of vowel productions in young children; in Yeni-Komshian, Kavanagh, Ferguson, Child phonology, vol. 1: production (Academic Press, New York 1980).
Lieberman, P.: The biology and evolution of language (Harvard University Press, Cambridge 1984).
Lindblom, B.: Phonological units as adaptive emergents of lexical development; in Ferguson, Menn, Stoel-Gammon, Phonological development: models, research, implications, pp. 131–164 (York, Timonium 1992).
Locke, J.: Phonological acquisition and change (Academic Press, New York 1983).
MacCaffrey, H.; Davis, B.L.; MacNeilage, P.F.; Von Hapsburg, D.: Effects of multi-channel cochlear implantation on the organization of early speech. Volta Rev. (in press).
Macken, M.; Barton, D.: A longitudinal study of the acquisition of stop consonants. J. Child Lang. 7: 41–74 (1980).
MacNeilage, P.F.; Davis, B.L.: Acquisition of speech production: frames, then content; in Jeannerod, Attention and performance XIII: motor representation and control, pp. 453–476 (LEA, Hillsdale 1990).
MacNeilage, P.F.; Davis, B.L.: A motor learning perspective on speech and babbling; in Boysson-Bardies, Schoen, Jusczyk, MacNeilage, Morton, Changes in speech and face processing in infancy: a glimpse at developmental mechanisms of cognition, pp. 341–352 (Kluwer, Dordrecht 1993).
MacNeilage, P.F.; Davis, B.L.: Deriving speech from non-speech: a view from ontogeny (this vol.).
MacNeilage, P.F.; Davis, B.L.; Kinney, A.; Matyear, C.L.: The motor core of speech: a comparison of serial organization patterns in infants and languages. Child Dev. (in press).
Maddieson, I.: Patterns of sounds (Cambridge University Press, Cambridge 1984).
Matyear, C.L.; MacNeilage, P.F.; Davis, B.L.: Nasalization of vowels in nasal environments in babbling: evidence of frame dominance. Phonetica 55: 1–17 (1997).
Mines, M.; Hanson, B.; Shoup, J.: Frequency of occurrence of phonemes in conversational English. Lang. Speech 21: 221–224 (1978).
Morgan, J.L.; Saffran, J.R.: Emerging integration of sequential and supra-segmental information in pre-verbal speech segmentation. Child Dev. 66: 911–936 (1995).
Redford, M.L.; MacNeilage, P.F.; Davis, B.L.: Perceptual and motor influences on final consonant inventories in babbling. Phonetica 54: 172–186 (1997).
Rizzolatti, G.; Arbib, M.: Language within our grasp. Trends Neurosci. 21: 188–194 (1998).
Rizzolatti, G.; Fadiga, L.; Fogassi, L.; Gallese, V.: The space around us. Science 277: 190–191 (1997).
Roug, L.; Landberg, I.; Lundberg, L.-J.: Phonetic development in early infancy: a study of four Swedish children during the first eighteen months of life. J. Child Lang. 16: 19–40 (1989).
Sander, E.: When are speech sounds learned? J. Speech Hear. Disord. 37: 55–63 (1972).
Shvachkin, N.Kh.: The development of phonemic perception in early childhood; in Ferguson, Slobin, Studies of child language development, pp. 222–280 (Holt, Rinehart & Winston, New York 1973).
Smit, A.B.L.; Hand, J.J.; Freilinger, J.E.; Bernthal, J.; Bird, J.: The Iowa articulation norms project and its Nebraska replication. J. Speech Hear. Disord. 55: 779–798 (1990).
Stager, C.L.; Werker, J.: Infants listen for more phonetic detail in speech perception than in word learning tasks. Nature 388: 381–382 (1997).
Stevens, K.N.; Keyser, S.J.: Primary features and their enhancement in consonants. Language 65: 81–106 (1989).
Stoel-Gammon, C.: Phonetic inventories, 15–24 months: a longitudinal study. J. Speech Hear. Res. 28: 505–512 (1985).
Stoel-Gammon, C.: Pre-linguistic vocalizations of hearing impaired and normally hearing subjects: a comparison of consonantal inventories. J. Speech Hear. Disord. 53: 302–315 (1988).
Studdert-Kennedy, M.: The phoneme as a perceptuo-motor experience; in Allport, MacKay, Prinz, Scheerer, Language perception and production, pp. 58–71 (Academic Press, New York 1987).
Thelen, E.; Smith, L.: A dynamic systems approach to the development of cognition and action (MIT Press, Boston 1994).
Varela, F.J.; Thompson, E.; Rosch, E.: The embodied mind: cognitive science and human experience (Cambridge University Press, Cambridge 1991).
Velleman, S.: The role of linguistic perception in later phonological development. Appl. Psycholing. 9: 221–236 (1988).
Vihman, M.; DePaolis, R.A.: The role of mimesis in infant development: evidence for phylogeny; in Hurford, Knight, Studdert-Kennedy, The evolutionary emergence of language (Cambridge University Press, Cambridge, in press).
Vihman, M.; Macken, M.; Simmons, H.; Miller, J.: From babbling to speech: a re-assessment of the continuity issue. Language 61: 397–445 (1985).
Vihman, M.M.: Phonological development: the origins of language in the child (Basil Blackwell, Oxford 1996).
Vihman, M.M.; Ferguson, C.E.; Elbert, M.: Phonological development from babbling to speech: common tendencies and individual differences. Appl. Psycholing. 7: 3–40 (1986).
Vihman, M.M.; Velleman, S.: The construction of a first phonology (this vol.).
Wellman, B.L.; Case, I.M.; Mengert, E.G.; Bradbury, D.E.: Speech sounds of young children. Univ. Iowa Stud. Child Welfare (University of Iowa Press, Iowa City 1931).
Werker, J.; Tees, R.C.: Cross-language speech perception: evidence for perceptual re-organization during the first year of life. Infant Behav. Dev. 7: 49–63 (1984).
Willerman, R.: The phonetics of pronouns: articulatory bases of markedness; PhD diss. University of Texas at Austin (unpublished, 1994).
En Route to Adult Spoken Language. Language Development

Phonetica 2000;57:242–254
Received: February 7, 2000 Accepted: March 23, 2000
Speech to Infants as Hyperspeech: Knowledge-Driven Processes in Early Word Recognition

Anne Fernald
Department of Psychology, Stanford University, Stanford, Calif., USA
Abstract

The intelligibility of a word in continuous speech depends on the clarity of the word and on linguistic and nonlinguistic contextual information available to the listener. Despite limited knowledge of language and the world, infants in the first 2 years are already beginning to make use of contextual information in processing speech. Adults interacting with infants tend to modify their speech in ways that serve to maximize predictability for the immature listener by highlighting focussed words and using frequent repetition and formulaic utterances. Infant-directed speech is viewed as a form of ‘hyperspeech’ which facilitates comprehension, not by modifying phonetic properties of individual words but rather by providing contextual support on perceptual levels accessible to infants even in the earliest stages of language learning. Copyright © 2000 S. Karger AG, Basel
Introduction
Spoken language understanding by adults is influenced not only by the acoustic patterns associated with individual words, but also by how effectively the listener makes use of prosodic, syntactic, semantic, and discourse-level features of the context in which the words occur [see Altmann, 1997]. For infants first learning to recognize and understand spoken words, sophisticated knowledge of language structure is not yet available as a resource in speech processing. However, caretakers speaking to infants may intuitively organize their speech in ways that provide perceptual advantages for the inexperienced listener [e.g. Ferguson, 1977; Fernald, 1992]. The idea that infant-directed (ID) speech is a ‘listener-oriented’ speech mode which could help language learners to identify potentially ambiguous words was considered by Lindblom [1990] in the context of his hyper- and hypospeech (H and H) theory. But Davis and Lindblom [in press] later rejected this proposal as inadequate for explaining how infants solve the problem of recognizing words which are highly variable in their acoustic realization. I argue here that this rejection was too hasty: the H and H perspective provides a valuable framework for investigating the contributions of ID speech to the development of speech processing and language understanding.
If every phoneme in the language had a physical definition which was invariant across different talkers and in all phonetic contexts, the task of word recognition would in principle be straightforward. But the same intended speech sound varies considerably when spoken by different people [Johnson and Mullennix, 1997], and a particular phoneme is physically different in different phonetic contexts even when spoken by the same person [see Perkell and Klatt, 1986]. Lindblom [1990] has proposed that experienced listeners can recognize ambiguous words despite extensive phonetic variability because they also make use of many other kinds of information beyond the words themselves. According to H and H theory, a rapidly spoken word which is phonetically underspecified will still be understood by the knowledgeable listener if sufficient ‘signal-independent’ contextual information is available to enable lexical access. Moreover, speakers collaborate with listeners to provide sufficient information by monitoring the listener’s state of knowledge and other aspects of the communicative situation and adjusting their speech output appropriately to maximize intelligibility. Lindblom [1990] predicts that when contextual support is minimal, speakers will tend to compensate by articulating more clearly, using a kind of ‘hyperspeech’. Thus the H and H argument goes well beyond the observation that top-down information can facilitate word recognition, by focusing on the speaker’s role in enhancing the listener’s understanding within the dynamic context of conversation.

When the listener is a linguistically inexperienced infant, however, it is not clear how signal-complementary information could play such an important role. For the child learning language, the extensive phonetic variability in continuous speech would seem to present a problem that could not possibly be offset by access to linguistic information at other levels. In addition to the lack of acoustic invariance in the speech signal, infants also face the challenge of how to identify word and phrase boundaries, which are generally not physically demarcated in continuous speech among adults [Fernald and McRoberts, 1996]. The preverbal infant could be seen as the limiting case of a listener without access to higher levels of language structure as a resource in disambiguating spoken words. H and H theory would predict, first, that infants must rely much more heavily than adults on explicit signal information in speech if they cannot yet make use of signal-complementary information, and, second, that adults speaking to infants should compensate for these limitations by using a hyperspeech mode, in order to reduce phonetic variability and provide clearer exemplars for the inexperienced listener.

To explore the hypothesis that phonetic variability is lower in ID than in adult-directed (AD) speech, Davis and Lindblom [in press] analyzed vowel formant frequencies in speech to a 6-month-old infant. Contrary to the prediction that mothers articulate more clearly in ID speech, they found substantial phonetic variability in speech to the child.
Davis and Lindblom [in press] then concluded that H and H theory is not useful in explaining the variation in ID speech, based on their assumption that preverbal infants could not possibly have access to signal-independent information to help them identify words in the face of extensive phonetic ambiguity. ‘Since it goes without saying that such knowledge is largely still to be developed by infants, phonetic variation in BT [ID speech] directed to a 6-month-old child cannot be successfully accounted for by invoking H and H theory’ [Davis and Lindblom, in press, p. 15].
However, recent research on infant speech processing suggests that some of the central insights motivating H and H theory are indeed relevant to understanding how ID speech facilitates speech perception and word recognition by infants. In light of new experimental evidence on infants’ attention to structure in speech [e.g. Jusczyk and Aslin, 1995; Saffran et al., 1996], on the one hand, and on the role of ID speech in facilitating word recognition [Fernald et al., in press], on the other, Davis and Lindblom’s [in press] assumption that infants cannot exploit signal-complementary information seems (happily) too pessimistic. It no longer goes without saying that preverbal infants rely exclusively on explicit signal information in recognizing spoken words.

In making this case, I take liberties with the original formulation of H and H theory by focusing on only one half of the argument, exploring the notion of listener-oriented hyperspeech in interactions with infants without discussing Lindblom’s [1990] complementary notion of production constraints leading to the use of hypospeech. Moreover, while H and H theory focuses on speakers’ accommodations to listeners at the segmental level, the idea of hyperspeech proposed here is extended to include other kinds of modifications which may enhance intelligibility. Although the new findings from infancy research are still a long way from resolving the issue of how inexperienced listeners cope with variability at the phonetic level, they suggest that even very young infants are adept at discerning structure on many levels as they listen to spoken language, and that ID speech can indeed be seen as an effective form of hyperspeech which serves functions related to those proposed by Lindblom [1990].
Signal-Complementary Information in Speech Signal Processing by Adults
The fact that adult listeners do not rely only on acoustic information in the signal to identify words in continuous speech has been documented extensively in experiments exploring the role of contextual factors in spoken language understanding. Early research by Pickett and Pollack [1963a, b] showed that when words are excised from their context, they are often phonetically ambiguous and difficult to understand in isolation. Warren’s [1970] parallel demonstration of the ‘phoneme restoration effect’ showed that when sufficient context is available, listeners can understand a word even when a segment within the word is replaced by noise, often failing to notice that the segment is missing. What Lindblom [1990] refers to as ‘signal-complementary’ or ‘signal-independent’ information includes such classic examples of contextual cues used to compensate for incomplete phonetic specification. But these terms can also be extended to a broader range of sources of linguistic and nonlinguistic knowledge which the experienced listener can exploit in making judgments and predictions about word identity. The many forms of organization in spoken language which are potentially informative sources of signal-complementary knowledge include the following:

Coarticulation Effects. Since the vocal tract changes shape continuously and anticipates what is coming next, a speech sound reflects not only the intended sound but also previous and subsequent sounds in the sequence. Thus the vowel in the word job is acoustically different from the vowel in either sob or jog. By splicing the initial consonant and vowel from job to the final consonant from jog, Marslen-Wilson and Warren [1994] created a word which was perceived as jog but in which there was a mismatch between the place of articulation indicated by the vowel transition and the
place of articulation indicated by the postvocalic consonant release. They found that subjects in a lexical decision task responded more slowly and less accurately to words in which the coarticulation cues were misleading, indicating that listeners make use of the smallest detail possible to distinguish among words in the mental lexicon.

Phonotactics. Languages differ not only in the sounds they use but also in restrictions on how these sounds can be ordered to form syllables and words. For example, in English the consonant cluster [ft] can occur at the end of a word, as in raft and theft, but not at the beginning of a word, while the opposite pattern holds for the consonant cluster [fr]. Knowledge of such language-specific phonotactic rules can guide the listener in the correct segmentation of words in continuous speech.

Prosodic Regularities. Languages also have characteristic rhythmic properties such as the strong/weak stress pattern characteristic of many words in English [Cutler and Carter, 1987]. Cross-linguistic research shows that listeners make use of such prosodic regularities in identifying word boundaries in continuous speech [see Cutler et al., 1997], and that listeners’ segmentation strategies reflect the rhythmic structure of their native language [e.g. Otake et al., 1993].

Lexical Patterns. Syllables that co-occur in a fixed order are likely to constitute words, and listeners can use the first part of a syllable sequence to make accurate predictions about what follows. Research on the time course of lexical access shows that adult listeners process speech incrementally, generating and rejecting hypotheses about the identity of a word based on what they have heard so far [e.g. Marslen-Wilson and Zwitserlood, 1989]. For example, a syllable sequence such as /ele/ activates several English words including elegant, elevator, and elephant, all consistent with the word-initial phonetic information. However, when the listener hears that the next segment is /f/, the word elephant can be uniquely identified even before the final syllable is completed. Such distributional regularities in patterns of syllables provide another kind of signal-complementary information which listeners can use in word recognition.

Semantic Relations. Semantic priming can fundamentally shape the perception of a physically ambiguous speech signal. To use Lindblom’s [1990] example, the phrase less’n five is interpreted quite differently when preceded by the question What was your homework assignment? than when preceded by the question How many people came today? Although the pronunciation is identical in both cases, less’n five would be perceived unambiguously as lesson five in the first case, and as less than five in the second. This phenomenon is just one of many in the substantial literature in cognitive psychology on concept-driven or top-down processing in both the auditory and visual domains [e.g. Stanovich and West, 1983].

Syntactic Cues. While the English word to could in principle be followed by a word in several classes, in the context of the phrase beginning to, it can only be followed by a verb. Listeners take such grammatical cues into account when interpreting an ambiguous sequence such as /ətæk/, which is perceived differently in different syntactic frames – as a verb in wants to attack or as a determiner plus noun in sit on a tack.
Thus knowledge of syntactic rules can also constrain alternatives in word recognition, although context effects due to syntactic congruity are weaker than effects due to semantic congruity [e.g. Tyler and Wessels, 1983].

Discourse-Level Cues. The content and coherence of ongoing discourse provide another source of contextual information which can facilitate recognition of words without exclusive reliance on sensory information. Numerous studies have shown how real-world knowledge is brought to bear in language comprehension, as when well-
known scripts are activated and increase the speed of recognition of related words [e.g. Sharkey and Mitchell, 1985].

Nonlinguistic Information. Listeners also integrate information from the visual context with linguistic information in the process of identifying spoken words [e.g. Tanenhaus et al., 1995]. For example, when listeners follow spoken instructions to manipulate real objects, they closely monitor the scene before them. Using an eye-tracking camera, Tanenhaus et al. [1995] showed that listeners’ eye movements to the objects named are time-locked to the referring words, and that visual information is continuously integrated in the course of arriving at a correct interpretation of the utterance.

These few examples make the point that in natural discourse, adult listeners have numerous and diverse sources of signal-complementary information which they can rely on simultaneously in transforming an ambiguous acoustic signal into a meaningful representation.
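To make the incremental narrowing described under Lexical Patterns concrete, the following minimal sketch filters a hypothetical mini-lexicon by the phoneme sequence heard so far. Real lexical access additionally weighs word frequency, semantic context, and fine phonetic detail, none of which is modeled here; the lexicon and its phonemic codings are simplified assumptions for illustration.

# Minimal sketch of incremental candidate narrowing during word recognition.
# The mini-lexicon and its phonemic codings are hypothetical simplifications.
lexicon = {
    "elegant":  ("e", "l", "e", "g", "a", "n", "t"),
    "elevator": ("e", "l", "e", "v", "a", "t", "o", "r"),
    "elephant": ("e", "l", "e", "f", "a", "n", "t"),
}

def cohort(heard, lex=lexicon):
    # Return the words still consistent with the phonemes heard so far.
    return [word for word, phones in lex.items()
            if phones[:len(heard)] == tuple(heard)]

for heard in [("e",), ("e", "l", "e"), ("e", "l", "e", "f")]:
    print("/" + "".join(heard) + "/ ->", cohort(heard))
# /e/    -> all three candidates remain active
# /ele/  -> still all three
# /elef/ -> 'elephant' is uniquely identified before the word is finished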
Do Infants Have Access to Signal-Complementary Information?
Until recently, it was difficult to imagine how preverbal infants who show no signs of understanding the words they hear could possibly be making use of top-down information in speech processing. The pioneering early studies on the development of speech perception tested infants’ ability to discriminate isolated syllables [see Aslin et al., 1998]. There was initially little communication between the researchers trained in psychoacoustics who studied speech processing skills evident in the first months of life and the language acquisition specialists, typically trained in linguistics, who studied the development of comprehension abilities that start to emerge in spontaneous behavior at the end of the 1st year. In more recent research, however, these two traditions have begun to converge on common questions about the early development of receptive language skills. The focus on infants’ perception of isolated syllables has shifted toward explorations of infants’ ability to discern patterns in continuous speech which are relevant to several different aspects of linguistic structure. The forms of signal-complementary information potentially useful to adults in word recognition draw on knowledge of the phonological, semantic, syntactic, and discourse organization of the language as well as on nonlinguistic knowledge. Because infants are still in the process of learning about the phonology of the ambient language, and have just begun to build a lexicon and to figure out syntactic relations, they have limited resources for disambiguating spoken words using signal-independent knowledge. However, even in the early months, infants’ attention to regularities in the speech they hear provides a foundation for emerging linguistic capabilities which will become more evident in the 2nd year of life. Long before they understand word meanings, infants perceive speech sounds in categories shaped by exposure to the ambient language [e.g. Kuhl et al., 1992; Werker and Tees, 1983]. By 7 months of age, infants are also sensitive to phonotactic patterns typical of the language [e.g. Jusczyk et al., 1994]. That is, they show evidence of recognizing not only which speech sounds are common in the language they are hearing but also which sequences of speech sounds are legitimate. Around this age, infants also attend differentially to language-specific stress patterns. Jusczyk et al. [1993] found that English-learning infants show a listening preference for the strong-weak stress pattern typical of the majority of English words. Other
studies show that infants recognize recurrent patterns in strings of speech sounds that more closely resemble natural speech. Jusczyk and Aslin [1995] found that when 8-month-olds are presented repeatedly with a bisyllabic word embedded in continuous speech, they can later recognize the word as familiar when it is presented in isolation. Even when familiarized only briefly with nonsense syllable strings that do not represent actual language samples, 9-month-olds appear to extract wordlike units by noticing which syllables co-occur [Saffran et al., 1996]. These studies indicate that over the 1st year, infants become increasingly skilled in making detailed distributional analyses of acoustic-phonetic features of spoken language.
Given that the abilities to recognize familiar sequences of speech sounds and to parse the speech stream into wordlike units are emerging over the 1st year, can these be viewed as potential sources of signal-complementary knowledge in the sense intended by Lindblom [1990]? For example, does early awareness of phonotactic regularities or typical stress patterns in English actually help the infant to identify ambiguous words? Research directly addressing this question is just beginning, but Jusczyk et al. [1993] have found that English-learning infants more readily segment embedded words which have the strong-weak stress pattern prevalent in English than words with the less common weak-strong stress pattern [Jusczyk, 1997]. This finding suggests that infants’ segmentation strategies become more efficient through early experience, as they are adapted to exploit recurrent patterns which provide cues to word boundaries in the ambient language. It is important to note, however, that the ‘word segmentation’ skills demonstrated at this age are more appropriately viewed as an increasingly fine-tuned ability to recognize familiar phonetic patterns. Although competence in identifying sequences of sounds as coherent acoustic patterns is obviously prerequisite for comprehension, this can occur without any association between a particular sound sequence and a word meaning. For example, Hallé and de Boysson-Bardies [1994] found that 10-month-old French-learning infants listened longer to words likely to be familiar to them than to less common words, showing that infants can have some kind of acoustic-phonetic representation for frequently heard words before there is any evidence for comprehension. While the 10-month-old infant’s ability to identify a spoken word as an exemplar of a familiar sound sequence constitutes word recognition in only a limited sense, it is nevertheless an essential step in the process. Infants’ rudimentary awareness of phonological regularities and recurrent lexical patterns functions as signal-complementary knowledge in facilitating segmentation, even at this early stage of language development.
Toward the end of the 1st year infants begin to learn meanings for words, at first very gradually. Around the age of 18 months, many infants pick up speed in word learning, a shift in rate of acquisition which is often referred to as the ‘vocabulary burst’. During this period they also pick up speed in another important domain, recognizing familiar spoken words more quickly and reliably. Research in our laboratory has shown that infants make dramatic gains in speech processing efficiency between the ages of 15 and 24 months [Fernald et al., 1998].
We tested infants in a procedure where they looked at pictures of familiar objects while listening to speech naming one of the objects. By closely monitoring their eye movements, we could assess the speed and accuracy of word recognition. When listening to a sentence such as Where is the baby?, 15-month-old infants shifted their gaze to the matching picture only after the target word had been completely spoken. However, 24-month-old infants typically responded about 300 ms faster, shifting their gaze to the correct picture before the end of the target word.
Thus around the time of the vocabulary burst, infants not only increase the rate of learning new words, but also increase the speed and efficiency with which they recognize and understand familiar words in continuous speech.
In two further studies, we asked whether infants are able to process speech incrementally, an ability central to the speed and efficiency of spoken language understanding by adults. Swingley et al. [1999] showed 24-month-old infants pairs of pictures of objects whose names have either substantial phonetic overlap at word onset (doggie/doll) or no overlap (e.g. doggie/tree). While doggie is distinguishable from tree at the beginning of the word, doggie and doll could not be discriminated by adults until the occurrence of the second consonant. The question of interest was whether infants hearing the word doggie would be slower to shift to the correct picture on dog/doll trials than on dog/tree trials. As with adults, 2-year-olds rapidly distinguished doggie from tree, but were slower to distinguish doggie from doll, suggesting that they were monitoring the speech signal continuously and taking advantage of word-initial information to identify these words. In the next study we presented 18- to 21-month-old infants with truncated target words in which only the first part of the word was available. If infants can identify familiar words when only the word-initial information is presented, this would provide even more convincing evidence for the early development of incremental processing. Infants heard sentences containing either a whole target word (e.g. baby, doggie, kitty), or a partial target word constructed by deleting the final segments of a whole word (e.g. /beɪ/, /dɔ/, /kɪ/). We found that 18- and 21-month-old infants were able to recognize spoken words rapidly and reliably on the basis of partial segmental information.
These findings show that by the age of 18 months, when English-learning infants on average can speak fewer than 100 words, they are already beginning to make use of signal-complementary knowledge in word recognition. In the experiments described above, infants chose the matching picture using only word-initial phonetic information, making their move before the signal was completely specified. This required integrating information from the visual context with stored knowledge of word meanings and of the sounds associated with those meanings. As with most experimental procedures, the context here was highly constrained as compared to more natural situations in which language is spoken and understood, since the object named was visually present and the infant saw only two objects at a time. However, by imposing such experimental constraints we revealed aspects of infants’ developing competence in language understanding which could not be seen in spontaneous behavior at this age. The finding that infants increase the speed of recognizing familiar spoken words by 300 ms between 15 and 24 months is very likely related to parallel developments in the ability to remember and produce longer strings of words, an ability which is essential for learning syntax. Similarly, the finding that infants, like adults, can recognize some words without complete signal specification shows that young language learners are already integrating information from the acoustic signal with linguistic and nonlinguistic contextual information as they listen to continuous speech.
Although the sources of signal-complementary knowledge available are minimal compared to those used by adults, infants take advantage of whatever sources they have, even in the very early stages of developing competence in spoken language understanding.
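The incremental character of this processing can be made concrete with a toy cohort-style simulation. The sketch below is illustrative only: the three-item lexicon, its phoneme strings, and the function name are invented for exposition and are not the stimuli or analyses of the studies just described.

# Toy cohort-style simulation of incremental word recognition (Python).
# The lexicon and phoneme strings are simplified illustrations, not the
# materials or analyses of the experiments described above.
LEXICON = {
    "doggie": ("d", "o", "g", "i"),
    "doll":   ("d", "o", "l"),
    "tree":   ("t", "r", "i"),
}

def candidates_over_time(target):
    """Yield, after each successive phoneme of the target, the set of
    lexical items still consistent with the input heard so far."""
    heard = []
    for phoneme in LEXICON[target]:
        heard.append(phoneme)
        consistent = {word for word, phones in LEXICON.items()
                      if phones[:len(heard)] == tuple(heard)}
        yield phoneme, consistent

for phoneme, consistent in candidates_over_time("doggie"):
    print(phoneme, sorted(consistent))
# d ['doggie', 'doll']   <- 'tree' is excluded at the first segment
# o ['doggie', 'doll']
# g ['doggie']           <- 'doll' survives until the second consonant
# i ['doggie']

On this toy lexicon the candidate set behaves like the looking data: tree drops out immediately, doll only at the second consonant, and a truncated input such as /beɪ/ already suffices whenever no competitor shares the onset.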
Adult Speech to Infants as ‘Hyperspeech’
Although I am ignoring central issues in the H and H argument by focusing on infant speech processing, it is important to see how early language input fits into the larger framework of Lindblom’s [1990] theory. The goal of H and H theory is to provide a solution for the so-called invariance problem, which refers to the absence of consistent associations between phonemes and particular patterns of acoustic energy in the speech signal [see Perkell and Klatt, 1986]. The invariance problem turns out to be more problematic for theorists than it is for listeners, since word recognition is in fact generally effortless and unproblematic for adults. Lindblom [1990] argues that phonetic variability exists for good reason, and that the quest for acoustic invariance is misguided; rather than searching for a prototypical pattern of physical correlates for each speech sound, the quest should be redirected toward discovering how speech signals vary as a function of the fluctuating demands of the conversational situation. According to H and H theory, speakers continually adjust their productions to accommodate the listener. In situations where the listener has difficulty understanding, the speaker provides a clear signal by using more precisely articulated hyperspeech. In the converse situation, where the listener has access to sufficient signal-complementary information to follow the speaker’s meaning, the speaker reverts to less clearly articulated hypospeech. Thus phonetic variability reflects the lawful covariation between modulations in the acoustic signal along the hyper-/hypospeech continuum and the communicative pressures that induce these modulations.
Speech to infants is relevant in the context of H and H theory for two related reasons: First, Lindblom [1990] is interested in the nature and early development of the vowel and consonant categories formed by infants. And second, ID speech may function as a form of hyperspeech which enhances intelligibility for the naïve listener. I will focus here only on the second issue, which Davis and Lindblom [in press] rejected based on their finding of extensive phonetic variability in mothers’ speech.
ID speech has been characterized as listener-oriented because adults modify their speech in various ways which might be particularly effective when interacting with immature listeners. Caretakers speaking to infants in many cultures tend to use shorter utterances, longer pauses, higher pitch, and wider pitch excursions than when talking to an adult [e.g. Fernald and Simon, 1984; Fernald et al., 1989]. The intonation of ID speech is effective in engaging the infant’s attention and interest [e.g. Fernald, 1985; Werker and McLeod, 1990] and in eliciting emotional responses [e.g. Fernald, 1992, 1993]. In addition to these attentional and affective functions, ID speech may also enhance the intelligibility of speech for the child [Ferguson, 1977], a prediction consistent with the idea that ID speech is a hyperspeech mode. Several researchers have analyzed corpora of speech addressed to infants and adults to see if ID and AD speech differ in phonetic and prosodic features which might increase intelligibility [e.g. Bernstein-Ratner, 1986]. In a cross-language comparison of ID and AD speech in English, Swedish and Russian, Kuhl et al. [1997] found that vowels were somewhat more peripheral in the vowel space in ID speech. Although Kuhl et al.
[1997] interpreted these findings as evidence for the enhanced distinctiveness of vowels in speech to infants, it is important to note that there was substantial acoustic variability among ID vowels in the cross-language sample, comparable to the variability Davis and Lindblom [in press] found in their analysis of mothers’ speech in English. Thus these results do not support the idea that ID speech helps infants by providing them with speech that is less variable than AD speech at the segmental level.
Davis and Lindblom [in press] concluded that mothers do not use clearly articulated hyperspeech with infants. Their assumption that infants have no access to the kinds of signal-complementary knowledge available to adults then led to their conclusion that the H and H account is not relevant to ID speech. However, it is not the case that infants rely only on the quality of the signal in word recognition, as we saw in the research reviewed above. Even in the 1st year, infants make use of their rudimentary knowledge of regularities in the speech they are hearing to identify potential linguistic units. And by the middle of the 2nd year, they are using lexical knowledge in combination with visual context and incomplete acoustic information to identify spoken words with impressive speed and accuracy. Because Davis and Lindblom [in press] underestimated the ability of young infants to make use of signal context, the hyperspeech model should not be discredited on these grounds. The evidence does show that ID speech is not a hyperspeech form as originally defined in H and H theory, i.e. speech in which vowels and consonants are more clearly articulated and more homogeneous than in AD speech. However, the idea that adults intuitively and dynamically accommodate speech to infants in order to make their meanings more accessible to inexperienced listeners is still plausible.
My proposal is that the hyperspeech notion should not be confined to articulatory factors at the segmental level, but should be extended to a wider range of factors in speech that facilitate comprehension by the infant. Lindblom [1990] describes hyperspeech as dynamic because speakers increase signal quality to compensate for the lack of signal-complementary information available to the listener. This account implies that speakers have control over the signal but have no control over the context in which the signal occurs. An even more dynamic account of hyperspeech would explore both how speakers articulate speech sounds more or less clearly in different contexts, and also how speakers continually adjust the kinds and extent of contextual information they provide to the listener. If the hyperspeech notion is expanded to include not only ways of enhancing signal clarity but also ways of increasing the availability of signal-complementary information, all in the service of successful communication between speaker and listener, then ID speech provides some good examples of hyperspeech in action.
How ID Speech Facilitates Word Recognition
In addition to the suprasegmental modifications in ID speech reviewed above, caretakers also arrange the words they speak to infants in ways which may facilitate comprehension. For example, focussed words are frequently spoken in isolation [Aslin et al., 1996; Fernald and Morikawa, 1993], a rare occurrence in AD speech. When focussed words are embedded in multiword utterances, English-speaking mothers often place them on pitch peaks at the end [Fernald and Mazzie, 1991]. English-speaking mothers also increase the duration of content words when talking to children [Albin and Echols, 1996; Swanson et al., 1992]. Another distinctive characteristic of ID speech is the high rate of repetition, as mothers either repeat themselves exactly or use a theme-and-variation style in successive utterances [e.g. Fernald and Morikawa, 1993]. Whether or not these ID speech modifications actually facilitate word recognition is a question that can only be addressed by testing infants’ understanding of spoken
words in controlled experiments. Using the procedure described earlier, we have begun to explore the effects of various characteristic features of ID speech on word recognition by infants in the 2nd year. In one study, we tested the hypothesis that familiar words occurring at the end of the utterance would be recognized more reliably than the same words when embedded in the middle of the utterance [see Fernald et al., in press]. Subjects were 15- and 19-month-old English-learning infants, who heard sentences containing a familiar word either in utterance-final position (e.g. Over there there’s a BALL) or in utterance-medial position (e.g. There’s a BALL over there). The younger infants were able to identify the target words when spoken in final position, but not in medial position; the older infants performed above chance on both final and medial words, but were still significantly better at recognizing words in final position. Thus even when a familiar word was easily recognized when it came at the end of a sentence, the same word embedded in the middle of the sentence presented a more difficult processing task. These findings suggest that the common ID speech strategy of positioning focussed words at the end of the utterance has perceptual advantages for the language-learning infant.
In another study we explored the effects of repetition of the carrier phrase on word recognition by 18-month-old infants [Fernald and McRoberts, 1999]. Infants heard familiar words positioned at the end of nonsense carrier phrases made up of permutations of the same four syllables (e.g. Ba li gu do a BALL, Gu do ba li a BALL, etc.). In the highly predictable fixed frame condition, infants heard target words in the same carrier phrase on every trial; in the less predictable variable frame condition, infants heard the target words in four different carrier phrases in random order across trials. Infants in the fixed frame condition performed better overall, and also showed a learning effect over trials. That is, within a few trials in which the same carrier phrase was used repeatedly, they were able to figure out where to listen for the target word in the meaningless string of syllables. However, when the carrier phrase was less predictable, infants were less successful in listening for the familiar word. These results provide another example of how characteristic patterns of redundancy in ID speech can serve as a kind of support system for the novice listener.
When an adult speaking to a young child intuitively places the focussed word in a predictable and perceptually accessible position in the sentence, and uses short repetitive phrases to lead up to the word of interest, the word is easier for the child to recognize and understand. These common strategies in ID speech can be seen as ways of providing signal-complementary information which function more on a perceptual than a linguistic level in enabling the child to identify units in the stream of speech. When adults rely on contextual information in understanding speech, the expectations that inform their understanding are based on their knowledge of the language and on a sophisticated appreciation of the discourse situation. Infants can make only limited use of expectations based on linguistic knowledge, but they can make much greater use of expectations which are induced, so to speak, by the exaggerated redundancy of ID speech itself.
That is, caretakers provide a form of signal-complementary information which does not depend on extensive knowledge of language and pragmatics in order to be accessible to the infant, but rather depends on more general auditory pattern recognition abilities. By maximizing predictability through frequent repetition and the use of formulaic speech patterns which highlight focussed words, ID speech provides contextual support on perceptual levels which are accessible even to preverbal infants.
Summary and Conclusions
Lindblom [1990] points out that H and H theory provides a ‘presupposition’ account of intraspeaker phonetic variation. In linguistics, this term is used with reference to how speakers vary the lexical, grammatical, and discourse-level information in their productions, based on assumptions about what the listener knows or needs to know in order to understand the message. Many experiments show how the intelligibility of a word in continuous speech depends partly on the clarity of the word itself, but also on the availability of contextual information which engages the listener’s stored linguistic knowledge and knowledge about the world. Such signal-complementary knowledge facilitates word recognition, and may be especially important when the signal is ambiguous. Viewing the phonetics of speech production as analogous to the generation of lexical and grammatical forms, Lindblom [1990] coined the term ‘hyperspeech’ to refer to the speaker’s tendency to adjust the quality of the signal to accommodate the listener’s needs.
Although ID speech has been characterized as a listener-oriented speech register, Davis and Lindblom [in press] found extensive phonetic variability in speech to infants and thus concluded that ID speech is not a form of hyperspeech. They also reasoned that since infants have not developed the signal-complementary linguistic knowledge which would enable them to compensate for the inherent ambiguity of speech sounds, the variability in ID speech could not be understood in terms of H and H theory. Davis and Lindblom’s [in press] conclusion makes sense if the definition of hyperspeech is limited to speech modulations at the phonetic level. However, if hyperspeech is also understood to include speech modulations of other kinds which serve to increase predictability and to generate expectations about where focussed words will occur in the sequence of syllables, then ID speech can indeed be viewed as a hyperspeech form with beneficial consequences for the language-learning infant.
Although phonetic variability is extensive in ID speech, infants have two sources of signal-complementary knowledge to exploit in word recognition. The first is their early sensitivity to regularities in the speech they hear which are correlated with structural aspects of the ambient language. The second is their attention to the local redundancies provided by caretakers who use a special form of hyperspeech when speaking to the infant, i.e. who repeat themselves frequently, emphasize focussed words by placing them in perceptually prominent positions, and lighten the load of monitoring continuous speech by presenting new and focussed information in predictable formats. Hyperspeech in this broader view is not used just to clarify the signal when contextual knowledge is lacking. ID speech also functions to increase the amount of contextual information available, by organizing the speech stream in such a way that focussed words are likely to be salient even to a young listener with minimal linguistic experience. This account differs substantially from Lindblom’s [1990] formulation of hyperspeech in H and H theory and skirts the difficult question of how infants cope with phonetic variability. However, the functional view of ID speech proposed here suggests that the notion of hyperspeech can deepen our understanding of how signal-complementary information on many different linguistic and nonlinguistic levels influences word recognition even in the earliest stages of language learning.
References
Albin, D.D.; Echols, C.H.: Stressed and word-final syllables in infant-directed speech. Inf. Behav. Dev. 19: 401–418 (1996).
Altmann, G.T.M.: The ascent of Babel: an exploration of language, mind, and understanding (Oxford University Press, Oxford 1997).
Aslin, R.N.; Jusczyk, P.W.; Pisoni, D.B.: Speech and auditory processing during infancy: constraints on and precursors to language; in Kuhn, Siegler, Cognition, perception, and language. Child Psychol., vol. II (Wiley, New York 1998).
Aslin, R.N.; Woodward, J.Z.; LaMendola, N.P.; Bever, T.G.: Models of word segmentation in fluent maternal speech to infants; in Morgan, Demuth, Signal to syntax: bootstrapping from speech to grammar in early acquisition (Erlbaum, Hillsdale 1996).
Bernstein-Ratner, N.: Durational cues which mark clause boundaries in mother-child speech. J. Phonet. 14: 303–309 (1986).
Cutler, A.; Carter, D.M.: The predominance of strong initial syllables in the English vocabulary. Computer Speech Lang. 2: 133–142 (1987).
Cutler, A.; Dahan, D.; Donselaar, W.: Prosody in the comprehension of spoken language: a literature review. Lang. Speech 40: 141–201 (1997).
Davis, B.L.; Lindblom, B.: Phonetic variability in baby talk and development of vowel categories; in Lacerda, von Hofsten, Heineman, Emerging cognitive abilities in infancy (Cambridge University Press, Cambridge, in press).
Ferguson, C.A.: Baby talk as a simplified register; in Snow, Ferguson, Talking to children: language input and acquisition (Cambridge University Press, Cambridge 1977).
Fernald, A.: Four-month-old infants prefer to listen to motherese. Inf. Behav. Dev. 8: 181–195 (1985).
Fernald, A.: Human maternal vocalizations to infants as biologically relevant signals: an evolutionary perspective; in Barkow, Cosmides, Tooby, The adapted mind: evolutionary psychology and the generation of culture (Oxford University Press, Oxford 1992).
Fernald, A.: Approval and disapproval: infant responsiveness to vocal affect in familiar and unfamiliar languages. Child Dev. 64: 657–674 (1993).
Fernald, A.; Mazzie, C.: Prosody and focus in speech to infants and adults. Devl. Psychol. 27: 209–221 (1991).
Fernald, A.; McRoberts, G.W.: The prosodic bootstrapping analysis: a critical analysis; in Morgan, Demuth, Signal to syntax: bootstrapping from speech to grammar in early acquisition (Erlbaum, Hillsdale 1996).
Fernald, A.; McRoberts, G.W.: Listening ahead: how repetition enhances infants’ ability to recognize words in fluent speech. 24th Annu. Boston Univ. Conf. on Lang. Dev., Boston 1999.
Fernald, A.; McRoberts, G.W.; Swingley, D.: Infants’ developing competence in recognizing and understanding words in fluent speech; in Weissenborn, Hoehle, Approaches to bootstrapping in early language acquisition (Benjamins, Amsterdam, in press).
Fernald, A.; Morikawa, H.: Common themes and cultural variations in Japanese and American mothers’ speech to infants. Child Dev. 64: 637–656 (1993).
Fernald, A.; Pinto, J.P.; Swingley, D.; Weinberg, A.; McRoberts, G.W.: Rapid gains in speed of verbal processing by infants in the 2nd year. Psychol. Sci. 9: 228–231 (1998).
Fernald, A.; Simon, T.: Expanded intonation contours in mothers’ speech to newborns. Devl. Psychol. 20: 104–113 (1984).
Fernald, A.; Taeschner, T.; Dunn, J.; Papousek, M.; et al.: A cross-language study of prosodic modifications in mothers’ and fathers’ speech to preverbal infants. J. Child Lang. 16: 477–501 (1989).
Hallé, P.A.; de Boysson-Bardies, B.: Emergence of an early receptive lexicon: infants’ recognition of words. Inf. Behav. Dev. 17: 119–129 (1994).
Johnson, K.; Mullennix, J.: Talker variability in speech processing (Academic Press, New York 1997).
Jusczyk, P.W.: The discovery of spoken language (MIT Press, Cambridge 1997).
Jusczyk, P.W.; Aslin, R.N.: Infants’ detection of the sound patterns of words in fluent speech. Cognitive Psychol. 29: 1–23 (1995).
Jusczyk, P.W.; Cutler, A.; Redanz, N.: Preference for the predominant stress patterns of English words. Child Dev. 64: 675–687 (1993).
Jusczyk, P.W.; Luce, P.A.; Charles-Luce, J.: Infants’ sensitivity to phonotactic patterns in the native language. J. Mem. Lang. 33: 630–645 (1994).
Kleiman, G.M.: Sentence frame contexts and lexical decisions: sentence-acceptability and word-relatedness effects. Mem. Cogn. 8: 336–344 (1980).
Kuhl, P.K.; Andruski, J.; Chistovich, L.; Kozhevnikova, E.; Ryskina, V.; Stolyarova, E.; Sundberg, U.; Lacerda, F.: Cross-language analysis of phonetic units in language addressed to infants. Science 277: 684–686 (1997).
Kuhl, P.K.; Williams, K.A.; Lacerda, F.; Stevens, K.N.; Lindblom, B.: Linguistic experience alters phonetic perception in infants by 6 months of age. Science 255: 606–608 (1992).
Lindblom, B.: Explaining phonetic variation: a sketch of the H and H theory; in Hardcastle, Marchal, Speech production and speech modeling (Kluwer, London 1990).
Marslen-Wilson, W.; Warren, P.: Levels of perceptual representation and process in lexical access: words, phonemes, and features. Psychol. Rev. 101: 653–675 (1994).
Marslen-Wilson, W.; Zwitserlood, P.: Accessing spoken words: the importance of word onsets. J. exp. Psychol. hum. Percept. Perform. 15: 576–585 (1989).
Otake, T.; Hatano, G.; Cutler, A.; Mehler, J.: Mora or syllable? Speech segmentation in Japanese. J. Mem. Lang. 32: 358–378 (1993).
Perkell, J.S.; Klatt, D.H.: Invariance and variability in speech processes (Erlbaum, Hillsdale 1986).
Pickett, J.M.; Pollack, I.: Intelligibility of excerpts from fluent speech: effects of rate of utterance and duration of excerpt. Lang. Speech 6: 151–164 (1963a).
Pollack, I.; Pickett, J.M.: Intelligibility of excerpts from fluent speech: auditory vs. structural content. J. verbal Learn. verbal Behav. 3: 79–84 (1963b).
Saffran, J.R.; Aslin, R.N.; Newport, E.L.: Statistical learning by 8-month-old infants. Science 274: 1926–1928 (1996).
Sharkey, N.E.; Mitchell, D.C.: Word recognition in a functional context: the use of scripts in reading. J. Mem. Lang. 24: 253–270 (1985).
Stanovich, K.E.; West, R.F.: On priming by a sentence context. J. exp. Psychol. gen. 112: 1–36 (1983).
Swanson, L.A.; Leonard, L.B.; Gandour, J.: Vowel duration in mothers’ speech to young children. J. Speech Hear. Res. 35: 617–625 (1992).
Swingley, D.; Pinto, J.P.; Fernald, A.: Continuous processing in word recognition at 24 months. Cognition 71: 73–108 (1999).
Tanenhaus, M.; Spivey-Knowlton, M.; Eberhard, K.; Sedivy, J.: The interaction of visual and linguistic information in spoken language comprehension. Science 268: 1632–1634 (1995).
Tyler, L.K.; Wessels, J.: Quantifying contextual contributions to word recognition processes. Percept. Psychophys. 34: 409–420 (1983).
Warren, R.M.: Perceptual restoration of missing speech sounds. Science 167: 392–393 (1970).
Werker, J.F.; McLeod, P.J.: Infant preference for both male and female infant-directed talk: a developmental study of attentional and affective responsiveness. Can. J. Psychol. 43: 230–246 (1990).
Werker, J.F.; Tees, R.C.: Cross-language speech perception: evidence for perceptual reorganization during the first year of life. Inf. Behav. Dev. 7: 323–333 (1983).
En Route to Adult Spoken Language. Language Development
Phonetica 2000;57:255–266
Received: October 5, 1999 Accepted: April 14, 2000
The Construction of a First Phonology
Marilyn M. Vihman a, Shelley L. Velleman b
a School of Psychology, University of Wales at Bangor, Gwynedd, UK; b Department of Communication Disorders, University of Massachusetts at Amherst, Mass., USA
Abstract
Although it is generally accepted that phonological development is grounded in phonetic learning, there is less agreement on the proposition supported here: that the first phonological structuring constitutes a developmental discontinuity. Data from the phonetic and lexical learning of Finnish consonant duration are presented to illustrate the role of (1) child selection of adult words for early context-supported production based on phonetic learning and (2) child adaptation of adult words to an idiosyncratic template for later production as part of an incipient system. We argue that the latter, but not the former, reflects the construction of a first phonology.
Copyright © 2000 S. Karger AG, Basel
Introduction
The chief difficulty for a constructivist account of language development is explaining the origin of specifically linguistic structure. We will argue that the phonetic grounding of phonology provides an entry point into linguistic structure, but that there is discontinuity as well as continuity in the course of the transitions from babbling to word production and the first phonological organization. Our main goal here will be to establish that an incipient phonological system can be identified within the singleword period and that it can be seen to emerge out of – yet remain qualitatively distinct from – phonetic structure. The view of phonological development presented in Lindblom [1999, p. 13] is one which emphasizes continuity and emergence. Specifically, development is guided by a physiological economy principle which helps the child solve the degrees of freedom problem …and spontaneously discover many patterns in the ambient language (motor boot-strapping)… There is no split between analog phonetics and digital phonology because, from the developmental point of view, phonology remains behavior and continues to be analog. Phonology differs qualitatively from phonetics in that it represents a new, more complex and higher level of organization of that behavior. For the child, phonology is not abstract. It represents an emergent patterning of phonetic substance.
But what is this emergent patterning? What is the process and, more importantly, what is the ‘qualitatively different’ outcome of this ‘self-organization’? If we accept
that phonological structure arises gradually out of the phenomena of phonetics, we should expect a continuous incremental improvement in the child’s ability to approximate adult word targets. Lindblom [1998, 2000], for example, proposes a gradual filling in of a phonetic/gestural grid, in which words that share movement patterns with existing words are favored. By incrementally including novel combinations of new and old movement patterns the child eventually masters all of the phonetic possibilities of the language. Under this proposal no discontinuities need be posited.
We take a different position here. We suggest that discontinuity as well as continuity is evident in speech development. The first words, which typically show little interrelationship in production and which closely resemble the repertoire of babbling patterns of the individual child [Vihman et al., 1985], are seen as the product of the child’s matching of own production patterns to input [‘the articulatory filter’: Vihman, 1993], resulting in the selection of words to say on phonetic grounds [Ferguson and Farwell, 1975]. To this point there is little difference between our view and Lindblom’s [1998] ‘ambient’ and ‘lexical recalibration’. However, new word forms are more than a mechanical or automatic extension of any or all existing phonetic structures. Rather, certain phonetic structures are exploited and generalized (abstracted) while others are set aside, even seemingly forgotten. Different children arrive at different solutions to (1) the conflict between their phonetic skills and the demands of the ambient language and (2) the representational problem posed by the learning of a large and arbitrary set of patterned sound-meaning pairs (words and/or phrases). In many children, these solutions consist of recognizable word templates [Menn, 1978], the ‘rationalization’ and extension of well-practiced word patterns developed in first word production.
Over time, these templates – nascent phonological systems – break (partially) free of their phonetic foundations and impose themselves upon lexical items that do not fit the structural description of the template, leading to the adaptation of adult words; that is, patterns that were originally phonetically motivated are applied to contexts in which phonetic motivation is lacking. These ‘distortions’ of adult word shapes, sometimes resulting in ‘regressions’ – in which words previously produced more or less accurately now become less accurate in terms of the adult model but more in line with the child’s emergent phonological patterning – defy direct phonetic explanation. As noted by Ohala [1990, p. 163], ‘the natural patterns which arise in language can become fossilized in the mental grammar or lexicon of speakers without their natural roots being preserved’. The emergence of phonological organization in this sense can only be detected in a child-based distributional analysis, in which the extent of parallel patterning across different word forms is established and tracked longitudinally [Vihman and Velleman, 1989].
A Developmental Study of the Shift from Phonetics to Phonology
The natural variation across languages in the use of segmental quantity as an element in phonological structure provides a good testing ground for the emergence of phonology. One advantage of the study of the acquisition of quantitative contrast is that it combines both segmental and suprasegmental or prosodic aspects, and affects phonotactic structure as well. Another advantage is that the issue readily lends itself to both phonetic and phonological analysis. Here we focus on the acquisition of geminate consonants in Finnish, with data from English and French as points of comparison.
256
Phonetica 2000;57:255–266
Vihman/Velleman
Table 1. Mean age of children at sampling points

Language    4-word point    25-word point
American    11.5            16a
Finnish     15              18
French      11.5            17

Each language group included 5 children; mean ages are in months.
a Based on the first 5 participants (out of 10) to reach the 25-word point.
Acoustic analysis highlights the children’s increasingly language-specific phonetic skill, while phonological analysis illuminates the emergence of more abstract, less phonetically bound patterning. The basic question for an empirical phonetic study of the development of geminates is what kind of evidence we can accept as indicating emergent phonological opposition. In principle, we expect to find that an initially undifferentiated ‘durational space’ for child medial consonants will divide into a bimodal short: long pattern, and also that a tie-in with the adult lexicon will become evident, such that tokens of words with medial singletons in the target will fall into the short range while words with target medial geminates will fall into the long range. (Note that for the sake of a focussed analysis we are omitting any study of the equally contrastive long vs. short vowels of Finnish.) Our phonetic analyses will concentrate on tracking the length of medial consonants in both the word and nonword vocalizations of 5 children in each of the three language groups at two points in time, one at the outset of lexical development, the other toward the end of the single-word period. We will then look at the specific lexical patterns of individual children acquiring Finnish. We ask whether there is evidence that geminates play a particular role in the initial organization of phonology for 1 or more of the children, and if so, how that is manifested. Finally, we consider the relationship of the first phonological patterning to the phonetic learning out of which it emerges.
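The logic of this test can be stated computationally: divide the child’s medial-consonant durations into a short and a long cluster, then ask whether the division lines up with the singleton/geminate status of the target words. The sketch below is a toy illustration of that logic only; the two-means procedure and the example tokens are invented for exposition and are not the analysis reported under Method.

# Toy illustration (Python) of the bimodality test sketched above.
# The two-means split and the example durations are invented; the
# study's actual measurements are described under Method.
import numpy as np

def two_means_boundary(durations, iters=50):
    """1-D two-means clustering: return the midpoint between the
    short-cluster and long-cluster means (assumes both clusters
    remain non-empty)."""
    d = np.asarray(durations, dtype=float)
    lo, hi = d.min(), d.max()
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        lo = d[d <= mid].mean()
        hi = d[d > mid].mean()
    return (lo + hi) / 2.0

# (duration in ms, does the target contain a medial geminate?)
tokens = [(110, False), (95, False), (140, False), (320, True),
          (365, True), (130, True), (290, True), (105, False)]

cut = two_means_boundary([d for d, _ in tokens])
agree = sum((d > cut) == gem for d, gem in tokens) / len(tokens)
print(f"short/long boundary near {cut:.0f} ms; "
      f"{agree:.0%} of tokens fall on the side their target predicts")

A child whose tokens split cleanly and agree with their targets would show the bimodal, lexically tied pattern we are looking for; a child still exploring the durational space would not.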
Method

Data at two word points from 5 children each acquiring English, French or Finnish were analyzed acoustically (see table 1 for child age at each session). Data for English were collected in California, USA, for French in Paris, France. The word points are defined on the basis of the number of words produced spontaneously in a 30-min audio- and video-recorded session of unstructured play between mother and child; a small microphone was hidden inside a soft vest worn by the child. Words were identified according to the procedures described in Vihman and McCune [1994]. The earlier developmental point is the first one or two sessions in which the child produces at least four different word types spontaneously; the later point is based on approximately 25 spontaneous words produced in a session. Each of these lexically based developmental points corresponds to a cumulative lexicon of approximately twice as many words by parental record [Kunnari, 2000], i.e. 8–10 and 50+ words, respectively.
The Finnish data were collected by Sari Kunnari [2000] as part of her PhD thesis. Ten first-born children acquiring Finnish in Oulu, 5 girls and 5 boys, were video-recorded monthly from age 5 months until they had a cumulative vocabulary of about 50 words according to parental diary records. A high-quality microphone was placed near parent and child as they interacted in unstructured play sessions.
Table 2. Distribution of Finnish geminates

     Words with geminates    Total words    Proportion geminate words (SD)
A    199                     518            0.38 (0.05)
B    69                      142            0.49 (0.10)
C    112                     276            0.55 (0.13)

A = Content words in mothers’ speech to 5 children at 4-word point; B = target words (words attempted by 5 children) at 25-word point; C = child words produced by 5 children at 25-word point.
Table 3. Medial consonant length in English, French and Finnish

            4-word point             25-word point
            mean, ms (group SD)      mean, ms (group SD)
English     207.97 (82.5)            121.87 (28.81)
French      149.56 (43.68)           139.98 (8.18)
Finnish     205.74 (46.99)           297.82 (96.07)

Each language group included 5 children.
In addition to the transcription of infant vocalizations, 5 mothers’ speech to their child was transcribed and analyzed for words used and for the distribution of consonants, all based on 4-word point sessions. The same word identification procedures were carried out for Finnish as for English and French [Kunnari, 2000].
We restrict our acoustic analyses to disyllables (both identifiable words and babble or unidentifiable words); these form the overwhelming majority of multisyllabic productions in all three languages. Utterances selected for inclusion minimally contained two open (vocalic) phases separated by a closed (consonantal) phase; we included every disyllable which lent itself to objective analysis by the methods available. Disyllables with interfering talking or other noise were not used. We also included no more than three successive repetitions of a single vocal type (segmentally consistent child word shape or babbled sound sequence), on the grounds that a ‘prosodic set’ could be inferred in such cases, which might bias the results. Phonetic transcriptions of the sessions were consulted for information as to word identity and perceived segmental sequence; the tokens were retranscribed as necessary on the basis of the additional acoustic information, especially as regards voicing.
The children’s disyllabic vocalizations were extracted from the original recordings and digitally recorded using a 16-bit Audiomedia board at a sampling rate of 22.2 kHz. They were acoustically analyzed using the SoundScope Speech Analysis Package implemented on a Power PC. Segmentation followed rules devised as part of an earlier study [Vihman et al., 1998]. Medial consonant duration measurements were made using concurrent information from the amplitude trace, narrow- and wideband spectrograms, and intensity curve. Additionally, time scales were expanded on separate screens, to allow closer examination of transition points from C to V or V to C [Vihman et al., 1998, fig. 2].
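To make the measurement step concrete, the sketch below shows one way a medial closure can be located automatically from a short-time energy envelope. It is a simplified stand-in for the SoundScope workflow just described, which also used spectrograms and intensity curves; the frame size, threshold, and file name here are invented, and an energy criterion alone is too crude for voiced medials such as nasals and liquids, which need spectrographic checking.

# Simplified envelope-based estimate (Python) of medial consonant
# duration in a CVCV disyllable. Illustrative only; see the caveats
# in the text above.
import numpy as np
from scipy.io import wavfile

def medial_closure_ms(path, frame_ms=5.0, rel_threshold=0.25):
    sr, x = wavfile.read(path)            # assumes a quiet mono recording
    x = x.astype(np.float64)
    frame = max(1, int(sr * frame_ms / 1000.0))
    n = len(x) // frame
    # Short-time RMS energy, one value per frame.
    env = np.sqrt((x[:n * frame].reshape(n, frame) ** 2).mean(axis=1))
    # Frames above threshold are taken as the open (vocalic) phases.
    vocalic = np.flatnonzero(env > rel_threshold * env.max())
    if vocalic.size < 2:
        return None                        # fewer than two vocalic frames
    gaps = np.diff(vocalic)
    if gaps.max() <= 1:
        return None                        # no internal low-energy phase
    # The largest run of sub-threshold frames between vocalic frames
    # is taken as the medial (consonantal) closure.
    return float(gaps.max() - 1) * frame_ms

# print(medial_closure_ms("token.wav"))    # hypothetical file name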
Results

Finnish Geminates
Table 2 provides an overview of the distribution of geminates in our Finnish data, in content words in 5 mothers’ speech (A), in words attempted or ‘targeted’ by 5 children (i.e. adult word types, identified as the source of child word productions rather than through transcription of maternal input, B), and in word shapes as produced by
Fig. 1. Developmental change in range of medial consonant durations.
those children (C). As in other studies of the phonetics of mothers’ speech in different language groups [Vihman et al., 1994], the mothers are similar (SD 0.05), while the children are more diverse, both in the percentage of geminate words attempted (target words: SD 0.10) and in words produced with a geminate in at least one token (based on transcription: SD 0.13). About half the words attempted included geminates. The 5 children included in this study were selected from the 10 participants in Kunnari’s [2000] sample. They include 3 who attempted many words with geminates (Atte, for whom 58% of all words attempted had medial geminates, Eelis, 55% such words, and Eliisa, 53%) and 2 who attempted fewer such words (Venla, 42% and Mira, 34%).

Acoustic Analyses
The group results are given in table 3. The children produce a wide range of medial consonant durations at the 4-word point. At the 25-word point the groups diverge in clear relationship to the structure of the adult languages. The children acquiring English and French, languages which make no contrastive use of consonantal length, show a decrease in both mean length and group heterogeneity with lexical advance. Furthermore, the individual children’s mean variability also drops (from an SD of 72 to 50 for English and 111 to 54 for French). The Finnish data show the opposite
Fig. 2. Medial consonant lengths at 4-word point: Atte and Eliisa.
trend in every respect: The mean consonantal duration and individual child variability increase while the group of children also becomes more diverse. As a result, the Finnish children’s mean durations differ significantly from those of the French and English children at the 25-word point (t test for independent means: p = 0.0001).
Figure 1 displays these results from a different perspective. Here the range (maximum minus minimum) of each child’s medial consonant durations is shown for the two developmental points in all three language groups. (Range is used here explicitly because of the influence that outliers have on this measure. It is the distribution of values, rather than their means, which is the focus of this figure.) Although there is overlap among languages at the 4-word point, we see that the most extreme ranges are already attributable to Finnish children.1 At the 25-word point, on the other hand, all children in each group are converging on a range of medial consonant values. Thus, the American and French children cluster at the lower duration range values, reflecting the absence of geminates in the adult languages, and the Finnish children cluster at higher values, apparently reflecting the salience of long consonants. Some of the Finnish children develop sufficient sensitivity to the duration-based contrast in the adult language to begin to have what could be seen as a ‘bimodal’ profile by the 25-word point, prefiguring a full-blown phonological contrast, while others continue to explore a full range of consonantal durations. To illustrate this we will now consider the individual developmental profiles of 2 Finnish children, then the relationship of those profiles to the words attempted. Lastly we will turn to incipient phonological organization.
Figure 2 shows different starting points for 2 Finnish children. Atte shows separate peaks, at about 100–149 and about 300–400 ms. Eliisa also shows two peaks, but both are at the short end of the range; she also has a few vocalizations with extremely
1 At this point, t tests for independent means indicate that the Finnish children’s means are already significantly different from those of the French children (p = 0.0001), but the English children's are significantly different from the French as well (p = 0.0033); English and Finnish do not differ statistically.
Fig. 3. Medial consonant lengths at 25-word point: Atte and Eliisa.
Fig. 4. Distribution of medial consonant length in relation to the length status of the medial consonant in the target. a Atte, 25-word point. b Eliisa, 25-word point.
long medial consonants. At the 25-word point both children have begun to produce more long medial consonants (fig. 3). Although Eliisa has something of a bimodal distribution, Atte’s profile now appears less bimodal and also less organized than at the 4-word point. To supplement these profiles based on proportions of vocalizations at each medial consonant duration range, in figure 4 we plot for each child the distribution of consonantal length in relation to the length status (singleton vs. geminate) of the target medial consonant (for the later lexical point only, when more vocalizations are identifiable as words). Eliisa, but not Atte, appears to have largely sorted out words with singletons from words with geminates by the 25-word point. All of her longer productions have a geminate word as target, while target words with short medial consonants are produced within the shorter range – as are a few geminate words as well, however. She appears to be responding to the categorical nature of medial consonant duration in Finnish, and has made a preliminary, quasi-appropriate mapping of these categories to specific lexical items.

Word Templates
We have seen that by the 25-word point the Finnish children’s phonetic production is sharply distinguishable from that of children acquiring languages with no phonological quantity contrast, showing a mean length double that of the other two language groups. Yet we have also seen that individual children respond differently to the challenge posed by the adult phonological opposition. That is, the general increase in medial consonant length suggests that all of the Finnish children have apparently ‘noticed’ the adult long medial consonants, which are perceptually very distinctive.2 Furthermore, long consonant production is not difficult, as is apparent from the relatively longer medial consonants produced at the 4-word point by children acquiring English and French as well as the greater proportion of singleton consonants produced long by the Finnish children at the 25-word point (23%) than vice versa (10%), based on review of the transcribed words. Thus, the phonetic groundwork has been laid. However, it is apparent that not all of the Finnish children are able, by the 25-word point, to coordinate appropriately the (phonological) representation of specific long- and short-consonant words with the (phonetic) production plans needed to match them.
Achieving the ability to produce differing consonants in a sequence of syllables is one of the key developmental challenges for both phonetic and phonological learning. Implicit phonetic learning in the period of babbling and first words results in the subtle shaping of child vocalizations in the direction of the adult language [shifting frequencies of production of segmental and prosodic categories toward ambient language values; de Boysson-Bardies and Vihman, 1991]. However, rather than gradually adding compatible new words to all existing movement patterns, many children exploit and generalize certain phonetic structures only, setting others aside despite the fact that the latter are already phonetically ‘in place’. This next step, towards representation-based and categorical or ‘phonological’ patterning, can be seen to lead in some cases to radical restructuring of adult target forms to fit the child’s preferred word template(s) as the child attempts to match a variety of target words to this subset of the patterns over
2 Richardson [1998] has demonstrated that 6-month-olds exposed to Finnish discriminate the synthetically designed nonsense forms [ata] from [atta] at a category boundary similar to that of adults.
Table 4. Early word templates in Finnish

          Single consonant              Consonant sequence
Child     (C)VV(C)        (C)V.V        VCV             C1VC1V          C1VC2V
Atte      –               –             16/26 (0.61)    10/26 (0.38)    –
Eelis     11/33 (0.33)    –             14/33 (0.39)    8/33 (0.27)     –
Eliisa    3/31 (0.10)     4/31 (0.13)   7/31 (0.23)     17/31 (0.55)    –
Mira      4/32 (0.13)     –             1/32 (0.03)     22/32 (0.69)    5/32 (0.16)
Venla     2/24 (0.08)     –             7/24 (0.29)     15/24 (0.63)    –
Mean      0.19            0.02          0.31            0.39            0.09

C1…C1 indicates a consonant harmony constraint. Underlining indicates that the template applies to words with geminates only. The proportion of each child’s distinguishable tokens accounted for by each pattern is indicated in parentheses. Italics are used for all patterns that account for more than 33% of the child’s tokens.
which s/he has phonetic control [for a review of the literature supporting this point, see Vihman, 1996]. An illustration from our Finnish data of the shift from earlier accurate word production to later ‘regression’ is provided in 1 child’s attempt at saying /kuk:a/ ‘flower’, produced variably at 1;1.10 as [kukˆka], [kuk:o] (whispered) but several times at 1;2.8 as [ak:u] (the second syllable whispered), with omission of the initial consonant and also metathesis of the vowels (leading the child’s mother to comment, ‘well, the vowels got a little mixed up…’). Typically, the majority of tokens of all word types conform to just one or two templates, with a small residue of forms falling outside the systematized pattern(s).
We begin our consideration of incipient phonological organization in Finnish by summarizing the results of an analysis of the word tokens transcribed at the 25-word point. In the second row of the matrix in table 4 are indicated the five phonological patterns found in the children’s identifiable word productions at the 25-word point. In each case the child’s word types and tokens could be exhaustively categorized as falling into one or more of these patterns – namely, monosyllables (e.g. Eelis /pois/ ‘off, away’ [poi]), disyllables lacking a medial consonant (Eliisa /kirja/ ‘book’ [ki:a]), disyllables lacking an initial consonant (‘null onset’: Atte /aŋk:a/ ‘duck’ [ak:a]), disyllables with harmonizing consonantal onsets (Mira /nenæ/ ‘nose’ [nenæ]), and disyllables with differing consonants in syllable onset (Mira /pel:e/ ‘clown’ [pel:e]).
These patterns are based on child forms only. It is also informative to consider them in relation to the adult target. Here we distinguish two subtypes: (a) selected forms are those that derive directly from the target (allowing for changes in vowel or consonant that do not affect the basic structure; we take ‘selection’ to result from the child’s implicit mapping of perceived word forms onto his or her existing production patterns), while (b) adapted forms are those that reflect more radical changes to the adult target. Specifically, adaptation includes consonant assimilation (also known as consonant harmony), consonant omission, which results in V.V or VCV patterns, and syllable omission, which results in reducing adult disyllables to monosyllabic child forms. The illustrations provided above are all ‘selected’ except that for missing medial consonant, which does not occur in the target words. Examples of ‘adapted’ forms are Eelis /ki:n:i/ ‘closed’ [ki:] (final syllable omitted), Mira /lop:u/ ‘end, finished, all done’ [pop:u] (consonant harmony), and Atte /pal:o/ ‘ball’ [al:o] (initial consonant omitted).
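The select/adapt coding just illustrated can also be summarized procedurally. The sketch below is a toy rendering of the decision rules named in the text (syllable omission, onset-consonant omission, consonant harmony): forms are entered pre-syllabified as (onset, rhyme) pairs, and both this representation and the rules are simplifications for exposition, not the coding procedure actually used in the study.

# Toy rendering (Python) of the select/adapt distinction. The
# pre-syllabified representation and the decision rules below are
# simplifications for illustration only.

def onsets(syllables):
    """Onset consonants in order, with length marks (':') stripped."""
    return [on.rstrip(":") for on, _ in syllables if on]

def classify(child, target):
    """Label a child form 'adapted' if it omits a syllable or an onset
    consonant of the target, or harmonizes onsets that differ in the
    target; otherwise 'selected'."""
    if len(child) < len(target):
        return "adapted"                          # syllable omission
    for (c_on, _), (t_on, _) in zip(child, target):
        if t_on and not c_on:
            return "adapted"                      # onset omission
    c_ons, t_ons = onsets(child), onsets(target)
    if len(c_ons) > 1 and len(set(c_ons)) == 1 and len(set(t_ons)) > 1:
        return "adapted"                          # consonant harmony
    return "selected"

# Examples from the text (child form vs. adult target):
print(classify([("k", "i:")],
               [("k", "i:"), ("n:", "i")]))       # adapted: [ki:] for /ki:n:i/
print(classify([("p", "o"), ("p:", "u")],
               [("l", "o"), ("p:", "u")]))        # adapted: [pop:u] for /lop:u/
print(classify([("", "a"), ("l:", "o")],
               [("p", "a"), ("l:", "o")]))        # adapted: [al:o] for /pal:o/
print(classify([("n", "e"), ("n", "æ")],
               [("n", "e"), ("n", "æ")]))         # selected: [nenæ] for /nenæ/

Rules of this kind mirror the categories tallied in tables 4 and 5, though the actual coding also handled medial clusters, vowel changes, and variable tokens that this sketch ignores.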
Table 5. The role of geminates in Finnish early word templates

              Adult word type
              medial C                          medial CC                         Total
              select   adapt      total         select   adapt      total
CH            10       9          19 (0.31)     19       23         42 (0.69)     61
Null onset    15       3          18 (0.45)     14       8          22 (0.55)     40
Totals        25       12 (0.32)  37 (0.38)     31       33 (0.52)  64 (0.62)     101

Adult word targets are categorized by medial consonant: singleton or geminate. The children’s forms are tallied for the most salient patterns found in these data, consonant harmony (CH) and null onset (missing onset consonant). Numbers of word types exhibiting each pattern are tallied; proportions are in parentheses.
In short, adapted forms depart from the phonotactic structure of the adult word, resulting in each case in a child form closer to other forms currently being produced by the child in question but farther from the adult model than ‘selected’ forms, which show minimal vowel and consonant changes only. Note that the distinction between select and adapt is not primarily based on developmental level, since although we primarily find selected forms among first words, both selected and adapted forms occur later in the single-word period, as illustrated above. Table 5 indicates the extent to which the singletons vs. geminates enter into patterns which are selected (derive from existing adult patterns) vs. adapted (generalized from preexisting child patterns) to yield the child forms.3 We see that adaptation affects only about one third of the word types attempted with medial singletons but fully half of the word types with medial geminates. Looked at from the point of view of the individual patterns, (1) of the 61 child forms which harmonize, 69% derive from adult targets with geminates and (2) of the 40 onsetless child disyllables, 55% derive from geminate targets. It is important to note that geminate words are overrepresented in these patterns compared to their incidence overall (cf. table 2: only 38% geminate words in the input, 49% geminate words attempted, 55% geminate words produced, but 62% geminate words fitting the two child template patterns.) The steady ‘augmentation’ of the geminate pattern provided in adult words – first by child selection of geminate targets, then by child production of geminates, and finally by child construction of templates around geminates – provides strong evidence of the tendency of individual children, in constructing a first phonological system, to move beyond (phonetically gradual) accommodation to the adult patterns as they capitalize on phonetically salient features (in this case, geminates) that are also within their production capacities, adapting adult forms to their own production patterns and thereby creating phonological structures.
3 We omit the 18 words with medial heterogeneous consonant clusters from this tally (amounting to 22% of the clusters in target words), since they are variously produced as singletons, geminates, or clusters by the children.
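For readers who want to verify the tallies, the proportions quoted above can be recomputed directly from the type counts in table 5; this short sketch is ours, not part of the original analysis.

```python
# Recompute the table 5 proportions from the raw word-type counts:
# (select, adapt) counts for singleton (C) and geminate (CC) targets.
counts = {
    "CH":         {"C": (10, 9), "CC": (23, 19)},
    "null onset": {"C": (15, 3), "CC": (8, 14)},
}

for pattern, row in counts.items():
    c, cc = sum(row["C"]), sum(row["CC"])
    # proportion of this pattern's word types deriving from geminate targets
    print(f"{pattern}: {cc}/{c + cc} geminate = {cc / (c + cc):.2f}")
    # -> CH: 42/61 = 0.69; null onset: 22/40 = 0.55

adapt_c  = sum(r["C"][1] for r in counts.values())        # 12 adapted singleton types
adapt_cc = sum(r["CC"][1] for r in counts.values())       # 33 adapted geminate types
total_c  = sum(sum(r["C"]) for r in counts.values())      # 37 singleton types
total_cc = sum(sum(r["CC"]) for r in counts.values())     # 64 geminate types
print(f"adapted | singleton targets: {adapt_c / total_c:.2f}")    # ~0.32
print(f"adapted | geminate targets:  {adapt_cc / total_cc:.2f}")  # ~0.52
```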
Discussion
The findings reported in this paper have demonstrated, once again, the role of the ambient language in shaping the phonetics of early word forms. Whereas the first words tend to show wide variability in the length of medial consonants, both within and across children in each of the three language groups (as evidenced by the standard deviations in table 3 and the high mean duration ranges at the 4-word point in figure 1), by the time the children have a cumulative vocabulary of 50 words or more they have begun to align their production of medial consonants with the prevailing values heard in the input. To this extent, advances reflect ambient language influence and developmental continuity.

This is not the whole story, however. The evidence we adduced from the patterning of the Finnish children’s word productions at the later developmental point shows the increasingly important role of linguistic structure. First, the geminates provide a salient phonetic ‘pull’, resulting in a higher proportion of words with geminates than is found in the input. Secondly, medial geminates characterize 62% of the word types fitting into the two predominant word patterns for Finnish children. In effect, it appears that geminate consonants are exerting a strong influence on the shape of early word production templates, overriding the role of Finnish trochaic stress, for example, which typically results in initial consonant preservation in children acquiring the best documented language, English [Hammond, 1997]. Although the children’s knowledge of geminates (based on their exposure to the input, which supports production of long consonants beyond the period in which it is found in English and French) is clearly the source of the observed increase in numbers of geminate words produced, the children go beyond this phonetic grounding, selecting this pattern from among those with which they have perceptual and articulatory experience, and then using it to organize their lexicons according to a small number of patterns, even to the extent of adapting formerly more phonetically accurate words into their phonological templates.

The model we are proposing assumes the continuity of phonetics and an ‘intimate interaction’ between phonetics and phonology [Ohala, 1990, p. 156] throughout development. However, discontinuity is present as well: the onset of phonological systematization is superimposed upon ongoing phonetic learning, and in time this system begins to take a primary role in shaping output. In production the child takes advantage of phonetic grounding but focuses at any one time on only a subset of the word forms made available by prior phonetic experience, thereby developing a restricted, idiosyncratic first phonological system. Templates are typically based upon salient features of the ambient language but also reflect the particular patterns derived from the individual child’s prior phonetic learning. Not only are some of the phonetic forms previously practiced by the child no longer available, but child adaptations of target forms sometimes result in those forms becoming less accurate. These nascent template-based systems signal the onset of truly phonological organization.
References

Boysson-Bardies, B. de; Vihman, M.M.: Adaptation to language. Language 67: 297–319 (1991).
Ferguson, C.A.; Farwell, C.B.: Words and sounds in early language acquisition. Language 51: 419–439 (1975).
Hammond, M.: Optimality theory and prosody; in Archangeli, Langendoen, Optimality theory, pp. 33–58 (Blackwell, Malden 1997).
Kunnari, S.: Characteristics of early lexical and phonological development in children acquiring Finnish. Acta Univ. Oul. B 34 Humaniora (2000).
Lindblom, B.: Systemic constraints and adaptive change in the formation of sound structure; in Hurford, Studdert-Kennedy, Knight, Approaches to the evolution of language, pp. 242–264 (Cambridge University Press, Cambridge 1998).
Lindblom, B.: Emergent phonology. Berkeley Linguistics Society, 25 (BLS, University of California at Berkeley 1999).
Lindblom, B.: Developmental origins of adult phonology: the interplay between phonetic emergents and the evolutionary adaptations of sound patterns. Phonetica 57: 297–314 (2000).
Menn, L.: Pattern, control, and contrast in beginning speech (Indiana University Linguistics Club, Bloomington 1978).
Ohala, J.: There is no interface between phonology and phonetics: a personal view. J. Phonet. 18: 153–171 (1990).
Richardson, U.: Familial dyslexia and sound duration in the quantity distinctions of Finnish infants and adults. Stud. philol. Jyväskyläensia 44: 9–211 (1998).
Vihman, M.M.: Variable paths to early word production. J. Phonet. 21: 61–82 (1993).
Vihman, M.M.: Phonological development (Blackwell, Oxford 1996).
Vihman, M.M.; DePaolis, R.A.; Davis, B.L.: Is there a ‘trochaic bias’ in early word learning? Child Dev. 69: 935–949 (1998).
Vihman, M.M.; Kay, E.; Boysson-Bardies, B. de; Durand, C.; Sundberg, U.: External sources of individual differences? Devl. Psychol. 30: 651–662 (1994).
Vihman, M.M.; McCune, L.: When is a word a word? J. Child Lang. 21: 517–542 (1994).
Vihman, M.M.; Macken, M.A.; Miller, R.; Simmons, H.; Miller, J.: From babbling to speech. Language 61: 397–445 (1985).
Vihman, M.M.; Velleman, S.L.: Phonological reorganization. Lang. Speech 32: 149–170 (1989).
Auditory Constraints on Sound Structures
Phonetica 2000;57:267–274
Received: October 18, 1999 Accepted: March 12, 2000
Searching for an Auditory Description of Vowel Categories
Randy L. Diehl
Department of Psychology, University of Texas, Austin, Tex., USA
Abstract

This paper examines three auditory hypotheses concerning the location of category boundaries among vowel sounds. The first hypothesis claims that category boundaries tend to occur in a region corresponding to a 3-Bark separation between adjacent spectral peaks. According to the second hypothesis, vowel category boundaries are determined by the combined effects of the Bark distances between adjacent spectral peaks, with the weight of each effect inversely related to the size of the corresponding Bark distance. In a series of perceptual experiments, each of these hypotheses was found to account for some category boundaries in American English but not others. The third hypothesis, which has received preliminary support from studies in our laboratory and elsewhere, claims that listeners partition the vowel space of individual talkers along lines corresponding to relatively simple linear functions of formant values when scaled in auditorily motivated units of frequency such as Bark. Copyright © 2000 S. Karger AG, Basel
Introduction
A key claim within Björn Lindblom’s theory of Emergent Phonology [Lindblom, 1986, 1990, this issue] is that segment inventories are structured, among other things, to enhance auditory distinctiveness among speech sounds. This is generally accomplished by maximizing the distances among sounds within the more accessible regions of the talker’s phonetic space [Lindblom, 1986]. The dispersion principle predicts, for example, what categories are most likely to appear in vowel inventories of a given size. However, without further assumptions, the principle does not make detailed predictions about the likely location of perceptual boundaries between categories. The present paper examines three auditory hypotheses related to boundary placement between vowel categories. The first of these hypotheses is due to Syrdal [1985] and derives from earlier work of Chistovich and her colleagues [e.g., Chistovich and Lublinskaya, 1979]; the second, referred to as the tonotopic distance hypothesis, is due to Traunmüller [1984], and the third is based on preliminary dissertation work done in our laboratory by Molis [1999].
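The dispersion principle mentioned above can be illustrated with a brute-force sketch: given a candidate set of vowel qualities located in a two-dimensional (F1, F2) Bark space, select the inventory of a given size that maximizes the smallest pairwise auditory distance. This is our simplified illustration, not the Liljencrants and Lindblom algorithm, and the formant values are rough, hypothetical ones.

```python
# Illustrative dispersion sketch: pick the k-vowel inventory that maximizes
# the minimum pairwise distance in (F1, F2) Bark space.
from itertools import combinations
from math import dist

# Rough, hypothetical (F1, F2) locations in Bark for some vowel qualities.
candidates = {
    "i": (2.5, 13.5), "e": (4.0, 12.0), "a": (7.0, 10.0),
    "o": (4.5, 7.0),  "u": (2.8, 6.5),  "y": (2.5, 11.0),
}

def min_pairwise(subset) -> float:
    return min(dist(candidates[v], candidates[w])
               for v, w in combinations(subset, 2))

best = max(combinations(candidates, 3), key=min_pairwise)
print(sorted(best))  # with these values: ['a', 'i', 'u'], the classic corner system
```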
Chistovich/Syrdal Hypothesis
Chistovich and Lublinskaya [1979] found that in matching vowel-like sounds, listeners appear to average adjacent spectral peaks if they fall within about three critical bandwidths, or Bark units, of each other. If the peaks are separated by more than about 3.5 Bark, they remain auditorily distinct from each other. Syrdal [1985] examined two large databases of American English vowels produced by men, women, and children, and concluded that the line of separation between [+high] and [–high] vowels occurred at an F1–f0 distance of 3–3.5 Bark, and that between [+back] and [–back] vowels occurred at an F3–F2 distance of about 3–3.5 Bark. Syrdal [1985] proposed that the 3-Bark limit of spectral averaging yields a kind of quantal perceptual effect, in the sense used by Stevens [1989]. In this case, the quantal distinction is between two distinct peaks in a region of the auditory spectrum versus a single averaged peak. In Syrdal’s [1985] account, f0 has a status analogous to a formant peak and may be subject to a similar spectral averaging effect if it occurs within 3 Bark of F1.

In our laboratory, we have carried out a series of studies designed to test the Chistovich/Syrdal hypothesis and related claims. In the first of these studies, Hoemeke and Diehl [1994] examined listeners’ judgments of vowel height among American English front vowels. We reasoned as follows. If the [+high]/[–high] distinction is positioned to exploit a quantal effect resulting from the operation of a 3-Bark integrator, then two outcomes would be predicted: (1) the [+high]/[–high] boundary should occur at an F1–f0 distance of 3–3.5 Bark, and (2) the [+high]/[–high] boundary should be perceptually sharper (i.e., show less variance) than other vowel height category boundaries. In the study, listeners identified synthetic tokens of vowels varying orthogonally in F1 and f0 across three series ranging between the categories /i/-/ɪ/, /ɪ/-/ɛ/, and /ɛ/-/æ/. Figure 1 shows scatterplots of vowel labeling performance as a function of F1–f0 distance in Bark. The middle frame (fig. 1b) corresponds to the /ɪ/-/ɛ/ distinction that spans the [+high]/[–high] boundary. Of the three vowel contrasts shown here, recall that this is the only one that Syrdal [1985] claims exploits the quantal boundary associated with the 3-Bark integrator. Two points are worth noting. First, as predicted by the hypothesis, the category boundary revealed in this middle scatterplot (fig. 1b) does indeed correspond to an F1–f0 distance of 3–3.5 Bark. Second, also as predicted, this boundary is considerably sharper (i.e., less variable) than either of the other two. We conducted regression analyses of labeling performance as a function of F1–f0 Bark distance and as a function of F1 in Bark. For the variable F1–f0 in Bark, the greatest amount of variance explained, or r², among the three vowel sets occurs for the middle set, with a value of 0.89. For the other two vowel sets, F1 alone explained more of the labeling variance than F1–f0.

Although these data are consistent with the Chistovich/Syrdal hypothesis, several follow-up studies were much less supportive. Fahey et al. [1996] attempted to replicate Hoemeke and Diehl’s [1994] findings on perceived vowel height, using American English back vowels. The logic of the predictions was the same as in the earlier study. In two experiments, F1 and f0 were orthogonally varied across three series: /u/-/ʊ/, /ʊ/-/ɔ/, and /ɔ/-/ɑ/ (first experiment), and /u/-/ʊ/, /ʊ/-/o/, and /o/-/ɑ/ (second experiment).
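Since the studies discussed here all turn on Bark-scaled distances, it may help to show the computation explicitly. The sketch below uses one common closed-form approximation to the Bark transform; the choice of formula and the token f0 and F1 values are our illustrative assumptions, not values drawn from the studies themselves.

```python
# Bark-distance computation used informally throughout this discussion.
def hz_to_bark(f_hz: float) -> float:
    # A common closed-form approximation to the Bark scale.
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

f0, f1 = 130.0, 480.0  # hypothetical token: male-range f0, mid-range F1 (Hz)
d = hz_to_bark(f1) - hz_to_bark(f0)
print(f"F1 - f0 = {d:.2f} Bark")
# Under the Chistovich/Syrdal hypothesis, f0 and F1 are spectrally averaged
# into a single perceived peak when this distance falls below ~3-3.5 Bark,
# and remain auditorily distinct otherwise.
print("averaged" if d < 3.5 else "distinct")  # -> distinct (3.61 Bark)
```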
For the first experiment (fig. 2), the scatterplots deviate from the predicted outcomes in two ways. First, the [+high]/[–high] boundary in figure 2b, as estimated from a regression analysis, occurs at a somewhat higher F1–f0 distance value than 3.5 Bark, although the discrepancy is not very large. Second, and more importantly, the variability in boundary location shown in figure 2b is sizable. Indeed, the regression analysis indicated that
it is greater than that shown in figure 2a, which does not correspond to a putative quantal perceptual boundary. For the second experiment, the results were very similar. Thus, this study failed to offer any convincing evidence for the Chistovich/Syrdal hypothesis. (Below, however, these results will be reevaluated in terms of a different hypothesis due to Traunmüller [1984].)

Fig. 1. Scatterplots of identification performance (panels a–c) for front vowels as a function of F1–f0 distance [Hoemeke and Diehl, 1994].

Fig. 2. Scatterplots of identification performance (panels a–c) for back vowels as a function of F1–f0 distance [Fahey et al., 1996].

In a recent study by Molis et al. [1998], we examined the Chistovich/Syrdal hypothesis with respect to the perception of the [+back]/[–back] distinction in American English.
Fig. 3. Identification results for [+back]/[–back] vowel series. a Percent /ʊ/ responses as a function of F2 (with F3 as the parameter). b Percent /ʊ/ responses as a function of F3–F2 [Molis et al., 1998].
If the [+back]/[–back] distinction is positioned to exploit a quantal effect resulting from the operation of a 3-Bark integrator in the region of F2 and F3 [Syrdal, 1985], then a shift in the value of F3 should produce a Bark-equivalent shift in the perceived [+back]/[–back] boundary along the F2 dimension. That is, the F3–F2 Bark distance corresponding to the perceptual boundary should be constant. To test this prediction, we had listeners identify three series of synthetic /ʊ/-/ɪ/ stimuli. Within a series the stimuli varied only in F2; each series differed only in F3. The identification results are shown in figure 3. Figure 3a shows that the location of the [+back]/[–back] boundary did not vary as a function of F3 but was determined by the value of F2 alone. The data are replotted in figure 3b, which shows identification performance as a function of F3–F2. According to the prediction of the Chistovich/Syrdal hypothesis, the three identification functions in figure 3b should be closely aligned. Plainly, the prediction is incorrect.
Fig. 4. Labeling variance explained by Bark distance as a function of category boundary position along that distance dimension. a Results for F2–F1 distance. b Results for F1–f0 distance [Fahey et al., 1996].
In summary, although some initial studies tended to support the role of a 3-Bark integrator in determining perceptual boundaries among vowel categories, other evidence calls the hypothesis into question as a general account of vowel boundaries.
Tonotopic Distance Hypothesis
We next consider the tonotopic distance hypothesis, due to Traunmüller [1984]. According to this hypothesis, the phonetic quality of vowels and the location of category boundaries between vowel categories are determined by the Bark distances between any adjacent spectral peaks (e.g., F3–F2, F2–F1, and F1–f0). Traunmüller [1984] obtained positive evidence for this hypothesis for cases in which the adjacent peak distances were smaller than about 6 Bark. Thus, a generalized version of the hypothesis is that all interpeak distances (among, say, the lower three formants plus f0) contribute to a category distinction, but the weight of each distance cue is negatively correlated with the magnitude of the distance. Recall the study by Fahey et al. [1996], described above, in which listeners judged height distinctions among back vowels. The results of that study failed to support the Chistovich/Syrdal hypothesis. However, the results are consistent with the Traunmüller hypothesis. Figure 4 plots the labeling variance explained by Bark distance as a function of category boundary position along that distance dimension. Figure 4a shows
results for F1–f0 distance, while figure 4b shows results for F2–F1 distance. Notice that, as predicted by Traunmüller [1984], the explanatory weight of each distance cue tends to fall as the Bark distance corresponding to the category boundary gets larger. Figure 5 combines the data from both distance measures into a single description that reflects the assumptions of the Traunmüller hypothesis. The data clearly suggest that when two distance cues are available to signal phonetic identity, listeners give greater weight to the smaller distance. Accordingly, we can conclude that the Traunmüller hypothesis does a good job of predicting category boundaries, at least in the case of back vowels varying in height. It is natural to ask: Does the hypothesis apply equally well to front vowels? Unfortunately, the answer appears to be ‘no’. Figure 6 shows results from a perceptual study of American English front vowels conducted by Wong and Diehl [1998]. The figure is analogous to the previous one in combining the two distance measures, F2–F1 and F1–f0, into a single description. Unlike the results for back vowels, there is no consistent relationship between the amount of variance explained by a distance cue and the magnitude of that distance. We conclude that neither the Chistovich/Syrdal hypothesis nor the Traunmüller hypothesis can account for more than a subset of the category boundaries that occur among vowels.

Fig. 5. Difference between the labeling variance explained by F1–f0 distance and the variance explained by F2–F1 distance, plotted as a function of the difference between these distances at the category boundaries. Data are for height judgments for American English back vowels [Fahey et al., 1996].

Fig. 6. Difference between the labeling variance explained by F1–f0 distance and the variance explained by F2–F1 distance, plotted as a function of the difference between these distances at the category boundaries. Data are for height judgments of American English front vowels [Wong and Diehl, 1998].

Fig. 7. Category boundary estimates for 1 native speaker identifying three five-formant American English vowels varying in second- and third-formant frequencies [Molis, 1999].
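The generalized tonotopic distance idea can be caricatured in a few lines of code: every adjacent interpeak Bark distance contributes to the judgment, with its weight falling as the distance grows. The inverse-distance weighting below is our own illustrative choice; Traunmüller [1984] does not commit to this particular function.

```python
# Sketch of inverse-distance cue weighting for the tonotopic distance idea.
def weighted_height_cue(peaks_bark: list[float]) -> float:
    """peaks_bark: ascending spectral peaks, e.g. [f0, F1, F2, F3], in Bark."""
    dists = [hi - lo for lo, hi in zip(peaks_bark, peaks_bark[1:])]
    weights = [1.0 / d for d in dists]  # smaller distance -> larger weight
    # Weighted mean of the interpeak distances (a hypothetical summary cue).
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)

print(weighted_height_cue([1.1, 4.7, 11.0, 14.5]))  # hypothetical token
```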
Molis’ Hypothesis
In our laboratory, Molis [1999] has recently examined listeners’ category judgments for vowel sets that vary in more than one dimension. She created a synthetic vowel set that varied in equal Bark steps in F2 and F3. The set ranged among the three categories /ɪ/, /ʊ/, and /ɝ/. The straight lines in figure 7 represent the category boundaries for 1 listener. For other listeners, the boundary lines tended to be parallel to the ones shown here but were often displaced slightly. What is most striking about these results is that the boundaries correspond to relatively simple linear functions. Thus, the boundary between /ʊ/ and /ɪ/ corresponds to the function F2 = 11.5 Bark. The boundary between
/ʊ/ and /ɝ/ corresponds to the function F3–F2 = 2.7 Bark, and finally the boundary between /ɝ/ and /ɪ/ corresponds to the function (F2+F3)/2 = 12.75 Bark. After Molis [1999] obtained these results, she searched the literature and found that Carlson et al. [1970] had reported quite similar results for Swedish using a mel scale, and Karnickaya et al. [1975] found parallel results for Russian when frequency was logarithmically scaled [see also Hose et al., 1983]. It should be noted that these earlier studies used two-formant vowels. Molis’ [1999] study is the first we are aware of to obtain such results with more natural five-formant stimuli. We are currently planning to examine vowel perception in other languages to evaluate the robustness of the generalization. If Molis’ [1999] results do generalize, the upshot may be that although no single auditory account can explain more than a subset of vowel distinctions, the range of possibilities is nevertheless quite restricted: boundary locations correspond to simple linear combinations of formant frequencies expressed in an auditorily motivated frequency scale. To Lindblom’s dispersion principle, we may be able to add the claim that boundaries between categories are expressible in auditorily quite simple terms.
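Read as decision rules, the three boundary functions partition the F2/F3 Bark plane into three regions. The sketch below is our inference from the reported functions; in particular, which side of each line belongs to which vowel follows from the vowels’ typical formant patterns (low F2 for /ʊ/, small F3–F2 for rhotic /ɝ/), and the rule ordering is a simplification not specified in the original report.

```python
# Linear category boundaries from Molis [1999], read as decision rules.
def classify_vowel(f2_bark: float, f3_bark: float) -> str:
    if f3_bark - f2_bark < 2.7:          # F3 close to F2: rhotic quality
        return "ɝ"
    if f2_bark < 11.5:                   # low F2: back rounded vowel
        return "ʊ"
    if (f2_bark + f3_bark) / 2 > 12.75:  # high F2/F3 mean: front vowel
        return "ɪ"
    return "ʊ"                           # residual region; hypothetical tie-break

print(classify_vowel(10.5, 14.0))  # -> ʊ
print(classify_vowel(12.5, 15.5))  # -> ɪ
print(classify_vowel(12.0, 14.0))  # -> ɝ
```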
Acknowledgments

The author thanks Michelle Molis for her insightful comments and for help preparing the figures. The work was supported by Research Grant No. 5 R01 DC00427-11 from the National Institute on Deafness and Other Communication Disorders, National Institutes of Health.
References

Carlson, R.; Granström, B.; Fant, G.: Some studies concerning perception of isolated vowels. Q. Prog. Status Rep., Speech Transm. Lab., R. Inst. Technol., Stockh., No. 2/3, pp. 19–35 (1970).
Chistovich, L.A.; Lublinskaya, V.V.: The ‘center of gravity’ effect in vowel spectra and critical distance between the formants: psychoacoustical study of the perception of vowel-like stimuli. Hear. Res. 1: 185–195 (1979).
Fahey, R.P.; Diehl, R.L.; Traunmüller, H.: Perception of back vowels: effects of varying F1–F0 Bark distance. J. acoust. Soc. Am. 99: 2350–2357 (1996).
Hoemeke, K.A.; Diehl, R.L.: Perception of vowel height: the role of F1–F0 distance. J. acoust. Soc. Am. 96: 661–674 (1994).
Hose, B.; Langner, G.; Scheich, H.: Linear phoneme boundaries for German synthetic two-formant vowels. Hear. Res. 9: 13–25 (1983).
Karnickaya, E.G.; Mushnikov, V.N.; Slepokurova, N.A.; Zhukov, S.J.: Auditory processing of steady-state vowels; in Fant, Tatham, Auditory analysis and perception of speech, pp. 37–53 (Academic Press, New York 1975).
Lindblom, B.: Phonetic universals in vowel systems; in Ohala, Jaeger, Experimental phonology, pp. 13–44 (Academic Press, Orlando 1986).
Lindblom, B.: Explaining phonetic variation: a sketch of the H&H theory; in Hardcastle, Marchal, Speech production and speech modeling, pp. 403–439 (Kluwer, Dordrecht 1990).
Lindblom, B.: Developmental origins of adult phonology: the interplay between phonetic emergents and the evolutionary adaptations of sound patterns. Phonetica, this issue.
Molis, M.: Perception of vowel quality in the F2/F3 plane. Proc. ICPhS 99, pp. 191–194 (San Francisco 1999).
Molis, M.; Diehl, R.L.; Jacks, A.: Phonological boundaries and the spectral center of gravity. J. acoust. Soc. Am. 103: 2981 (1998).
Stevens, K.N.: On the quantal nature of speech. J. Phonet. 17: 3–45 (1989).
Syrdal, A.K.: Aspects of a model of the auditory representation of American English vowels. Speech Commun. 4: 121–135 (1985).
Traunmüller, H.: Articulatory and perceptual factors controlling the age- and sex-conditioned variability in formant frequencies of vowels. Speech Commun. 3: 49–61 (1984).
Wong, P.C.M.; Diehl, R.L.: Effect of spectral center of gravity on vowel height perception. J. acoust. Soc. Am. 103: 2981 (1998).
Commentary
Phonetica 2000;57:275–283
Received: November 2, 1999 Accepted: February 14, 2000
Imitation and the Emergence of Segments
Michael Studdert-Kennedy
Haskins Laboratories, New Haven, Conn., USA
Abstract

The paper argues that the discrete phonetic segments on which language is raised are subjective gestural structures that emerge ontogenetically (and perhaps emerged evolutionarily) from the process of imitating a quasi-continuous acoustic signal with a neuroanatomically segmented and somatotopically organized vocal machinery. Evidence cited for somatotopic organization includes the perceptual salience in the speech signal of information specifying place of articulation, as revealed both by sine wave speech and by the pattern of errors in children’s early words. Copyright © 2000 S. Karger AG, Basel
‘Almost every insight gained by modern linguistics, from Grimm’s Law to Jakobson’s distinctive features, depends crucially on the assumption that speech is a sequence of discrete entities.’ Morris Halle [1964, p. 325]
In his position paper for this volume Lindblom proposes an original and persuasive account of how ‘discrete entities’ (gestures, segments) may emerge from the continuous processes of speaking and listening. Supplementing two earlier papers [Lindblom, 1992, 1998], the account is unique in its attempt to solve the central issue of modern speech research without programmatic appeal either to the as yet undiscovered invariants of direct realism [Fowler, 1986] or to the specialized decoding devices concealed in a phonetic module [e.g. Liberman and Mattingly, 1985]. Since I fully approve of Lindblom’s goals, agree with his proscription of nativism, and admire the elegance of his arguments, the following comments are largely those of an amicus curiae.
The Intersubjectivity of Speech
The ‘discrete entities’ of Halle’s observation are not simply a heuristic ‘assumption’. Such entities are physical prerequisites of all systems that ‘make infinite use of finite means’ [von Humboldt, 1836/1972, p. 70]. Such systems (e.g. physics, chemistry, genetics, language) necessarily conform to the ‘particulate principle of self-diversifying systems’ [Abler, 1989; Studdert-Kennedy, 1998, 2000] by which discrete units from a finite set of meaningless elements (e.g. atoms, chemical bases, phonetic segments) are repeatedly sampled, permuted, and combined to yield larger units (e.g.
molecules, genes, words) higher in a hierarchy and both different and more diverse in structure and function than their constituents. Duality of patterning, the two-level hierarchy of phonology and syntax on which the unbounded semantic scope of language rests, is a special case of the particulate principle common to every physical system of unbounded diversity. Discrete units are therefore logically necessary postulates for the description of linguistic function.

What, then, is the empirical evidence for such units? Every phonetician is familiar with the fact that spectrograms do not divide the acoustic flow of speech into a sequence of discrete, invariant segments corresponding to the segments of linguistic description [Fant, 1962; Liberman et al., 1967]. Yet, perhaps because ‘the sounds of the world’s languages’ are physical events amenable to increasingly sophisticated acoustic analysis, speech scientists have been reluctant to accept that ‘... there is no way to avoid the traditional assumption that the speaker-hearer’s linguistic intuition is the ultimate standard that determines the accuracy of any proposed grammar, linguistic theory, or operational test...’ [Chomsky, 1965, p. 21]. Many speech scientists have continued to hope that advances in speech technology or behavioral analysis may enable them to shed the introspective methods still burdening their colleagues in syntax. There may indeed be some prospect of demonstrating the segmented structure of speech through analysis of phonetic segments into discrete gestures [Browman and Goldstein, 1992], as Lindblom [this vol.] argues (see below). But until such an analysis can be agreed upon, the only objective evidence for the discrete segments of speech is the discrete letters of the alphabet that represent a speaker-hearer’s intuitions. That the fluent speech of any language can be transcribed as a string of letters and can then be recovered from that string by a reader is, in my view and despite the scholarly scepticism of some linguists [e.g. Faber, 1992], unequivocal evidence for the reality of the segment at some level of linguistic function. Spoonerisms and other segmental speech errors may lend insight into likely biophysical constraints on speaking [e.g. MacNeilage, 1985], but they add little to the alphabet by way of objective support for the segment, because [with the single exception of an electromyographic study by Mowrey and Mackay, 1990] they depend on alphabetic records of speaker-hearers’ subjective judgments.

Speech segments, then, are subjective psychophysical entities, their characteristic perceptual properties (as described by distinctive feature theory, for example) analogous to loudness and pitch, or brightness and color. Yet segments are perhaps even more deeply subjective than such auditory and visual dimensions, insofar as their defining margins seem to have no reliable correlates in the acoustic signal. Acoustic discontinuities often occur, to be sure, with more or less abrupt changes in amplitude or spectrum. These discontinuities may serve as ‘landmarks’ around which important information concerning segments is distributed [Liu, 1996; Stevens, 1985]. But not every discontinuity marks a segment, nor is every segment marked by a discontinuity. The nature of the difficulty is thrown into relief by attempts to transform the speech waveform into a conceptually and experimentally more tractable auditory representation.
The spectrograph, although sometimes billed as an instrument for purely acoustic analysis, was, in fact, devised as a rough model of the human ear viewed as an adaptive biophysical mechanism for analyzing a waveform into its component frequencies. More sophisticated models, intended to capture nonlinearities of audition by weighting the frequency output of the analyzer according to the estimated frequency response curve of the cochlea [e.g. Evans, 1982] or the general auditory system [e.g. Schroeder et al., 1979], have clarified certain issues in speech research, and Lindblom has been a
leader in their use [e.g. Liljencrants and Lindblom, 1972; Lindblom, 1986]. But whatever such models may contribute to the solution of the invariance problem, they have little to contribute to the problem of segmentation, because segments are no better marked in auditory transforms than in standard spectrograms.

In this regard, the success of Victor Zue in learning to read spectrograms is instructive [Cole et al., 1980]. As the reader will recall, Zue learned by assiduous practice for over 2,000 h to do what few, if any, others have done, namely, to transcribe the spectrograms of unknown (even nonsensical, and therefore syntactically and semantically unpredictable) utterances with remarkable accuracy. His central task was to see through the variability of differing phonetic contexts to the underlying sequence of discrete phonetic segments, without top-down support from syntax or meaning. That he succeeded demonstrates unequivocally that the informed eye (and, presumably, the informed ear) can indeed find reliable, even if context-conditioned, segmental markers in the fluent spectral sequence. I use the epithet ‘informed’ because two pieces of information, absent from the spectrogram, were critical to Zue’s success: (1) that the language was English, (2) that there were segments to be found. Even for the already segmented, but undeciphered, transcriptions of Minoan Linear B [Chadwick, 1958] and old Mayan [Coe, 1992], the identity of the language and the nature of the discrete entity encoded (word, syllable, segment) were postulates essential to their decoding.

For the child learning to talk, the identity of its native language gradually unfolds in its characteristic structural regularities. But how does the child discover that fluent speech is composed of discrete, though intricately overlapping, segments? I shall argue shortly that the child does so by discovering how it must engage its vocal machinery in order to speak, and that it speaks by repeatedly combining the discrete actions (gestures) of six functionally independent (even when mechanically interacting) articulators: lips, tongue blade (tip), tongue body, tongue root, velum, and larynx. In other words, the child discovers segments in its own body and behavior. It is the behavior implicit in an animal’s morphology and environment at any given stage that drives development, just as it drives evolution [Mayr, 1982, p. 612; Studdert-Kennedy, 1991, p. 8].

We should not, however, identify a gesture too closely with an individual articulator, because at each instant of speech every articulator is engaged in action (or in inaction) complementary to the current primary articulator, and is thereby contributing to the overall vocal tract configuration [Mattingly, 1991]. The rise and fall of each configuration then specifies the domain of a given gesture [Carré and Chennoukh, 1995; Fowler and Smith, 1986]. Because it is the vocal tract configuration that structures the acoustic signal, gestures implicit in a configuration are accessible, in principle, to any listener, even one who cannot talk, such as a cerebral palsied child. Provided afferent and central cognitive processes are unimpaired and damage is confined to efferent processes, a child can evidently apprehend a gesture’s contrastive phonological function [e.g. Cossu et al., 1987]. Thus, even a child who cannot talk may discover segments implicit in its own body and behavior, and is thereby launched into language.1

1 Kluender et al.
[1987] conditioned Japanese quail to discriminate /dv/ syllables from /bv, gv/ syllables, spoken by an adult male, and to generalize their responses across novel vowel contexts. The quail were not tested, however, for generalization across syllable position from CV to VC. We have no reason therefore to suppose that they discriminated anything more than one class of holistic syllables differing in onset from two others. Whatever bearing the study may have on the issue of invariance, it has little bearing on the emergence of segments.
Imitation and Segments
Vocal imitation, unique to humans among primates [Hauser, 1996, pp. 650–651], is the mechanism by which a child builds its lexicon. Imitating a word entails analysis of the sound pattern into its underlying articulatory components and reassembly of the components in their correct temporal sequence, a process that introduces segmentation by transducing continuously variable sound into a pattern of discrete gestures. Lindblom [this vol.], correctly in my view, grounds the segmental mechanism in the neuroanatomical segmentation of the vocal machinery and the likely somatotopic organization of its neural control. But he has little to say about how the perceptuomotor link between sound and gesture is established. How does sound get into the muscles? There have been two main classes of answer: (1) by conditioning; (2) by a specialized (not necessarily innate) mechanism. Let us briefly consider examples of each of these.

Exemplar Theory

In his exemplar model of speech perception, Johnson [1997, p. 147] defines an exemplar as ‘... an association between a set of auditory properties and a set of category labels... [T]he set of category labels includes any classification that may be important to the perceiver, and which was available at the time the exemplar was stored – for example, the linguistic value of the exemplar, the gender [sic] of the speaker, the name of the speaker, and so on’ [my italics]. Presumably, the ‘linguistic value’ of an exemplar includes its component phonetic categories. The question then is whether the auditory properties of a spoken exemplar can be labeled phonetically without specifying its articulation. I believe the answer to this question is ‘no’. The name of a speech sound is not an arbitrary label like a letter of the alphabet or a word. We can certainly change the written label for /b/ to /d/, or the name (‘category label’) for a canine creature from dog to cat. But we cannot change the name of the sound we transcribe as /b/ to the name of the sound we transcribe as /d/, because the name of a speech sound is the spoken sound itself. If a listener is to store an exemplar speech sound as a phonetic segment, it would seem that the ‘category label’ specifying the ‘linguistic value’ associated with the auditory properties of the sound must include its articulatory value. In this way a speech sound exemplar becomes a segmentally structured phonetic event, simultaneously both auditory and articulatory.

Perhaps Johnson [1997, p. 153] himself believes this, since he goes on to explain how ‘... an exemplar model can, in principle, also be used to give an account of the production-perception link’. He proposes a conditioning, or associationist, account of the link as ‘... based on one’s own speech’ [p. 154]. Apparently, among the category labels the infant associates with the auditory properties of its own speech are the gestures it makes in pronouncing the sounds. (Here, it would seem, is the origin of the subjectivity of speech in exemplar theory.) The infant’s own speech sounds are then ‘ego exemplars’, and ‘... the gestural knowledge derived or generated while listening to others is based on ego exemplars’ [p. 154]. If this is so, the child learning to talk derives articulatory instructions (gestures) for pronouncing new words by recognizing an auditory match between its own sound exemplars and the sounds that compose the adult word it intends to imitate. This may seem plausible, but there are several difficulties.
First, given the lack of spectral overlap between the speech of an adult and the speech of, say, a 1-year-old child, how can
the child match its own sounds with an adult’s without some process of the normalization, or demodulation [Traunmüller, 1994], that exemplar theory rejects? Second, if gestural ‘category labels’ only become associated with the auditory properties of a speech sound exemplar when the child hears its own speech, how do new sounds enter the child’s spoken repertoire? Only by random search, it would seem. Indeed, Lindblom [this vol.] writes of children who ‘... stumble on motorically motivated phenomena in the ambient language’. Random search within a motorically constrained articulatory space seems plausible enough for the hominid evolutionary path into speech. For the modern child, however, we have evidence that the search is not random, but rather is guided by an early developing somatotopically organized mode of perception. Before I consider this, let me turn briefly to an alternative approach to the perceptuomotor link.

Facial Imitation

How infants imitate facial gestures that they cannot see themselves perform is a question about which we have learned a great deal over the past two decades, largely through the sustained research program of Meltzoff and Moore [for a fairly recent review, see Meltzoff and Moore, 1997]. I have no space to consider this work in detail, but several aspects of their findings and current model are of interest for an understanding of vocal imitation. In particular, Meltzoff and Moore propose a supramodal representation system mediating between perception and action, and an intermodal mechanism (as opposed to a conditioned association) for generating an imitative response. Among the theoretical concepts they invoke are: (1) organ identification, a mechanism for identifying the body part(s) to be moved; (2) body babbling, a process analogous to vocal babbling, that maps muscle movements onto ‘organ-relation end states’, analogous to vocal tract configurations; (3) a cross-modal metric of equivalence based on ‘organ relations [that] render commensurate the seen but unfelt act of the adult and the felt but unseen facial act of the infant’. How much of the system is innate, and how much develops epigenetically in response to the social environment, is as much a question for facial imitation as for speech. But at the heart of the facial system is the infant’s (and the adult’s) capacity to recognize correspondences in organ relations between self and conspecific other. This would seem necessarily to depend on some demodulating mechanism that renders self and conspecific other structurally and functionally isomorphic. Among the many possibilities the facial work raises for speech, then, is that the cross-modal (auditory-to-articulatory) metric of equivalence for speech may be mediated by vocal tract configuration rather than by normalization of the auditory signal. Perhaps this could be accomplished by auditory-to-body-part neural links analogous to the visual ‘mirror neurons’ discovered by Rizzolatti et al. [1996] at the University of Parma.

Mirror Neurons

Recently, Rizzolatti et al. [1996] have reported what they call ‘mirror neurons’ in macaque cortex: neurons that fire not only when a monkey grasps or manipulates food, but also when it sees a human experimenter do the same. Firing is specific to the act of grasping, and does not occur when the monkey sees an experimenter pick up food with a tool. These perceptuomotor neurons lie in an area of macaque cortex arguably homologous with Broca’s area.
Data from transcranial magnetic stimulation and positron emission tomography studies demonstrate a mirror system for manual grasping also in
humans. Rizzolatti and Arbib [1998, p. 190], reviewing this evidence, postulate ‘a fundamental mechanism for action recognition’ in both monkey and human. Obviously, mirror neurons must be part of a complex network engaged both in acting and in monitoring the acts of others – not only of conspecifics, it would seem, but of other animals, such as humans, with similar gross morphology. But these neurons offer no solution to the problem of ‘how light gets into the muscles’, because they give no hint of how the perceptuomotor link is made. Nor, I should emphasize, is there yet evidence for such neurons in the speech system. Nonetheless, work on mirror neurons takes another experimental step toward understanding the neural basis for the social empathy characteristic of primate species [cf. Brothers et al., 1990]. And these neurons are of particular interest for vocal imitation because they seem to be organized not only by function – grasping, manipulating, eating, and so on –, but also somatotopically. With this in mind, let us turn to children’s early words.
Early Words
Elsewhere I have sketched a developmental sequence for the origin of segments [Studdert-Kennedy, 1987, 1991; Studdert-Kennedy and Goodell, 1995] that draws its exemplar-type model from the account proposed by Lindblom et al. [1984]. On this account, the initial unit of linguistic action is the holistic word [Ferguson and Farwell, 1975]. The word is said to be holistic, even though it is spoken as a sequence of discrete gestures, because gestures are not yet represented as independent phonetic elements that can be marshaled for use in an unbounded set of other contexts. As an automatic consequence of sorting and stacking phonetically similar words, independent gestures eventually emerge, and recurrent patterns of co-occurring gestures are then gradually integrated into segments. Lindblom [e.g. 1992, 1998] has characterized such an emergent process far more elegantly and concisely than I can, both in this volume with his NEP model and in other papers.

Evidence for the gesture as an independent unit of function in young children is hard to come by, partly because children tend to avoid words that they cannot pronounce [Vihman, 1991; Vihman and DePaolis, 2000] and so are surprisingly accurate in their pronunciation, and partly because the period during which children make nonrandom and interpretable speech errors tends to be a narrow window around the end of the 1st year, when they are attempting their first words. Nonetheless, systematic analysis of such errors strongly supports the gesture as the child’s initial intrasyllabic unit of phonetic action [Studdert-Kennedy and Goodell, 1995].

A remarkable fact about early consonantal errors is that they tend to be errors of gestural timing or amplitude rather than of place of articulation. This is remarkable because, according to standard speech lore, place of articulation is significantly more susceptible to degradation by noise and filtering than are manner and voicing [Miller and Nicely, 1955]. Yet, if we look at the word-initial phone classes (the set of interchangeable segments with which a child attempts a given word) for the children of Ferguson and Farwell [1975], labials tend to be exchanged with labials, alveolars with alveolars, velars with velars. Errors of place seldom occur. Table 1 makes the point with data I have tabulated from 4 children in the Stanford Child Phonology Project [Vihman, 1996, Appendix C]: 80% of single-feature errors are on voicing or manner, 20% on place of articulation.
Table 1. Initial consonants in early words of 4 English-learning children in three half-hour sessions during transition from babbling to word use (13–16 months approximately): tabulation of data from Stanford Child Phonology Project in Appendix C of Vihman [1996]

Target                            Number      Number     Number of single-feature errors
                                  attempted   correct    place   voicing   other
Bilabial [p, b, m, w]             140         88         3       46        3
Alveolar [t, d, n, r, l, s, z]    86          46         14      25        1
Velar [k, g]                      40          27         4       9         0
Total                             266         161        21      80        4
% of errors                                              20      76        4
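A quick check (ours, not part of the original tabulation) of the bottom row of table 1: the percentages follow directly from the error tallies.

```python
# Recompute the '% of errors' row of table 1 from the error counts.
errors = {"place": 21, "voicing": 80, "other": 4}
total = sum(errors.values())  # 105 single-feature errors in all
for feature, n in errors.items():
    print(f"{feature}: {100 * n / total:.0f}%")  # place 20%, voicing 76%, other 4%
```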
Taking a hint from fact and theory in facial imitation and from the macaque’s mirror neurons, we may speculate that the acoustic speech signal, like the optic face, specifies the ‘organs’ (articulators) that are to be activated more clearly than the amplitude and relative phasing of their activation. Such a hypothesis is consistent with the surprising intelligibility of both fricative speech [Kuhn, 1975] and sine wave speech [Remez et al., 1994]. The sound source for fricative speech is uniform frication exciting the oral front cavity resonance; in sine wave speech all the acoustic elements characteristic of vocal sound production are replaced by a set of time-varying sinusoids that track the changing resonances, and so the changing configurations, of the vocal tract. Perhaps it is because these bizarre forms of speech preserve information about which articulators were engaged (that is, about place of articulation) that they are so readily intelligible.
Conclusion
Exemplar-based learning models are attractive because, as Lindblom [this vol.] remarks, they undertake to solve the problems of speech invariance and segmentation by statistical accumulation rather than by ad hoc hypothetical decoding mechanisms. In this respect exemplar models formalize a style of approach that Lindblom has been following for many years. Yet there is little reason to suppose that segments automatically emerge from the statistical stacking of auditory exemplars, none of which is segmented at the time of storage. What seems to be missing, then, from Lindblom’s emergent phonology is an explicit mechanism by which auditory patterns make contact with the neuroanatomically segmented vocal machinery that produces them. Perhaps studies of social interaction in other primates, such as macaques, and of imitation in other modalities, such as facial imitation, will provide the key.

Acknowledgments

My thanks to René Carré, Randy Diehl, John Kingston, and Robert Remez for instructive comments and discussion. Preparation of the paper was supported in part by Haskins Laboratories.
References

Abler, W.L.: On the particulate principle of self-diversifying systems. J. soc. Biol. Struct. 12: 1–13 (1989).
Brothers, L.; Ring, B.; Kling, A.: Response of neurons in the macaque amygdala to complex social stimuli. Behav. Brain Res. 41: 199–213 (1990).
Browman, C.P.; Goldstein, L.: Articulatory phonology: an overview. Phonetica 49: 155–180 (1992).
Carré, R.; Chennoukh, S.: Vowel-consonant-vowel modeling by superposition of consonant closure on vowel-to-vowel gestures. J. Phonet. 23: 231–241 (1995).
Chadwick, J.: The decipherment of Linear B (Cambridge University Press, Cambridge 1958).
Chomsky, N.: Aspects of the theory of syntax (MIT Press, Cambridge 1965).
Coe, M.: Breaking the Maya code (Thames and Hudson, New York 1992).
Cole, R.A.; Rudnicky, A.I.; Zue, V.M.; Reddy, D.R.: Speech as patterns on paper; in Cole, Perception and production of fluent speech, pp. 3–50 (Erlbaum, Hillsdale 1980).
Cossu, G.; Urbinati, M.L.; Marshall, J.C.: Reading without speech and writing without arm movements. 5th Eur. Workshop on Cognitive Neuropsychol., Bressanone 1987.
Evans, E.F.: Representation of complex sounds at cochlear nerve and cochlear nucleus levels; in Carlson, Granström, The representation of speech in the peripheral auditory system, pp. 27–42 (Elsevier, Amsterdam 1982).
Faber, A.: Phonemic segmentation as epiphenomenon: evidence from the history of alphabetic writing; in Downing, Lima, Noonan, The linguistics of literacy, pp. 111–134 (J. Benjamins, Philadelphia 1992).
Fant, C.G.M.: Descriptive analysis of the acoustic aspects of speech. Logos 5: 3–17 (1962).
Ferguson, C.A.; Farwell, C.B.: Words and sounds in early language acquisition. Language 51: 419–439 (1975).
Fowler, C.A.: An event approach to the study of speech perception from a direct-realist perspective. J. Phonet. 14: 3–28 (1986).
Fowler, C.A.; Smith, M.R.: Speech perception as ‘vector analysis’: an approach to the problems of invariance and segmentation; in Perkell, Klatt, Invariance and variability in speech processes, pp. 123–139 (Erlbaum, Hillsdale 1986).
Halle, M.: On the bases of phonology; in Fodor, Katz, The structure of language, pp. 324–333 (Prentice-Hall, Englewood Cliffs 1964).
Hauser, M.D.: The evolution of communication (MIT Press, Cambridge 1996).
Humboldt, W. von: Linguistic variability and intellectual development; translated by G.C. Buck and F.A. Raven (University of Pennsylvania Press, Philadelphia 1836/1972).
Johnson, K.: Speech perception without speaker normalization: an exemplar model; in Johnson, Mullenix, Talker variability in speech processing, pp. 145–165 (Academic Press, New York 1997).
Kluender, K.R.; Diehl, R.; Killeen, P.: Japanese quail can learn phonetic categories. Science 237: 1195–1197 (1987).
Kuhn, G.M.: On the front cavity resonance and its possible role in speech perception. J. acoust. Soc. Am. 58: 428–433 (1975).
Liberman, A.M.; Cooper, F.S.; Shankweiler, D.P.; Studdert-Kennedy, M.: Perception of the speech code. Psychol. Rev. 74: 431–461 (1967).
Liberman, A.M.; Mattingly, I.G.: The motor theory of speech perception revised. Cognition 21: 1–36 (1985).
Liljencrants, J.; Lindblom, B.: Numerical simulation of vowel quality systems: the role of perceptual contrast. Language 48: 829–862 (1972).
Lindblom, B.: Phonetic universals in vowel systems; in Ohala, Jaeger, Experimental phonology, pp. 13–44 (Academic Press, New York 1986).
Lindblom, B.: Phonological units as adaptive emergents of lexical development; in Ferguson, Menn, Stoel-Gammon, Phonological development: models, research, implications, pp. 131–163 (York Press, Timonium 1992).
Lindblom, B.: Systemic constraints and adaptive change in the formation of sound structure; in Hurford, Studdert-Kennedy, Knight, Approaches to the evolution of language, pp. 242–264 (Cambridge University Press, Cambridge 1998).
Lindblom, B.; MacNeilage, P.F.; Studdert-Kennedy, M.: Self-organizing processes and the explanation of phonological universals; in Butterworth, Comrie, Dahl, Explanations for language universals, pp. 181–203 (Mouton, New York 1984).
Liu, S.A.: Landmark detection for distinctive feature-based recognition. J. acoust. Soc. Am. 100: 3417–3430 (1996).
MacNeilage, P.F.: Serial-ordering errors in speech and typing; in Fromkin, Phonetic linguistics, pp. 193–201 (Academic Press, New York 1985).
Mattingly, I.G.: The global character of phonetic gestures. J. Phonet. 18: 445–452 (1991).
Mayr, E.: The growth of biological thought (Belknap Press of Harvard University, Cambridge 1982).
Meltzoff, A.N.; Moore, M.K.: Explaining facial imitation: a theoretical model. Early Dev. Parenting 6: 179–192 (1997).
Miller, G.A.; Nicely, P.E.: An analysis of perceptual confusions among some English consonants. J. acoust. Soc. Am. 27: 338–352 (1955).
Mowrey, R.A.; Mackay, I.R.A.: Phonological primitives: electromyographic speech error evidence. J. acoust. Soc. Am. 88: 1299–1312 (1990).
Remez, R.E.; Rubin, P.E.; Berns, S.M.; Pardo, J.S.; Lang, J.R.: On the perceptual organization of speech. Psychol. Rev. 101: 129–156 (1994).
Rizzolatti, G.; Arbib, M.A.: Language within our grasp. Trends Neurosci. 21: 188–194 (1998).
Rizzolatti, G.; Fadiga, L.; Gallese, V.; Fogassi, L.: Premotor cortex and the recognition of motor actions. Cognitive Brain Res. 3: 131–141 (1996).
Schroeder, M.R.; Atal, B.S.; Hall, J.L.: Objective measures of certain speech signal degradations based on masking properties of human auditory perception; in Lindblom, Öhman, Frontiers of speech communication research, pp. 217–229 (Academic Press, London 1979).
Stevens, K.N.: Evidence for the role of acoustic boundaries in the perception of speech sounds; in Fromkin, Phonetic linguistics, pp. 243–255 (Academic Press, New York 1985).
Studdert-Kennedy, M.: The phoneme as a perceptuomotor structure; in Allport, MacKay, Prinz, Scheerer, Language perception and production, pp. 67–84 (Academic Press, London 1987).
Studdert-Kennedy, M.: Language development from an evolutionary perspective; in Krasnegor, Rumbaugh, Schiefelbusch, Studdert-Kennedy, Biological and behavioral determinants of language development, pp. 5–28 (Erlbaum, Hillsdale 1991).
Studdert-Kennedy, M.: The particulate origins of language generativity: from syllable to gesture; in Hurford, Studdert-Kennedy, Knight, Approaches to the evolution of language, pp. 202–221 (Cambridge University Press, Cambridge 1998).
Studdert-Kennedy, M.: Evolutionary implications of the particulate principle: imitation and the dissociation of phonetic form from semantic function; in Knight, Studdert-Kennedy, Hurford, The emergence of language: social function and the origins of linguistic form, pp. 161–176 (Cambridge University Press, Cambridge 2000).
Studdert-Kennedy, M.; Goodell, E.W.: Gestures, features and segments in early child speech; in de Gelder, Morais, Speech and reading, pp. 65–88 (Erlbaum, Taylor & Francis, Hove 1995).
Traunmüller, H.: Conventional, biological and environmental factors in speech communication: a modulation theory. Phonetica 51: 170–183 (1994).
Vihman, M.M.: Ontogeny of phonetic gestures; in Mattingly, Studdert-Kennedy, Modularity and the motor theory of speech perception, pp. 9–84 (Erlbaum, Hillsdale 1991).
Vihman, M.M.: Phonological development (Blackwell, Oxford 1996).
Vihman, M.M.; DePaolis, R.A.: The role of mimesis in infant language development: evidence for phylogeny?; in Knight, Studdert-Kennedy, Hurford, The emergence of language: social function and the origins of linguistic form (Cambridge University Press, Cambridge 2000).
Commentary
Phonetica 2000;57:284–296
Received: November 25, 1999 Accepted: March 16, 2000
Deriving Speech from Nonspeech: A View from Ontogeny
Peter F. MacNeilage, Barbara L. Davis
University of Texas at Austin, Tex., USA
Abstract

A comparison of babbling and early speech, word patterns of languages, and, in one instance, a protolanguage corpus, reveals three basic movement patterns: (1) a ‘Frame’ provided by the cycles of mandibular oscillation underlying the basic mouth close-open alternation of speech; this Frame appears in relatively ‘pure’ form in the tendency for labial consonants to co-occur with central vowels; (2) two other intracyclical consonant-vowel (CV) co-occurrence patterns sharing the alternation: coronal consonants with front vowels and dorsal consonants with back vowels; (3) an intercyclical tendency towards a labial consonant-vowel-coronal consonant (LC) sequence preference for word initiation. The first two patterns were derived from oral movement capabilities which predated speech. The Frame (1) may have evolved from ingestive cyclicities (e.g. chewing). The intracyclical consonant-vowel (CV) co-occurrence patterns involving tongue position constraints common to consonants and vowels (2) may result from the basic biomechanical property of inertia. The third pattern (LC) was a self-organizational result of pressures for interfacing cognition with action – a result which must have numerous analogs in other domains of movement organization. Copyright © 2000 S. Karger AG, Basel
Introduction
It gives us considerable pleasure to be included in this special issue honoring Björn Lindblom. In our opinion, he has done more than any other phonetician in the 20th century to advance the cause of the discipline of phonetics. He has done this by insisting on and demonstrating the value of a fundamental conceptual framework for the discipline, summarized by his phrase ‘derive language from nonlanguage’ [Lindblom, 1984, p. 78]. Standing behind this advocacy is the Neo-Darwinian theory of evolution by natural selection with its fundamental tenet of descent with modification. In our discipline, Lindblom’s dictum boils down to ‘derive phonology from phonetics’. In his words we must derive ‘the fundamental units and processes deductively from independent premises anchored in physiological and physical realities’ [Lindblom, cited by
Ladefoged, 1984; see also Lindblom, 1984]. This approach is diametrically opposed to the one espoused by Ladefoged [1984] and common in phonology, according to which the most important thing about speech is that it has a level of abstract form which is largely independent of phonetics [e.g. Anderson, 1981; Halle, 1990; Kenstowicz, 1994; and a number of papers in Goldsmith, 1995].

Most phoneticians do not even take sides on these issues. Instead, they focus on the most limited of the four questions that the Nobel laureate Tinbergen [1952] tells us must be answered in order to understand a communication system: ‘How does it work?’ Scant attention is given to the other three questions which must be included if we are ever to explain speech in terms of Mayr’s [1984] ‘ultimate causes’. They are: What does it do for the organism? How did it get that way in ontogeny? How did it get that way in phylogeny?
Deriving Speech From Nonspeech
Our aim in this paper is to provide evidence in support of Lindblom’s belief that the most profitable approach for phonetics is to derive speech from nonspeech. We specifically concern ourselves with Tinbergen’s [1952] questions 3 and 4, ontogeny and phylogeny. Our main conclusion is that aspects of the structure of speech held in common between infants and modern languages were probably the building blocks of the first spoken words.

A little over a decade ago we began to study the acquisition of speech. Although we did not know of Tinbergen’s framework at the time, we began to consider all of his questions simultaneously. The ‘how does it work’ question was addressed by attempting to infer the nature of movement control of the production system from acoustic information. We assumed, with Lindblom, that what speech does for the organism is to allow us to send and receive a large linguistic message set by developing a system which responds to pressures for optimization across the conflicting constraints of ease of production and perceptual distinctiveness. The core of a possible answer to the question ‘how did it get that way in phylogeny’ was provided by the Frame/Content theory of evolution of speech, initially presented by MacNeilage et al. in 1984 and 1985. According to this theory, an initial frame for speech in the form of a close-open mouth cycle produced by mandibular oscillation was subsequently elaborated by programming of different consonants and vowels for the closing and opening phases. The initial stage was considered to be dominated by motor constraints, with subsequent developments highly influenced by the need for perceptual distinctiveness between message variants. The empirical aspect of our research program involves ontogeny. The proposed answer to the question of ‘how did it get that way in ontogeny’ is that ontogeny recapitulates phylogeny in the sense of beginning with a similar set of motor constraints to those of early phylogeny, and then progressively acquiring the sounds and sound patterns that were introduced into languages later [MacNeilage and Davis, in press a].

There was one additional perspective with which we began this work. It was provided by Karl Lashley [1951] in a classic paper entitled ‘The Problem of Serial Order in Behavior’. The problem he posed was: ‘How is any sequence of actions organized?’ He considered speech to be at once the most challenging and potentially the most revealing serially ordered behavior in living forms.
The Frame/Content Theory of Evolution of Speech
The Frame/Content theory [see MacNeilage, 1998, for a current version] was initially formulated as a possible explanation of the serial organization of adult speech. The key observation that led to the theory was that in segmental exchange errors, the segments almost always go into the same position in syllable structure that they came out of [see Levelt, 1989]. Initial consonants exchange with initial consonants (‘tall ships’ → ‘shawl tips’), vowels exchange with vowels (‘ad hoc’ → ‘odd hack’) and final consonants exchange with final consonants (‘top shelf’ → ‘toff shelp’). Phenomena such as these led Levelt [1992, p. 10] to conclude that ‘probably the most fundamental insight from modern speech error research is that a word’s skeleton or frame and its segmental content are independently generated’. The basic assumption of the Frame/Content theory is that the premotor frame constraining speech errors evolved from a motor frame of mandibular oscillation. The capability of inserting independently controlled segmental content elements is considered to have evolved later.
Frame Dominance in Babbling and Early Speech
One of the most salient characteristics of babbling and early speech is that the prototypical event is a relatively rhythmic alternation between a closed and open mouth configuration (e.g. [baba]). Our initial assumption was that this oscillation was related to the one we postulated as the motor frame for the earliest speech of hominids. We assumed that acquisition of speech production was a matter of ‘Frames, then Content’ [MacNeilage and Davis, 1990a]. Our primary interest was to determine how segmental content came to be differentiated from frames, or, in other words, how segmental independence developed [MacNeilage and Davis, 1990b].

The methodological approach we have taken is very simple, but powerful. We generate extremely large databases of phonetically transcribed babbling episodes and early words (and more recently words in dictionaries). We then evaluate the frequencies of serial organization patterns against frequencies to be expected on the basis of the overall occurrence of the individual elements in the corpus.

So far, our work on speech acquisition has been confined to the babbling stage (7–12 months), and the subsequent stage of production of single words (12–18 months). When we began the work it was already known that these two stages involved very similar output forms [e.g. Vihman et al., 1985]. We have found very little evidence that infants have developed any segmental independence by the end of the single-word stage. Instead, what we found can be summarized by the term ‘Frame Dominance’ [Davis and MacNeilage, 1995]. Most of the variance in vocal output from 7 to 18 months of age can be attributed to mandibular oscillation alone, with very little evidence that any of the other articulators – lips, tongue, soft palate – are moved independently during vocal episodes. We will now describe the studies that led to this conclusion.

The first problem we encountered was that there was very little knowledge about vowels in early speech acquisition. As vowels and consonants alternate with each other in babbling and early speech, both the problem of segmental independence and the more general serial order problem crucially involve vowels as well as consonants, and, specifically, the relations between the two elements. Consequently, consistent with our
methodological strictures, our first study was an extremely large-scale case study of phonetic transcription of words of a single infant at the early word stage, 14–20 months, with the intention of considering vowels and vowel-consonant relations [Davis and MacNeilage, 1990]. The results of this study gave us our first inkling of the presence of Frame Dominance. We found three consonant-vowel co-occurrence patterns: coronals with front vowels, dorsals with back vowels and labials with central vowels. The two lingual patterns – coronal-front and dorsal-back – suggested that the tongue was typically not moving independently of the mandible. For the labial-central pattern, it seemed possible that lip contact was simply the result of mandibular oscillation and that the tongue was in a rest position in the center of the mouth. Consequently, apart from the positioning of the tongue in the anterior or posterior part of the mouth, which apparently occurred by the onset of the utterance, mandibular oscillation alone could be the source of all three of these patterns. As we had no reason to believe at that time that these patterns were also present in the English language, and therefore could have been learned by the time of the first words, we hypothesized that they must be basic to early vocalization and must therefore be present from the onset of babbling.

Subsequent studies of CV co-occurrences in babbling and early speech, in which we attempted to confirm these hypotheses, focussed on the co-occurrence of vowels with stop consonants, nasals and glides, because these segment types constituted the overwhelming majority of nonvocalic sounds (92% in the first-word period). In three studies of babbling, in a total of 9 infants, all three predictions have been strongly confirmed. There were 27 possible tests of the three hypotheses (9 subjects, 3 per subject). Three could not be made because there were insufficient numbers of dorsal consonants. Of the remaining 24 tests, 23 were positive and there was 1 null result. Overall, the three co-occurrence patterns were almost 30% higher than would have been expected on the basis of the frequencies of the two components of the CV pair in the overall corpus. Median ratios of observed-to-expected co-occurrences in the study of 6 subjects were: coronal-front 1.28, dorsal-back 1.22, and labial-central 1.34. A very similar result was found in two studies of early words, involving 10 infants and the same methodology [MacNeilage et al., 1997, and unpublished observations]. A total of 27 of the 28 possible tests were positive, with 1 negative outcome. The median ratios of observed-to-expected co-occurrences in the 10 infants were: coronal-front 1.45, dorsal-back 1.36, and labial-central 1.28.

We believe that most of the null findings and counterexamples reported in other studies [Boysson-Bardies, 1993; Oller and Steffans, 1993; Tyler and Langsdale, 1996; Vihman, 1992] are a result of four methodological differences between those studies and ours. These studies uniformly involved much smaller databases than ours, sometimes divided their data into small subsets for analysis (thus reducing sample sizes), sometimes used different vowel classifications, and sometimes did not take both vowel and consonant frequencies into account in computing expected values.
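The expected frequencies in these tests follow directly from the marginal counts of the consonant and vowel classes in the corpus. Below is a minimal sketch of the observed-to-expected ratio computation; the syllable data and class labels are invented for illustration and are not the authors’ transcriptions or coding scheme.

```python
from collections import Counter

# Toy corpus of transcribed CV syllables: (consonant place, vowel class).
# Hypothetical data for illustration only.
syllables = [
    ("coronal", "front"), ("coronal", "front"), ("labial", "central"),
    ("dorsal", "back"), ("labial", "central"), ("coronal", "central"),
    ("dorsal", "back"), ("labial", "front"), ("coronal", "front"),
    ("dorsal", "central"), ("labial", "central"), ("coronal", "front"),
]

n = len(syllables)
consonants = Counter(c for c, v in syllables)  # marginal consonant counts
vowels = Counter(v for c, v in syllables)      # marginal vowel counts
pairs = Counter(syllables)                     # observed CV pair counts

# A ratio above 1 means the CV pair co-occurs more often than the overall
# frequencies of its two members would predict under independence.
for (c, v), observed in sorted(pairs.items()):
    expected = consonants[c] * vowels[v] / n
    print(f"{c}-{v}: O/E = {observed / expected:.2f}")
```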
Another prediction we made after the initial study [MacNeilage and Davis, 1990a] concerned a possible lack of independence between successive syllables in addition to the lack of independence we had found between successive sounds within syllables. Some background is needed in order to understand why this prediction was made. Earlier work had led to the claim that the initial intersyllabic pattern in babbling was one of syllable reduplication (repetition) and that variegation (nonrepetition) only became
prominent in later babbling [e.g. Oller, 1980]. This claim seemed plausible because it was consistent with the commonsense assumption that there was a progression towards greater serial complexity as infants got older. However, subsequent results suggested that reduplication and variegation coexist in more or less equal quantities from the beginning of babbling [Smith et al., 1989; Mitchell and Kent, 1990]. We subsequently found, in a study of babbling in 6 infants [Davis and MacNeilage, 1995], that there was the same amount of variegation in the first half of the babbling period as in the second. These results cast some doubt on the conclusion that variegation involves more segmental independence than reduplication.

This finding encouraged us to hypothesize that, as in intrasyllabic organization, most of the variance in intersyllabic organization might involve the vertical, or close-open, dimension rather than the horizontal, or front-back, dimension of the oral cavity, and might therefore be primarily due to ‘Frame Modulation’ [MacNeilage and Davis, 1990a]. Consonants might vary primarily in manner of articulation, which mostly involves the amount of constriction in the tract (related to mouth closing), rather than in place of articulation, which involves changes in the front-back dimension. Vowels might vary more in height (related to mouth opening) than in the front-back axis. The mandible might be the primary contributor to both the amount of mouth closing for consonants and the amount of opening for vowels.

Both of these predictions were also strongly confirmed for 6 infants in babbling [Davis and MacNeilage, 1995] and 10 infants in first words. All infants showed both predicted effects in babbling at highly significant levels. The median ratio of observed to expected occurrences for consonants was 2.80; for vowels it was 1.42. As to speech, because some infants produced too few instances of consonant variegation in their first words for separate tests, the data for the 10 infants were pooled, and the expected effect was found to be highly significant. All infants but 1 showed the vowel effect. The median ratio of observed to expected occurrences was 7.0.

The confirmation of the 5 predictions, 3 for CV co-occurrences and 2 for frame modulation, provided strong support for the Frame Dominance concept. Most of the variance in babbling and early speech is the result of mandibular oscillation, with other articulators tending to adopt a static configuration throughout the utterance.

Three other studies provided further support for the Frame Dominance conception. The studies described earlier involved only the frequently occurring nonvocalic sounds – stop consonants, nasals and glides. However, an additional study of infrequent consonants in babbling – fricatives, affricates and liquids – showed that they are for the most part subject to the same CV co-occurrence patterns and variegation patterns as the more frequently occurring consonants [Gildersleeve-Neumann et al., in press]. An acoustic analysis showed that the vowels intervening between successive nasal consonants (e.g. [mam]) were strongly nasalized, suggesting that the soft palate – like the tongue and the lips – typically stayed in the same position for entire nasalized utterances [Matyear et al., 1997].
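The manner-versus-place classification behind these variegation counts can be made concrete with a short sketch. The feature table, utterances and labels below are hypothetical illustrations, not the authors’ transcription or coding scheme; in a full analysis the resulting counts would then be compared with the values expected from the overall frequencies of each variegation type, as in the CV co-occurrence tests above.

```python
from collections import Counter

# Minimal consonant feature table (illustrative): consonant -> (place, manner).
FEATS = {
    "b": ("labial", "stop"), "m": ("labial", "nasal"),
    "d": ("coronal", "stop"), "n": ("coronal", "nasal"),
    "g": ("dorsal", "stop"),
}

def variegation_types(utterance):
    """Yield 'place' and/or 'manner' for each successive consonant pair
    whose members differ on that dimension."""
    cons = [ch for ch in utterance if ch in FEATS]
    for c1, c2 in zip(cons, cons[1:]):
        if FEATS[c1][0] != FEATS[c2][0]:
            yield "place"   # horizontal (front-back) change
        if FEATS[c1][1] != FEATS[c2][1]:
            yield "manner"  # vertical (constriction) change

# Hypothetical babbled utterances, not actual transcriptions.
babbles = ["bama", "badi", "nanu", "gaba", "dano"]
counts = Counter(t for u in babbles for t in variegation_types(u))
print(counts)  # e.g. Counter({'manner': 2, 'place': 2})
```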
A separate study was made of consonants that occur in absolute-final position in babbling, as earlier studies had suggested that they tended to differ from consonants elsewhere in being more often voiceless and/or fricatives [Redford et al., 1997]. We found that, regardless of their voicing and manner of articulation, these consonants typically agreed with the consonant preceding them in place of articulation, suggesting that they result primarily from frame reiteration.
CV Co-Occurrence Patterns in Languages
The strength of the CV co-occurrence patterns in infants led us to ask whether this lack of segmental independence was a development-specific phenomenon or whether it was also present in adult speech. We tabulated CV co-occurrences involving stops and nasals in 12,630 words derived from dictionary counts of ten languages representing several major language families – English, Estonian, French, German, Hebrew, Japanese, New Zealand Maori, Ecuadorian Quichua, Spanish and Swahili [MacNeilage et al., 2000]. We found evidence for all three patterns at an average of 18% above expectation. Seven languages showed the coronal-front pattern with a mean observed-to-expected ratio of 1.16. Eight languages showed the dorsal-back pattern with a mean observed-to-expected ratio of 1.27. Seven languages showed the labial-central pattern with a mean ratio of 1.10.

We believe that the joint occurrence of these three patterns in infants and languages indicates that they were present in the first language/s. We believe this because the biomechanical contingencies involved in these patterns are so basic. The lingual co-occurrences (coronal-front and dorsal-back) must simply be due to a biomechanical constraint on the amount of tongue movement that can readily be made between a consonant involving the tongue and a vowel that follows it. The labial-central pattern is of particular interest from the standpoint of the Frame/Content theory. According to the theory, the basic frame involves mandibular oscillation alone with no necessary active intrasyllabic movement of any other articulator. This is exactly what seems to be occurring in instances of labial-central co-occurrence in infants: the tongue is presumably in a resting position in the center of the mouth. Even though the tongue and lips are not passive in modern adult labial-central pairings [MacNeilage and Sholes, 1964], modern languages appear to have retained a tendency to preserve the position the tongue had when it was a passive accompaniment of ancestral ‘pure’ frame production, i.e. mandibular oscillation alone.
Early Speech: The Labial-Coronal Sequence Pattern
Much research has shown that the output forms of babbling and early speech are very similar [see MacNeilage, 1997, for a summary]. Although similarity is the mode, we have found three differences. First, labial consonants increase in first words [MacNeilage et al., 1997; see also Boysson-Bardies et al., 1992]. This seems to be a regression towards pure frames induced by the increase in functional load resulting from the task of interfacing the motor system with the mental lexicon [MacNeilage, 2000]. Second, there is a greater ability to vary the identity of the vowel in utterance-final position.

The third trend is especially significant, as it is the only clear-cut case of an increase in segmental independence that we have found during the first-word stage. The trend is called ‘Fronting’ [Ingram, 1974]: the first consonant in a word tends to have a more anterior place of articulation than the second. As dorsal consonants are relatively rare in first words, the main trend is to begin with a labial consonant and, after the vowel, continue with a coronal (the LC pattern). This pattern was about 2.5 times more frequent than the reverse pattern in our group of 10 infants in the first-word stage [MacNeilage et al., 1999]. We also found it in nine of the ten languages we studied, occurring 2.25 times as often as the reverse pattern [MacNeilage et al., 1999].
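Counting the LC preference reduces to tallying the place classes of the first two consonants of each word and comparing the two orders. The sketch below illustrates this on made-up data; the place table, word list and counts are hypothetical and do not reproduce the authors’ corpora.

```python
from collections import Counter

# Simplified place classes for a few consonants (illustrative only).
PLACE = {"b": "labial", "p": "labial", "m": "labial",
         "t": "coronal", "d": "coronal", "n": "coronal", "s": "coronal",
         "k": "dorsal", "g": "dorsal"}

def first_two_places(word):
    """Place classes of a word's first two consonants, or None."""
    places = [PLACE[ch] for ch in word if ch in PLACE]
    return tuple(places[:2]) if len(places) >= 2 else None

# Hypothetical first-word transcriptions, not the authors' data.
words = ["mat", "pot", "bead", "top", "bunny", "mad", "pan", "dime"]
seqs = Counter(p for w in words if (p := first_two_places(w)) is not None)

lc = seqs[("labial", "coronal")]
cl = seqs[("coronal", "labial")]
print(f"LC = {lc}, CL = {cl}, LC/CL ratio = {lc / cl:.2f}")
```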
As in the case of the increase in labials, we have interpreted this fronting as an ease-related response to the problem of interfacing output with use of the mental lexicon [MacNeilage et al., 2000]. There are two reasons for this interpretation, in addition to the implications of the labial regression effect itself. First, as discussed earlier, labials may be a simple outcome of mandibular oscillation, at least in infants, while coronals involve an additional movement – of the tongue. Second, infants whose babbling and early speech attempts have been prevented by a tracheostomy strongly prefer labials in their posttracheostomy vocalizations [e.g. Locke and Pearson, 1990].

Why would there be an easy beginning rather than an easy ending? The existence of a separate neural system for movement initiation in vertebrates [e.g. Loeb, 1987] can be taken to mean that movement initiation poses a unique control problem, which is perhaps reduced when movement can be initiated in an easy manner. The labial-coronal sequence effect may be a self-organizational response to the problem of simulating the serial output complexity that the infant discerns in the ambient language. That the labial-coronal sequence is not simply the result of a copying operation in infants is suggested by the fact that it is somewhat stronger in infants from its onset than it is in adults. In addition, infants have been reported to produce this pattern even when the target word has the opposite pattern [as in ‘pot’ for ‘top’; Macken, 1978; Jaeger, 1997].

As with the CV co-occurrence patterns, the presence of this pattern in languages as well as infants suggests that it may have first developed early in language evolution. As in infants, it may have first developed in a self-organizational manner, because it was relatively readily producible, but once produced, was retained as a sound pattern for a new lexical unit. If this interpretation is correct, the onset of the LC effect is an extremely momentous event in both evolution and acquisition, not so much in itself as in its consequences. In both cases, it results in a quantum jump in serial complexity which, if occurring in the context of labial-labial and coronal-coronal sequences, as we have portrayed it, increases the possible disyllabic patterns by 50% at a single stroke (from the two reduplicative patterns to three). By making a tongue movement after the first frame, instead of prepositioning it before the utterance begins, an output discontinuity is induced. Perhaps an additional step consisted of being able to control the intervening vowel to make it either of the type that usually goes with the preceding consonant or of the type that goes with the following one (central or front), thus producing a further quantum jump in sequence possibilities. Such events could rapidly conspire to give the appearance of a literally systematic discontinuity in the structure of the communication system when the system is viewed from a distance, so to speak. Thus the now unique phenomenon we know as speech may have been born. However, the uniqueness may lie more in the end result as we see it than in any single formative event. As suggested by Gould [1977, p. 409], ‘external discontinuity may well be inherent in underlying continuity provided that a system displays enough complexity’.
Serial Organization in a Protolanguage Corpus
Our hypotheses about the form of early language/s – that they had frames, CV co-occurrence patterns and the LC sequence effect – cannot be directly tested. However, we have recently found evidence for these patterns in an analysis of a 27-word corpus of putative protowords – words with common sound patterns across many existing
language families – presented by Bengtson and Ruhlen [1994; MacNeilage and Davis, in press b]. In this corpus, totalling 49 syllables, there were either only 2 exceptions to the tendency to alternate between a single consonant and a single vowel, or 4, depending on whether the [ts] sequence is classified as an affricate or as a consonant cluster. The observed-to-expected ratios for the three hypothesized CV co-occurrence patterns were: coronal-front 1.94, dorsal-back 1.63, and labial-central 1.31. None of the other six observed-to-expected ratios exceeded 0.94. A chi-square analysis of the overall distribution of CV co-occurrences was significant (χ² = 9.63, d.f. = 4, n = 46, p < 0.05). The most frequent variegated consonant sequence was labial-coronal (8), and there was only 1 coronal-labial sequence.
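A chi-square analysis of this kind treats the CV counts as a contingency table of consonant place by vowel class; with three rows and three columns the degrees of freedom are (3−1)(3−1) = 4, matching the value reported above. A minimal sketch follows; the cell counts are invented for illustration and are not the 49-syllable protoword counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x3 table of CV counts (rows: labial, coronal, dorsal;
# columns: front, central, back). Made-up numbers for illustration.
table = np.array([[2, 8, 3],
                  [10, 4, 2],
                  [3, 4, 10]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, d.f. = {dof}, p = {p:.4f}")

# Per-cell observed-to-expected ratios, analogous to those in the text.
print(np.round(table / expected, 2))
```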
Implications for the Frame/Content Theory
The evidence presented here suggests to us that ancestral hominids were like modern infants in going through two stages in the evolution of true speech: an initial Frame stage and a subsequent Frame/Content stage. In the Frame stage, which begins in infants at the beginning of babbling, a systematic pairing of phonation and mandibular oscillation allows sustained voiced alternation between consonants and vowels. Pure frames are the simplest result, but static nonrest positioning of the soft palate and tongue is also possible, allowing nonnasal sound production and fronted (coronal-front) and backed (dorsal-back) frames, all tending to be sustained through an entire utterance.

One other probable property of this frame stage suggests itself: babbled utterances [Davis and MacNeilage, 1995], first words, words of languages [Bell and Hooper, 1978] and the putative protolanguage forms [MacNeilage and Davis, 2000] all tend to begin with a consonant and end with a vowel. This property is probably a response to a problem of serial order in vocalization, a problem held in common with other mammals. Presumably mammalian calls in general show influences of the need to co-ordinate phonation with the departure of the articulatory system from a resting or vegetative configuration at the beginning of the call and the return to this configuration at the end of the call. In the case of hominids the problem is to sandwich one form in particular – the frame – between the prior and following vegetative states of the production apparatus [MacNeilage and Davis, in press b].

We share Lindblom’s view that modern speech results from a compromise between articulatory ease and perceptual distinctiveness. The topic of this paper could have been called ‘foundations of articulatory ease’ were it not for the problem of an acceptable definition of the word ‘ease’. Our conviction is that perceptual distinctiveness played/plays only a minor role in the frame stage of phylogeny and ontogeny. For the early hominid, distinctiveness was unlikely to have been a problem because initially there was only a small vocabulary, and it is easier to distinguish between a small number of possible signals (e.g. Morse code).

It is clear that the complexity of the signal level of speech must have vastly increased in evolution in order for modern languages to directly encode so many message variants. Every language has developed its modern sound and sound pattern repertoire by a historical process of paradigmatic (sound inventory) and syntagmatic (serial organization) expansion. Selection pressures have forced the expansion of the message set, even though this expansion involves the use of new sounds and patterns which take the
production system out of its most comfortable range. The finding of Lindblom and Maddieson [1988] that languages with large inventories have a disproportionately large number of consonants which are difficult to produce is evidence of this process of expansion. We suggest that these developments occurred in a second, Frame/Content stage of evolution. In our opinion, speech acquisition involves recapitulation of this sequence of events, but by progressively more precise simulation of ambient language models rather than, as in phylogeny, by a sequence of inventions of new lexicon-sound links. The LC effect may have been the first systematic syntagmatic trend in the direction of increased complexity in evolution, as it is in acquisition. In English, the long course of acquisition of fricatives and liquids is an indication of the difficulty of paradigmatic aspects of the historical process of overall repertoire expansion, and the difficulty of acquisition of consonant clusters is an indication of the difficulty of syntagmatic aspects of the process.

In the course of these developments, infants, and presumably early hominids, are/were forced to abandon some of their most basic motor patterns, in the transition to adult forms in the case of infants, and in the historical transition to modern forms in the case of hominids. These patterns, though basic to the hominid production system, are superseded because they have become incompatible with complex high-speed modern speech transmission and perception. The main example of this is the virtual loss of syllable reduplication in languages, even though it is the main form of intersyllabic organization in infants, and, according to the Frame/Content theory, was also the main original form of intersyllabic organization in earlier hominids. In our study of ten languages we found that consonant reduplication occurs at only 67% of chance levels, suggesting an active prohibition of the form rather than a simple reduction to chance levels [MacNeilage et al., 2000]. Infants are therefore forced to abandon their early preference for syllable reduplication in the course of speech acquisition. It is perhaps worth noting that the disappearance of a universal and therefore (many would say) genetically specified property of speech during ontogeny is at present inexplicable within generative phonology [Drachman, 1976; Pater, 1997].
Conclusions
How successful have we been in our attempt to derive speech from nonspeech with an approach centered on speech ontogeny? How our success is judged depends primarily on the plausibility of the assumption that sounds and patterns common to infants, languages and protolanguage corpora were probably present in first words.

Basic motor properties were the common element in all the patterns we found. The original form of the most basic of these patterns, the frame, may have been present since the origin of mammals, a fifth of a billion years ago [Radinsky, 1987]. Pure frames may reflect the simplest operational form of the frame. It could be said that the acoustic correlates of the resting position of the tongue in pure frames only began to signal a ‘vowel’ when they began to make a difference in message transmission depending on their presence or absence. In linguistic parlance this only happened when one acoustic constellation formed a minimal pair with another in the service of sending two messages. In this sense, vowel-related acoustic packages preceded vowels, and the same could be said for consonants. Speech evolved from nonspeech.
We have argued that lingual frames result from one of the most basic properties of matter of any kind, the property of inertia. We would like to turn the most obvious question around in this context. Rather than asking why the tongue does not move actively from one segment to another in infants in particular, we would ask why it should. Presumably the view that there is a genetic specification of distinctive features (usually separate features for vowels and consonants) and preferences in modes of interaction between these features would lead to the expectation of independent tongue movements from segment to segment in infants. But that view is only obtained by reification of adult sequential patterns, which are endpoints of phylogenetic and ontogenetic progressions, endpoints which, for the most part, successfully disguise their lineage.

What alternative explanations exist for the CV co-occurrence patterns? There is absolutely no evidence of a species-specific genetic basis for the intersegmental patterns we have observed. The genetic conclusion is obviously in a different realm than the view that one of the most basic properties of languages is lingual inertia. These conclusions are on opposite sides of the mind/body dichotomy. In our opinion, the tendency toward absence of intersegmental tongue movements can be attributed to inertia, in the absence of forces from any realm acting to overcome it, as there is no evidence of such forces. Is this not a more economical explanation of the lingual co-occurrence constraints than the postulation of genetically specified units and processes of Universal Grammar that have nothing to do with the prior evolution of communication?

The labial-coronal sequence pattern is in our opinion a reflection, in the serial organization of words, of an interaction between basic motor system capabilities and mental representations associated with words. There is nothing necessarily specific to speech in the labial-coronal pattern except for the actual problem space in which the pattern is evoked. To say that some property is speech-specific simply because it occurs in speech is to finesse the problem of causality. Functional load effects are commonplace in the human sciences, although performance effects of any kind are excluded from classical generative linguistics. A functional load effect similar to the one we suggest has been shown for infant speech perception at the age when the labial-coronal effect first appears: Stager and Werker [1997] found that infants show less discrimination of fine phonetic detail when required to pair words with objects than they show in syllable discrimination tasks. Numerous self-organization effects on action produced in the context of a mental intentional state have been demonstrated in the ontogeny of human walking [e.g. Thelen and Smith, 1994].

We have suggested that nonspeech phenomena are responsible not only for the presence of some of the speech patterns that we see today in languages but also for the absence of some that we do not see (e.g. syllable repetition), even though there are good nonspeech reasons for them to be easily producible. Languages may have below-chance levels of intersyllabic consonant repetition because of the deleterious effects of frequent repetition of the same sound in working memory in modern high-speed message transmission. It is well known in studies of working memory [e.g.
Conrad and Hull, 1964] that lists for recall that include spelled letters which share sounds, such as ‘dee’ and ‘bee’, or ‘ell’ and ‘eff’, lead to confusion. Equally well known is the ‘repeated phonemic effect’, whereby the occurrence of two examples of the same sound in close proximity tends to induce serial ordering errors in speech production [e.g. MacKay, 1987]. However, confusion in serial organization is not necessarily a speech-specific phenomenon.
It is obviously also a factor in typing [MacNeilage, 1964], a function that did not evolve and that uses a different control system from speech.

We have provided a good deal of evidence for Lindblom’s [1984] contention that it is possible to derive speech phenomena from the realm of nonspeech. We have also voiced our agreement with Lindblom that there is no alternative to this endeavor if we wish to understand the origin of speech. Time will tell how far we can get with this effort, but at the moment there is no reason for pessimism.

We believe we have only seen the tip of the iceberg for the particular approach that we have adopted. A comprehensive statistical analysis of the properties of an infant’s early speech makes it possible to understand phonological subpatterns of that infant in the context of the overall functioning of his/her system. The absence of this context has tended to result in a good deal of indeterminacy in a large number of reports that have focussed on system fragments and resulted in the formulation of ad hoc ‘rules’ or ‘mental strategies’. Virtually no studies have been done on the statistical properties of the words of languages. We have been amazed at how much common structure in infant speech and languages we have found in our initial dictionary analyses of words. To our knowledge, our statistical study of serial organization propensities in a protolanguage corpus is the first such study ever done. The outcome of a program that combines the three approaches cannot but exceed the sum of its parts. We believe that the approach described here is a vindication of Lindblom’s advocacy of deriving speech from nonspeech and of Tinbergen’s advocacy of a four-pronged attack on the understanding of communication.
Acknowledgment
This work was supported in part by National Institutes of Health Grant RO1 HD2773-07.
References
Anderson, S.R.: Why phonology isn’t natural. Ling. Inq. 12: 493–539 (1981).
Bell, A.; Hooper, J.B.: Issues and evidence in syllabic phonology; in Bell, Hooper, Syllables and segments, pp. 3–22 (North-Holland, Amsterdam 1978).
Bengtson, J.D.; Ruhlen, M.: Global etymologies; in Ruhlen, On the origin of language, pp. 277–336 (Stanford University Press, Stanford 1994).
Boysson-Bardies, B.: Ontogeny of language-specific syllable production; in Boysson-Bardies, de Schonen, Jusczyk, MacNeilage, Morton, Developmental neurocognition: speech and face processing in the first year of life, pp. 353–364 (Kluwer Academic Publishers, Dordrecht 1993).
Boysson-Bardies, B.; Vihman, M.M.; Roug-Hellichius, L.; Durand, C.; Landberg, I.; Arao, F.: Material evidence of infant selection from the target language: a cross-linguistic study; in Ferguson, Menn, Stoel-Gammon, Phonological development: models, research, implications, pp. 369–392 (York Press, Timonium 1992).
Conrad, R.; Hull, A.J.: Information, acoustic confusion and memory span. Br. J. Psychol. 55: 429–432 (1964).
Davis, B.L.; MacNeilage, P.F.: Acquisition of vowel production: a quantitative case study. J. Speech Hear. Res. 33: 16–27 (1990).
Davis, B.L.; MacNeilage, P.F.: The articulatory basis of babbling. J. Speech Hear. Res. 38: 1199–1211 (1995).
Drachman, G.: Child language and language change: a conjecture and some refutations; in Fisiak, Recent developments in historical phonology, pp. 123–144 (Mouton, The Hague 1976).
Gildersleeve, C.; Davis, B.L.; MacNeilage, P.F.: Production constraints in babbling: implications for fricatives, affricates and liquids. Appl. Psycholing. (in press).
Goldsmith, J.A.: The handbook of phonological theory (Blackwell, Oxford 1995).
Gould, S.J.: Ontogeny and phylogeny (Belknap, Cambridge 1977).
Halle, M.: Phonology; in Osherson, Lasnik, Language: an invitation to cognitive science, pp. 43–68 (MIT Press, Cambridge 1990).
Ingram, D.: Fronting in child phonology. J. Child Lang. 1: 233–241 (1974).
Jaeger, J.: How to say ‘Grandma’: the problem of developing phonological representations. First Lang. 17: 1–29 (1997).
Kenstowicz, M.: Phonology in generative grammar (Blackwell, Oxford 1994).
Ladefoged, P.: The limits of biological explanation in phonetics; in Cohen, van den Broecke, Abstr. 10th Int. Congr. Phonet. Sci., pp. 31–37 (Foris, Dordrecht 1984).
Lashley, K.S.: The problem of serial order in behavior; in Jeffress, Cerebral mechanisms in behavior, pp. 112–136 (Wiley, New York 1951).
Levelt, W.J.M.: Speaking: from intention to articulation (MIT Press, Cambridge 1989).
Levelt, W.J.M.: Accessing words in speech production: processes, stages and representations. Cognition 42: 1–22 (1992).
Lindblom, B.: Can the models of evolutionary biology be applied to phonetic problems? in Cohen, van den Broecke, Proc. 10th Int. Congr. Phonet. Sci., pp. 67–81 (Foris, Dordrecht 1984).
Lindblom, B.; Maddieson, I.: Phonetic universals and consonant systems; in Hyman, Li, Language, speech and mind, pp. 62–80 (Routledge, London 1988).
Locke, J.L.; Pearson, D.: Linguistic significance of babbling: evidence from a tracheostomized infant. J. Child Lang. 17: 1–16 (1990).
Loeb, G.E.: Motor systems; in Adelman, Encyclopedia of neuroscience, pp. 690–692 (Birkhäuser, Boston 1987).
MacKay, D.G.: The organization of perception and action (Springer, New York 1987).
Macken, M.: Permitted complexity in phonological development: one child’s acquisition of Spanish consonants. Lingua 44: 219–253 (1978).
MacNeilage, P.F.: Typing errors as clues to serial order mechanisms in language behavior. Lang. Speech 7: 144–159 (1964).
MacNeilage, P.F.: Acquisition of speech; in Hardcastle, Laver, The handbook of phonetic sciences, pp. 301–332 (Blackwell, Oxford 1997).
MacNeilage, P.F.: The Frame/Content theory of evolution of speech production. Behav. Brain Sci. 21: 499–548 (1998).
MacNeilage, P.F.; Davis, B.L.: Acquisition of speech: frames, then content; in Jeannerod, Attention and performance XIII, pp. 453–475 (Erlbaum, Hillsdale 1990a).
MacNeilage, P.F.; Davis, B.L.: Acquisition of speech: the achievement of segmental independence; in Hardcastle, Marchal, Speech production and speech modelling, pp. 55–68 (Kluwer, Dordrecht 1990b).
MacNeilage, P.F.; Davis, B.L.: On the origin of internal structure of word forms. Science 288: 527–531 (2000).
MacNeilage, P.F.; Davis, B.L.: Evolution of speech: the relation between ontogeny and phylogeny; in Hurford, Knight, Studdert-Kennedy, The evolutionary emergence of language (Cambridge University Press, Cambridge, in press, a).
MacNeilage, P.F.; Davis, B.L.: Evolution of the form of spoken words. Evol. Commun. (in press, b).
MacNeilage, P.F.; Davis, B.L.; Matyear, C.L.: Babbling and first words: phonetic similarities and differences. Speech Commun. 22: 269–277 (1997).
MacNeilage, P.F.; Davis, B.L.; Kinney, A.; Matyear, C.L.: Origin of serial output complexity in speech. Psychol. Sci. 10: 459–460 (1999).
MacNeilage, P.F.; Davis, B.L.; Kinney, A.; Matyear, C.L.: The motor core of speech: a comparison of serial organization patterns in infants and languages. Child Dev. 71: 153–163 (2000).
MacNeilage, P.F.; Sholes, G.N.: An electromyographic study of the tongue during vowel production. J. Speech Hear. Res. 7: 209–232 (1964).
MacNeilage, P.F.; Studdert-Kennedy, M.G.; Lindblom, B.: Functional precursors to language and its lateralization. Am. J. Physiol. B (Reg. integ. comp. Physiol. 15) 246: 912–915 (1984).
MacNeilage, P.F.; Studdert-Kennedy, M.G.; Lindblom, B.: Planning and production of speech: an overview; in Lauter, Planning and production of speech in normally hearing and deaf people, pp. 15–22, ASHA Rep. (American Speech and Hearing Association, Washington 1985).
Matyear, C.L.; MacNeilage, P.F.; Davis, B.L.: Nasalization of vowels in nasal environments in babbling: evidence of frame dominance. Phonetica 55: 1–17 (1997).
Mayr, E.: The history of biological thought (Belknap, Cambridge 1984).
Mitchell, P.; Kent, R.: Phonetic variation in multisyllabic babbling. J. Child Lang. 17: 247–265 (1990).
Oller, D.K.: The emergence of the sounds of speech in infancy; in Yeni-Komshian, Kavanagh, Ferguson, Child phonology, vol. 1: Production, pp. 29–42 (Academic Press, New York 1980).
Oller, D.K.; Steffans, M.L.: Syllables and segments in infant vocalizations and young child speech; in Yavas, First and second language phonology, pp. 45–62 (Singular Press, San Diego 1993).
Pater, J.: Minimal violation and phonological development. Lang. Acquis. 6: 201–253 (1997).
Radinsky, L.B.: The evolution of vertebrate design (University of Chicago Press, Chicago 1987).
Redford, M.L.; MacNeilage, P.F.; Davis, B.L.: Perceptual and motor influences on final consonant inventories in babbling. Phonetica 54: 172–186 (1997).
Smith, B.L.; Brown-Sweeney, S.; Stoel-Gammon, C.: A quantitative analysis of reduplicated and variegated babbling. First Lang. 17: 147–153 (1989).
Stager, C.L.; Werker, J.F.: Infants listen for more phonetic detail in speech perception than in word learning tasks. Nature 388: 381–382 (1997).
Thelen, E.; Smith, L.: A dynamic systems approach to the development of cognition and action (MIT Press, Cambridge 1994).
Tinbergen, N.: Derived activities: their causation, biological significance, origin and emancipation during evolution. Q. Rev. Biol. 27: 1–32 (1952).
Tyler, A.A.; Langsdale, T.E.: Consonant-vowel interactions in early phonological development. First Lang. 16: 159–191 (1996).
Vihman, M.M.: Early syllables and the construction of phonology; in Ferguson, Menn, Stoel-Gammon, Phonological development: models, research, implications (York Press, Timonium 1992).
Vihman, M.; Macken, M.; Miller, R.; Simmons, H.; Miller, J.: From babbling to speech: a reassessment of the continuity issue. Language 60: 397–445 (1985).
Phonetica 2000;57:297–314
Received: January 18, 2000 Accepted: March 16, 2000
Developmental Origins of Adult Phonology: The Interplay between Phonetic Emergents and the Evolutionary Adaptations of Sound Patterns
Björn Lindblom
Department of Linguistics, Stockholm University, Stockholm, Sweden
Abstract
In this paper fragments of a theory of emergent phonology are presented. Phonological patterns are seen as products of cultural evolution adapted to universal biological constraints on listening, speaking and learning. It is proposed that children develop adult phonology thanks to the interaction of the emergent patterning of phonetic content and the adaptive organization of sound structure. Emergence – here used in the technical sense of qualitatively new development – is illustrated with examples from the study of perception, motor mechanisms and memory encoding. In this framework, there is no split between ‘behavioral phonetics’ and ‘abstract phonology’. Phonology differs qualitatively from phonetics in that it represents a new, more complex and higher level of organization of speech behavior. Accordingly, the phonology that the child ends up with as an adult is neither abstract nor independent of use. It represents an emergent patterning of phonetic content.
Copyright © 2000 S. Karger AG, Basel
Introduction
Fundamental to linguistic methodology is to distinguish between the abstract structure of an utterance, its form, and its behavioral expression, its substance. The traditional division of labor between phonology and phonetics derives from that distinction [Fischer-Jørgensen, 1975]. A crucial step in the history of the discipline was taken by stipulating that form (la langue) take precedence over substance (la parole) [de Saussure, 1916]. However, as students of speech sounds endeavor to increase the explanatory adequacy of their descriptions, it is becoming increasingly clear that the assumption of the ‘logical priority of linguistic form’, left essentially intact since de Saussure, is counterproductive to that goal. One of the aims of this introduction is to reexamine this time-honored assumption.

An area where the consequences of the doctrine are particularly evident is language acquisition where, logically, the priority of linguistic form cannot be applied in any plausible way, since at the onset of development there is no form. We shall end our
introductory remarks by concluding that, if accounting for how children learn their native sound systems is to be part of explanatory linguistics, the doctrine of ‘form first, then substance’ must be rejected and replaced by another paradigm. The question of what that framework should be will be considered in the second part of the paper.

The ‘Inescapable’ Dogma of 20th Century Linguistics
The roles of phonology and phonetics are schematically diagrammed in figure 1.

[Fig. 1. The traditional division of labor between phonetics and phonology.]

The starting point is spoken samples from a given language observed by ear and specified in terms of the elements of a universal phonetic alphabet. This provides raw materials for functional analyses in which judgments of contrast by native informants play a crucial role. Experimental phonetics contributes physical descriptions of the perceptually relevant correlates of the phonological units. In principle, these specifications can be translated into audible, and therefore perceptually testable, form by means of speech synthesis. The key notion is that speech performance is to be analyzed as a realization of underlying, grammatically determined aspects of sound. Phonology aims at discovering those aspects, whereas phonetics describes how, once defined, linguistic structure is actualized by the speaker and how it is recovered from the signal by the listener. The significant phrase here is ‘once defined’, since, traditionally, physical observations cannot precede functional analyses: linguistic form must take precedence over phonetic substance.

A glimpse of the origins of the form-substance distinction can be obtained by considering the difficulties that the founders of the International Phonetic Association must have had in their attempts to create a phonetic alphabet, e.g. the problems of ‘phonetic variability’ and ‘phonetic detail’. First, what speaking style should phonological analyses be based on? Suppose numerous instances are recorded of the ‘same’ German utterance, e.g. mit dem Wagen, ranging from clear [mɪtʰ deːm ˈvaːgən] to more casual forms such as [mɪm ˈvaːŋ] [Kohler, 1990]. A standard move has been to exclude such style-dependent variations from the domain of phonology proper. As stated by Jakobson and Halle [1968, pp. 413–414]: ‘When analyzing the pattern of phonemes or distinctive features composing them, one
must recur to the fullest, optimal code at the command of the given speakers.’

Second, consider the problem of ‘phonetic detail’. Should tenth be represented as [tenθ], [tʰɛ̃ntθ], or with even more detail? Sweet [1877, pp. 103–104] concluded: ‘It is necessary to have an alphabet which indicates only those broader distinctions of sound which actually correspond to distinctions of meaning in a given language.’ It has been claimed that one of de Saussure’s major contributions was:

… to focus the attention of the linguist on the system of regularities and relations which support the differences among signs, rather than on the details of individual sound and meaning in and of themselves. For Saussure, the detailed information accumulated by phoneticians is of only limited utility for the linguist, since he is primarily interested in the ways in which sound images differ, and thus does not need to know everything the phonetician can tell him. By this move, then, linguists could be emancipated from their growing obsession with phonetic detail [Anderson, 1985, pp. 41–42, italics ours].
The form-substance distinction ‘solves’ the problems of phonetic detail and phonetic variability by invoking a process of abstraction and idealization. It replaces variable and context-dependent behavioral data by invariant and context-free entities, such as phonemes and allophones. Phonetic substance is stripped away as linguistically irrelevant so as to uncover the phonologically significant structure assumed to be embedded in that substance. In other words, progress is achieved by making phonological structure independent of its behavioral use.

As mentioned, physical observation can never precede functional analysis. For an illustration consider the following observations. When the Swedish word ‘nolla’ (‘zero’) is played backwards, native Swedes hear ‘hallon’ (‘raspberry’) rather than the nonword ‘allon’ [Lindblom, 1980]¹. Spectrograms indicate that, when spoken as a citation form, ‘nolla’ has expiration noise at the end of the final vowel. This noise is heard as a speech sound when the tape runs backwards, but not when the word is presented in the forward direction. Another finding is that, when the name ‘Anna’ is played backwards, native Swedes hear ‘Hanna’ rather than ‘Anna’, which indicates that the perceptual asymmetry of the ‘nolla-hallon’ example is likely to originate in auditory processing (e.g. differences between forward and backward masking) rather than in language-specific lexical access.

The point of these examples is that the same physical pattern (the utterance-final noise) is a linguistically significant event in one situation, but not in the other. To the phonologist, ‘… nothing in the physical event … tells us what is worth measuring and what is not’ [Anderson, 1985, p. 41]. The ‘nolla-hallon’ demonstration² conforms with the widespread conviction that making phonetic measurements, no matter how comprehensive, would not help the phonologist, since only the ear and the brain of the native speaker can determine what is of linguistic relevance in a speech signal. It is observations of this sort that have led linguists to stipulate that form must come first, then substance. This distinction has been central for all of 20th century linguistics.
¹ These words have accent II, the ‘grave’ accent. Produced as citation forms their F0 contours are fall-rise patterns which will remain fall-rise patterns also when reversed.
² The full account of the nolla-hallon effect [Lindblom, 1980] has a second part which relates the observed perceptual asymmetry to the fact that the world’s languages seem to prefer using the glottal aspirated [h] in syllable-initial over syllable-final position. If, as we suggest, the nolla-hallon asymmetry is linked to universal characteristics of human hearing, this speech-independent auditory property could be invoked to explain the typological observations as well. However, predicting such patterns would require a nontraditional framework that puts ‘substance’ first and ‘form’ second (see further discussion below).
Linguists have left it intact, assuming their primary concern to be with the individual native speaker’s competence (mental grammar, tacit knowledge), not with performance (its behavioral instantiations):

It seems natural to suppose that the study of actual linguistic performance can be seriously pursued only to the extent that we have a good understanding of the generative grammars that are acquired by the learner and put to use by the speaker or hearer. The classical Saussurean assumption of the logical priority of the study of langue (and the generative grammars that describe it) seems quite inescapable [Chomsky, 1964, p. 52, italics ours].
The Focus of Phonetics: ‘Given the Units, What Are the Phonetic Correlates?’
As these remarks suggest, sound structure is postulated, not observed in the laboratory. Nonetheless, experimental phoneticians have accepted the ‘logical priority of form’ since, without an analysis of utterances into some kind of abstract units, it would be difficult to make sense of laboratory records. Therefore the following 30-year-old handbook statement on the relationship between phonetics and phonology continues to be a valid description of how speech sounds are analyzed:

… a combination of a strictly structural approach on the form level with an auditorily based description on the substance level will be the best basis for a scientific analysis of the expression when manifested as sound. This description has to start by the functional analysis, then it must establish in auditory terms the distinctions used for separating phonemic units, and finally, by means of appropriate instruments, find out which acoustic and physiological events correspond to these different units. The interplay between the different sets of phenomena will probably for a long time remain a basic problem in phonetic research [Malmberg, 1968, p. 15].
We conclude that, in keeping with the ‘inescapable’ dogma, the focus of phonetics is placed on describing how postulated phonological units are realized in production and how they are recovered from the signal in perception. In short, ‘given the units, what are the phonetic (behavioral) correlates?’

In suggesting that progress was made by defining phonological structure as independent of on-line use and by relegating the study of individual, situational and style-dependent variations to phonetics and other performance-oriented disciplines, the preceding account is unlikely to be controversial. However, as soon as issues of behavioral realism and explanatory adequacy are raised, problems arise and consensus tends to disappear.
Why a New Paradigm Is Needed
The Behavioral Realism of Linguistic Form
Let us begin by mentioning two classical issues that remain unresolved despite decades of experimental work. They are closely linked to accepting the priority of linguistic form: (i) the question of the ‘psychological reality’ of linguistic units and rules, and (ii) the issue of ‘phonetic invariance’.

The ‘psychological reality’ issue derives from the fact that the formal constructs of linguistic analyses are postulated rather than observed. Data on, e.g., the alphabet, speech errors [Fromkin, 1973], word games, and synchronic and diachronic phonology [Halle, 1964]³ have been used as evidence for segmental organization and discrete units as psychologically genuine phenomena.

³ ‘Almost every insight gained by modern linguistics from Grimm’s law to Jakobson’s distinctive features depends crucially on the assumption that speech is a sequence of discrete entities’ [Halle, 1964].
units as psychologically genuine phenomena. However, this evidence is only indirect, and that leaves room for alternative interpretations. Therefore, it is not surprising that phoneticians and psycholinguists differ as to how compelling that evidence really is [Ladefoged, 1984].

3 ‘Almost every insight gained by modern linguistics from Grimm’s law to Jakobson’s distinctive features depends crucially on the assumption that speech is a sequence of discrete entities’ [Halle, 1964].

The ‘phonetic invariance’ issue has a similar source. It arises from the fact that natural speech patterns exhibit extensive, complex individual, situational and stylistic variations, and from the assumption that formal linguistic units – stripped of variability – can by hypothesis be upgraded from ‘operationally defined’ to ‘behaviorally real’. On the one hand, a variable phonetic reality – on the other, context-free invariant linguistic representations. The mismatch between the two generates the invariance issue. Again there is indirect evidence, but so far there has never been a direct demonstration of phonetic invariance as a physical observable [Perkell and Klatt, 1986].

Are these two long-standing problems indications of significant, but reparable cracks in the theoretical edifice of linguistic science? Or are they irremediable consequences of the ‘priority of linguistic form’ portending a paradigm shift? In the absence of solutions, these difficulties have given some speech researchers second thoughts concerning the behavioral status of phonological units as invariant and context-free. A case in point is the recent interest in ‘exemplar models’ of speech perception (more anon).

Explanatory Adequacy in Phonology and the Form-Substance Distinction

Few linguists would currently interpret the form-substance distinction so rigorously as to claim that phonological units and processes are totally arbitrary, empty logical phenomena4. Recall the following afterthought in chapter 9 of The Sound Pattern of English [Chomsky and Halle, 1968, p. 400, italics ours]:

The entire discussion in this book suffers from a fundamental theoretical inadequacy. … The problem is that our approach to features, to rules, and to evaluation has been overly formal. … In particular, we have not made any use of the fact that features have intrinsic content.

4 In ‘Why Phonology Isn’t “Natural”’, Anderson [1981] acknowledges a role for performance-based accounts, but places them outside linguistics proper, since they fail to deal with the aspects that ought to interest the linguist the most, i.e. the formal idiosyncrasies of language per se. On that view, what counts as a ‘real’ linguistic explanation is one that deals with the functionally inexplicable.
Contemporary phonology presents numerous developments (e.g. ‘Optimality Theory’, ‘Grounded Phonology’, ‘Laboratory Phonology’) indicating that attempts are being made to link the description of sound patterns more tightly to the production and perception of speech. It appears clear that phonetics and phonology are undergoing a rapprochement. Such ‘second thoughts’ imply a softening of the condition that, like the rest of grammar, sound structure be autonomous and independent of language use. Do these developments indicate that a more productive fine-tuning of the phonetics/phonology division of labor is under way? Or are they signs of a growing realization that accepting the priority of form creates an impasse that unnecessarily deprives linguistics of explanatory power?

The Logical Priority of ‘la Parole’ in the Study of Language Acquisition

The perspective of child phonology further underscores how real and serious these questions are. On the one hand, it appears reasonable to expect a phonological theory with explanatory ambitions to aim at accounting for how children develop the sound
structure of their native languages. In response to the question, ‘Where does phonology come from?’, linguistics would provide a developmental answer instead of claiming that it is largely determined by our genetic endowment (= nativism)5, or that it has to be postulated given analysis methods and the observations themselves (= curve-fitting). On the other hand, honoring the ‘priority of linguistic form’ does not make sense in the case of phonetic learning because at the onset of development there is no ‘form’. We are forced to conclude that research in this area does not conform to the game plan of figure 1 and that the focus question of traditional phonetics, ‘given the units, what are the phonetic correlates?’, is utterly problematic. It would seem preferable to rephrase it as: ‘Given the child’s behavior, what are the units?’. Consequently, the ‘inescapable’ dogma of 20th century linguistics does not apply to language acquisition. If accounting for how children learn their native sound systems is to be part of explanatory linguistics, the priority of form must be rejected and another paradigm must be found. How could such a framework be developed?

5 ‘This analysis into features could not plausibly be said to have been learned, for there are surely few experiences in the life of a normal individual who is not a professional linguist or a phonetician that would lead her/him to develop a system of features for classifying speech sounds. One is, therefore, led to assume that the speech-analysing system is part of our genetic endowment …’ [Halle and Stevens, 1979, pp. 339–340]. ‘And similar correlations between articulatory activity and acoustic signal are genetically provided for each of the nineteen or so features that make up the universal set of phonetic features’ [Halle and Stevens, 1991, p. 10].
Modeling Phonological Development as Emergent Computation
Methodological Conditions, Long-Term Goals and Hypotheses

The present section programmatically sketches fragments of a theory of emergent phonology. The following ground rules provide the key to escaping the explanatory impasse imposed by the priority of linguistic form: (1) Phonological structure must not prematurely be assumed to be genetically prespecified. Rather, it should be deduced from the child’s experience and minimal assumptions about ‘initial knowledge’. In the technical sense of the term, it should be derived as emergent behavior. (2) Phonological structure should not be postulated simply because the entities or processes are suggested by the data to be explained. Always seek independent motivations in ‘first principles’ and avoid mere ‘curve fitting’. Restated, the first rule says that nativism should be replaced by emergent computation. The second is an anticircularity condition. In summary, the common message is ‘deduce rather than postulate’!

The following presentation is made up of several hypotheses. (i) Cumulative perceptual experience is complex but shows lawful effects of emergent categorization. (ii) Motor learning unfolds according to a criterion of minimum energy consumption. This universal physiological constraint puts the child within closer reach of the articulatory patterns of its native phonology. (iii) Reuse of perceptual and motor patterns is favored owing to a metabolic constraint on memory formation. These ideas exemplify possible roles that listening, speaking and learning might have in the shaping of sound systems and suggest domains where behaviorally motivated ‘first principles’ might be sought. Ambient input interacts with all three. (iv) Languages exhibit (a tangled fabric of) adaptations to the proposed processes, e.g. patterns of perceptual contrast, articulatory ease and combinatorial coding of submorphemic elements such as ‘features’ and
‘segments’. (v) Furthermore, it is assumed that, although sociocultural evolution sometimes opposes the effects of the above-mentioned factors, linguistic systems nevertheless retain those adaptations owing to the blind phonetic ‘editing’ unwittingly performed by speakers, listeners and learners during on-line language use [Lindblom et al., 1995].

Conceivably it might be objected that the present suggestions drastically overestimate the role of functional constraints at the expense of formal factors. As a brief response to that objection, we note that the functionalism vs. formalism issue is obviously highly complex. Therefore a computational approach will be necessary. Our position is that any hypothesis propounded – whether formal or functional, articulatory or perceptual – must eventually be evaluated using ‘first principles’ simulations in an integrated manner that gives all the component hypotheses a fair chance to compete and be numerically evaluated. Accordingly, whether we favor a functionalist or a formalist stance, the agenda becomes identical. A framework of emergent computation will be needed in either case.
Listening
Emergent Effects of Cumulative Perceptual Experience

Studies of unscripted speech are beginning to draw attention to the drastic modifications that phonetic forms frequently suffer under natural, nonlaboratory conditions [Kohler, this vol.]. The variability of infant-directed speech, with its emotive coloring and lively prosody [Fónagy, 1983; Fernald, 1984], appears to be similar to what is found in adult-to-adult styles [Kuhl et al., 1997; Davis and Lindblom, 2000; U. Sundberg, 1998]. It is a near certainty that the invariance issue would need to be resolved also for baby talk. How do children segment the legato flow of signal information into words? How do they factor out emotive and stylistic transforms [J. Sundberg, this vol.]? In short, how do they manage to build their phonologies from a complex input?

Machine learning, including neural network theory, appears relevant to that question, particularly the ‘unsupervised’ approaches [Hinton and Sejnowski, 1999]. For instance, work has been done to study how ‘structure’ can be derived from complex inputs such as human faces. There is neurophysiological evidence indicating that the brain represents whole objects in terms of component parts [Wachsmuth et al., 1994; Logothetis and Sheinberg, 1996]. Lee and Seung [1999] describe an attempt to simulate such behavior computationally. They designed an algorithm that learned to analyze faces nonholistically, i.e. it automatically parsed them into parts resembling ‘several versions of mouths, noses and other facial parts’ [Lee and Seung, 1999, p. 789]. The term ‘parts’ is here used to refer to ‘entities that allow objects to be reassembled using purely additive combinations’ [Mel, 1999].

The field of automatic speech recognition also offers results rich in implications for phonetics and phonology. It turns out that currently the best-performing systems are not based on extensive a priori knowledge about phonetic structure. They do surprisingly well simply by exploiting statistical regularities in the speech signal. ‘Units’ are derived automatically as the stored data form clusters and as patterns of ‘tied states’ become defined [Young and Woodland, 1993]. These and other findings raise the question whether, in processing speech, children’s brains come up with units-based representations in a similarly unsupervised fashion. If so, perceptual phonological units would not represent abstract ‘form’ but simply the emergent patterning of the phonetic information.
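The Lee and Seung [1999] algorithm is non-negative matrix factorization. The following minimal sketch (in Python, with random stand-in data rather than face images) shows their multiplicative update rules; the non-negativity constraint is what forces purely additive, parts-based solutions:

```python
import numpy as np

# Minimal sketch of non-negative matrix factorization (Lee & Seung, 1999).
# V (features x samples) is approximated as W @ H with W, H >= 0, so every
# sample is a purely additive combination of the learned 'parts' in W.
# The data here are random stand-ins, not the face images of the study.

rng = np.random.default_rng(0)
V = rng.random((100, 400))          # e.g. 100 cues x 400 exemplars
r = 10                              # number of 'parts' to learn

W = rng.random((V.shape[0], r))     # basis ('parts')
H = rng.random((r, V.shape[1]))     # encodings

for _ in range(200):                # multiplicative update rules
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

print("reconstruction error:", np.linalg.norm(V - W @ H))
```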
The philosophy of the work just exemplified is reminiscent of so-called exemplar-based models of perception and learning, a paradigm explored for some time by psychologists [Estes, 1993]. Exemplar models make minimal assumptions about ‘initial conditions’ and are therefore not guilty of ‘resolving’ issues such as the variability problem by postulating unknown, innate mechanisms. They make the most of the signal and its complex, but lawfully structured, variability before positing abstract hypothetical decoding mechanisms.

Johnson and Mullenix [1997] compare traditional and exemplar-based approaches to speech perception. They point out that classical accounts assume representations (e.g. phoneme-sized units) to be simple (context-free invariants). The task of deriving such units from the speech signal calls for complex processes capable of extracting invariants. Mechanisms of this type have been proposed – e.g. the ‘phonetic module’ of the motor theory [Liberman and Mattingly, 1985], the ‘smart mechanisms’ of direct realism [Fowler, 1986, 1994] and the ‘top-down’ processes (reconstructive rules, inference making and hypothesis testing) of cognitively oriented approaches. The details of how they operate still need to be spelled out.

Exemplar accounts adopt the opposite perspective. They assume representation to be complex and mapping to be simple. Categories form as emergent products of cumulative phonetic experience. A key point is that, although the variability of speech signals is extensive, it is highly systematic. Exemplar models capitalize on this fact by storing stimulus information along with its immediate signal context. As more data accumulate, systematic covariations among stimulus dimensions gradually appear. The system can be said to use context to sort and disambiguate the variability. As a result, speech sounds have complex and contextually embedded representations unlike abstract phonetic segments. However, with sound systems shaped by a distinctiveness constraint [Diehl and Lindblom, in press; Johnson, this vol.] and a perceptual space of sufficiently large dimensionality, the integrity of sound-meaning relationships should have a fair chance of being maintained.6

6 For an illustration of this reasoning see the simplified exemplar-based account [Lindblom, 1990, 1996] of phonetic learning in Japanese quail [Kluender et al., 1987]. For attempts to implement exemplar learning computationally see Lacerda [1995].
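As an illustration of the ‘representation complex, mapping simple’ idea, here is a toy exemplar categorizer; the dimensions and data are invented for the sketch and do not come from any published model:

```python
import numpy as np

# Toy exemplar model: every labeled trace is stored together with a context
# cue (representation is complex); categorization is a similarity-weighted
# vote over all stored traces (mapping is simple). No abstract invariant
# unit is ever extracted.

rng = np.random.default_rng(1)

def make_tokens(mean, n=50):
    # each stored trace: an acoustic cue and a context cue, both noisy
    return rng.normal(mean, 0.5, size=(n, 2))

memory = {"i": make_tokens([2.0, 0.0]), "e": make_tokens([1.0, 0.0])}

def categorize(stimulus, memory, sensitivity=2.0):
    scores = {}
    for label, traces in memory.items():
        d = np.linalg.norm(traces - stimulus, axis=1)
        scores[label] = np.exp(-sensitivity * d).sum()
    return max(scores, key=scores.get)

print(categorize(np.array([1.8, 0.1]), memory))   # -> 'i'
```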
Speaking
Clues from the Study of Nonspeech Motor Processes

‘Articulatory ease’ is sometimes informally invoked to explain both phonetic and phonological observations. It has a certain commonsense appeal, but admittedly its current status is controversial. Ladefoged’s [1990] position is (i) that it is language-dependent, (ii) that it cannot be measured, and (iii) that therefore appeals to it are unscientific. In a paper on assimilation, usually seen as an articulatory process, Ohala [1990] rejects articulatory factors in favor of a perceptual account. He argues that articulatory ease is likely to play a marginal role in shaping sound patterns and that invoking it makes explanations teleological. As warnings against uncritical use of articulatory ease such statements are well taken, but, in the broader context of experimental biology, they appear overly pessimistic.
This field presents a large literature on the energetics of locomotion in various species. Quantitative data are available on how humans and dogs walk and run, how birds and bumblebees fly and how fish swim. A standard way of presenting results is to plot the amount of energy that the subject expends against traveling speed. The energy used is inferred from measurements of oxygen consumption made for subjects under steady-state conditions and therefore in an aerobic mode of oxygen uptake [McNeill Alexander, 1992]. A typical example of this research is the study by Hoyt and Taylor [1981], who measured energy consumption for horses walking, trotting and galloping. The subjects were observed as they moved freely and at speeds controlled by a treadmill. The energy used, expressed per unit distance traveled and plotted against traveling speed, formed U-shaped curves with distinct minima. Significantly, these minima were found to occur at speeds that subjects spontaneously adopted when moving freely and unconstrained by the speed of the treadmill. Such findings rest solidly on a large body of physiological studies [McArdle et al., 1996] and have been reported for a number of species. Experimental biologists interpret them to suggest that locomotion is shaped by a criterion of ‘minimum energy expenditure’.

Why Should Speech Movements Be Different?

Are speech movements and whole body movements similarly organized? Since energy costs for speech are likely to be small in comparison with those of locomotion, it might be argued that they play no major role at all in shaping phonetic movements. It is true that, until speech energy costs can be reliably measured, we have no basis for settling that issue satisfactorily. However, evolution’s tendency towards parsimony would make us expect the same rules to apply for small as for big movements. Among phoneticians, it is widely believed that both speech and sound patterns have many characteristics that are most readily accounted for in terms of production constraints. Conceivably, we will ultimately be able to show that many of them derive from a minimum energy expenditure condition. For instance, in running speech, prosodic modulations and speaking styles produce both strong elaborated and weak reduced forms. In our opinion, these segmental dynamics are an obvious candidate for an analysis based on energetics. Similarly, looking typologically at phonological systems, we observe a clear preference for low-cost motor patterns [Lindblom, 1983]. We hypothesize that minimization of energy expenditure plays a causal role in: (1) the absence of vegetative movements and mouth sounds; (2) determining the feature composition of phonetic segments (e.g. why are /i/ and /u/ universally ‘close’ vowels?); (3) constraining the universal organization of syllabic and phonotactic structure; (4) the patterning of diachronic and synchronic lenition and fortition processes; (5) shaping the system-dependent selection of phonetic values in segment inventories, etc.
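The logic of such U-shaped cost curves can be made concrete with a minimal numerical sketch. The constants below are invented for illustration and are not Hoyt and Taylor's measurements; the U shape arises because a baseline metabolic rate is wasted at low speeds while movement costs grow at high speeds:

```python
import numpy as np

# Sketch of why cost of transport is U-shaped: metabolic rate is a baseline
# P0 (paid regardless of speed) plus a movement term growing with speed.
# Cost per unit distance = rate / speed, so slow travel wastes the baseline
# and fast travel wastes the movement term; a minimum lies in between.
# P0 and a are invented constants, not measured values.

P0 = 5.0      # baseline metabolic rate (arbitrary units/s)
a = 0.02      # movement-cost coefficient

v = np.linspace(0.5, 30, 300)          # speed
rate = P0 + a * v**2                   # metabolic rate at speed v
cost_per_distance = rate / v           # U-shaped in v

v_star = v[np.argmin(cost_per_distance)]
print(f"energetically optimal speed ~ {v_star:.1f}")
print("analytic minimum at sqrt(P0/a):", np.sqrt(P0 / a))
```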
Recognizing the ‘Degrees of Freedom Problem’ for Speech Production

Motor systems offer their users an extremely rich set of possibilities for executing a given task. In principle, there is an infinite number of trajectories that a movement from one point to another could take. This motoric embarrassment of riches is technically known as the degrees of freedom (DOF) problem. Solving the DOF problem means selecting a unique movement from a very large search space. As the following example will show, speech production offers talkers countless possibilities for any given task, which makes the DOF problem a very real issue also for the phonetician.
Articulatory modeling [Lindblom and Sundberg, 1971; Maeda, 1991] has shown that there is a continuous trade-off between jaw opening and tongue raising in producing a given vowel, e.g. an /i/. In principle there is an infinite number of ways in which a given /i/ formant pattern could be produced. The normal way of making this vowel is to raise the jaw and adopt a moderately palatal tongue shape. However, it has been experimentally demonstrated that, when speakers are asked to produce a normal-sounding /i/ with an atypically large jaw opening maintained by a ‘biteblock’ [Lindblom et al., 1979], their output does not approach the /ɛ/-like quality predicted by area-function-to-formant calculations. In fact, subjects are able to match the normal quality and the formant pattern of the vowel quite closely, a result that clearly indicates a compensatory mode of articulation. X-ray data [Gay et al., 1981] have confirmed this interpretation, showing that, for biteblock /i/s, subjects compensate by raising the tongue higher than normal into a superpalatal position. This case is typical of many situations arising in articulatory modeling. It shows that the DOF problem definitely also applies to speech.

From Walking to Talking

In order to see how articulatory models might handle DOF, let us briefly turn to some recent computational research on human walking [Anderson and Pandy, 1999; Pandy and Anderson, this vol.]. It offers an interesting solution to this problem. The human body is represented as a three-dimensional model of the musculoskeletal system. The upper part consists of a rigid torso without arms. The lower part has 23 degrees of freedom controlled by a set of 54 muscles. Attempts were made to simulate the normal human gait cycle. The findings indicate that the model walks at a forward velocity of 81 m/min, a value typical of human subjects [Ralston, 1976]. Predicted displacements of anatomical structures were quantitatively similar to experimental observations. Muscle coordination patterns were consistent with EMG data from human subjects. Metabolic energy was expended at a rate comparable to that for human walking. A compelling impression of the model’s realism is obtained from a video demonstration of the performance of the model, in which the normal gait of the model skeleton is presented and compared with average measurements from human subjects. It shows the model walking in an extremely humanlike fashion.
With so many muscles and mechanical dimensions, this model has a significant DOF problem. Particularly relevant to discussions of articulatory ease in phonetics is the fact that the results were obtained when a performance criterion of least metabolic cost (= minimum heat production) was used. We can interpret the success of the simulations as implying that the optimization criterion drastically reduces the search space and makes it possible for the algorithm to identify unique and optimal movement trajectories for each subtask during the gait cycle.

Summarizing so far, we note (i) that minimum energy consumption evidently helps solve the DOF problem for nonspeech movements and (ii) that this problem also arises for speech modeling. In the light of those observations, it seems justified to assume that energy costs ought to play an important role in shaping on-line speech as well as sound systems. In order to get a preliminary idea of what energy costs might be for speech movements, a simplified model of the mandible was constructed [Lindblom et al., in press] (fig. 2). The jaw was represented by a system defined by its mass (m), damping (b) and elasticity (k). The mass was equal to 250 g. Critical damping and a resonance frequency of 5 Hz were assumed. The energy required to drive the system was calculated as a function of frequency for a sinusoidal jaw movement of 10 mm amplitude. The results, plotted with energy per distance on the ordinate and frequency of jaw movement on the abscissa, formed a U-shaped function with a distinct minimum, similar to an ‘upside-down’ resonance curve and not unlike the locomotion findings reviewed earlier.

Fig. 2. The energy required to drive a biomechanical model of the jaw as a function of frequency for a sinusoidal movement of 10 mm amplitude.
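The text does not spell out the cost accounting behind figure 2, so the following rough sketch adopts one common assumption, namely that metabolic cost is proportional to the force-time integral of the required driving force; under that assumption the cost per unit distance is indeed an ‘upside-down’ resonance curve with its minimum at the resonance frequency:

```python
import numpy as np

# Rough sketch of the mandible model described above: mass-spring-damper
# with m = 250 g, critical damping and a 5 Hz resonance, driven sinusoidally
# at 10 mm amplitude. Cost is assumed (not taken from the paper) to be
# proportional to the force-time integral of the driving force.

m = 0.25                        # mass, kg
f0 = 5.0                        # resonance frequency, Hz
k = m * (2 * np.pi * f0) ** 2   # stiffness, N/m
b = 2 * np.sqrt(k * m)          # critical damping, N*s/m
A = 0.01                        # amplitude, m (10 mm)

def cost_per_distance(f):
    w = 2 * np.pi * f
    t = np.linspace(0.0, 1.0 / f, 4000, endpoint=False)
    x = A * np.sin(w * t)
    F = m * (-A * w**2 * np.sin(w * t)) + b * (A * w * np.cos(w * t)) + k * x
    impulse = np.abs(F).mean() / f     # force-time integral over one cycle
    return impulse / (4 * A)           # distance covered per cycle is 4A

freqs = np.linspace(0.5, 15.0, 400)
costs = [cost_per_distance(f) for f in freqs]
print(f"minimum near {freqs[np.argmin(costs)]:.2f} Hz (resonance = {f0} Hz)")
```

Analytically, this cost per distance works out to sqrt((k - m*w^2)^2 + (b*w)^2) / w, which is minimized exactly at the resonance frequency regardless of damping, consistent with the U shape reported above.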
Learning to Speak
Articulatory Bootstrapping: ‘Easy-Way-Sounds-OK’

The work of MacNeilage [1998] draws attention to the prominent role of the mandible in babbling and early speech. He argues convincingly that speech did not have to develop a new rhythm generator for the production of syllables. By the evolutionary process of continuity and tinkering, it made conservative use of existing central pattern generators, namely those already developed for vegetative purposes: ‘… speech makes use of the same brainstem pattern generator that ingestive cyclicities do, and … control structures for speech purposes are, in part at least, shared with those of ingestion’ [MacNeilage, 1998, p. 503]. This helps explain the universal fact that virtually every utterance of every speaker of every one of the world’s languages exhibits syllabic organization – that is, involves a mandibular open-close movement. It also sheds light on why, both motorically and sensorily, the jaw and the area around the mouth opening are particularly salient regions of the vocal tract [Lindblom and Lubker, 1985] and are therefore likely to be explored early on.

Let us supplement this scenario with a few remarks based on energetics. Suppose that talking is like walking. In other words, young children vocalizing behave exactly like subjects walking and running in preferring energetically low-cost movements. If so, their vocal systems would tend to be activated at the minimum points of the U-shaped curves of their articulators. To further simplify this view of early vocal
behavior, let us limit the degrees of freedom of the production mechanism to the jaw, because of its vegetative salience. What would the articulatory and acoustic characteristics of opening and closing the jaw at minimum energy cost be like? Metaphorically, the answer would be given by the minimum value of the U-shaped curve and correspond to an open-close alternation near the jaw’s resonance frequency. Combining this movement with phonation would produce a quasi-syllabic acoustic output resembling [bababa]. In other words, least effort applied to the jaw would produce an utterance not unlike canonical babbling. MacNeilage [1998] is right in making us wonder why open-close alternations should be so ubiquitous in spoken language.

We can restate the facts and interpretations presented so far: (1) The low-energy articulatory search (start pianissimo!) is limited to only a fragment of the child’s phonetic space (= mandibular oscillation). (2) It helps the child spontaneously bump into many articulatory patterns used by the ambient phonology (= ‘protosyllables’) by significantly narrowing the alternative possibilities. Could steps (1) and (2) be generalized and incorporated into a more comprehensive model of phonetic learning? Do they constitute a general bootstrapping strategy for discovering native articulatory patterns? An affirmative answer would be possible if it could be shown that: (a) the DOF problem for speech is solved in the same way as it is solved for nonspeech movements – that would produce a strong statistical bias in favor of low-cost motor patterns; (b) many aspects of the world’s phonologies are low-cost motor patterns; (c) by cultural evolution the world’s phonologies could in principle have developed biologically less optimal motor patterns than they use now, but have done so only to a limited extent. In our opinion there is a strong probability that all three claims are correct. The reason is, we suggest, that sound patterns are adapted for phonetic development. Low-cost motor patterns are retained so as to accommodate the child’s energy-efficient search by providing ambient reinforcement of the child’s efforts [MacNeilage and Davis, this vol.]. The phrase ‘easy-way-sounds-OK’ captures the nature of this bootstrapping.

Phonologically organized speech presupposes the specialized ability of vocal imitation [Studdert-Kennedy, 1998, this vol.]. The present account suggests that imitating is supplemented in important ways by mechanisms of motor emergence. As articulations are fortuitously discovered the ‘easy way’ and confirmed by the ambient input, perceptuomotor links get established to budding perceptual categories.

Where Do Phonological Units Come from?

The preceding discussion has concentrated on substantive aspects. In this final section we address the possibility of behavioral origins (as opposed to the prespecification) of a formal universal of linguistic structure, e.g. the combinatorial coding of discrete units. To address this topic, we will describe a game based on a simple algorithm that automatically analyzes holistic patterns into smaller elements and then reuses those elements. The phenomenon of reuse implies combinatorial organization. In keeping with the spirit of the proposed phonetics/phonology program, the point is that the derived units are emergent consequences of system growth and that they do not come prespecified. We suggest that this mechanism is formally similar to what goes on in lexical development.
Phonetically the holistic patterns can be pictured as articulatory and/or auditory patterns. The segmentation into smaller elements defines the ‘units’. Reuse of those units is promoted by the fact that memory storage is associated with a biochemical cost.
This cost is hypothesized to derive from the energy metabolism of memory formation [Gonzales-Lima, 1992] and is an increasing function of the novelty of the stored materials. Since novelty is expensive, holistic coding is disfavored whereas parts-based reuse is not.

At this point a short summary is needed of some simplified neurobiological facts about how memories are encoded. Learning causes the brain to change physically. This change is activity-dependent. Active neural tissue contains more energy-rich substances. Hence, learning costs metabolic energy. Such conclusions have been drawn from histochemical analyses of brain tissue. Cytochrome oxidase is a substance used as a marker of metabolic capacity. The mitochondrial amount of this enzyme is assumed to reflect the functional level of activity in the neuron. More active neurons have more cytochrome oxidase, and more active regions within a neuron have more mitochondria [Wong-Riley, 1989]. Gonzales-Lima [1992] reports experiments in which rats were trained to associate reward with an auditory stimulus (FM signal 1–2 kHz). After training for 11 days, the brains of experimental and control animals were examined for cytochrome oxidase contents in their auditory neostriatum. The experimental group showed significantly increased amounts of cytochrome oxidase. The proposed interpretation is that the memory of the conditioning stimulus changes the neurons activated by the task. This change takes the form of an increase in their metabolic capacity. Fuel (‘potential energy’) is available should a demand arise for their activation (e.g. recall). This is reminiscent of other familiar examples of activity-dependent change, e.g. callous hands and bigger muscles.

These results suggest that a principle of ‘minimal incremental storage’ may be embodied in the neural metabolism of memory formation. If so, it would mean that patterns containing more information (more ‘bits’ in the information theory sense) are energetically more costly to commit to memory, and therefore take longer to do so. Here we do not confidently claim that this is the process underlying phonetic learning. Our objective is rather to demonstrate that a formal-looking property such as combinatorial coding could in principle readily arise for functional reasons. The mere possibility of such an account should make us wary of ‘inescapable’ conclusions about arbitrary formal idiosyncrasies.

It is of course important that an account of emerging structure be completely nonteleological. To say that children acquire large vocabularies because of the advantages of combinatorial coding is to make a teleological argument. What has to be argued instead is that combinatorial coding comes into existence owing to the fortuitous coincidence of several factors. Once that happens, that mode of organization is reinforced by its functional advantages. We hypothesize that one of the causal factors behind the ability to code up to 100,000 words or more [Miller, 1977] is a metabolic constraint on memory formation.

The Nepotism Game: ‘Close Relatives Get Promoted’

Imagine a 10 × 10 matrix with 100 cells. The point of the game is to choose a sequence of n points located in the matrix so that a ‘cost criterion’ is minimized. We consider two alternative definitions of ‘cost’: (1) For every new cell we pay 1 unit! (2) For every new coordinate specification (row or column) we pay 0.5 units! A single item costs 1 unit on either measure.
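A minimal sketch of the two pricing rules (the six-cell sequence is invented so as to reproduce the totals of the figure 3 example discussed below):

```python
# The nepotism game: cost of a sequence of cells in a 10 x 10 matrix
# under the two pricing rules described above.

def cost_rule1(cells):
    # rule (1): every new cell costs 1 unit (Gestalt coding)
    return float(len(set(cells)))

def cost_rule2(cells):
    # rule (2): each new coordinate (row or column) costs 0.5 units,
    # so cells sharing committed rows/columns come at a discount
    rows, cols, cost = set(), set(), 0.0
    for r, c in cells:
        if r not in rows:
            cost += 0.5
            rows.add(r)
        if c not in cols:
            cost += 0.5
            cols.add(c)
    return cost

# six cells reusing three rows and two columns
sequence = [(1, 1), (1, 4), (4, 1), (4, 4), (7, 1), (7, 4)]
print(cost_rule1(sequence), cost_rule2(sequence))   # 6.0 vs 2.5
```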
For the first criterion, the cost is equal to n units regardless of the cells selected. In the case of rule (2), costs can be cut by selecting a
cell in a previously activated row and/or column. As n (system size) increases, numerous opportunities for reuse arise. Figure 3 shows a situation with six points sequentially chosen according to the second measure. Selected cells are marked in black. When a choice is made, the other cells of that row and column become available at half price (0.5 units). This is indicated by the shading. Zero cost is associated with cells at intersections of already committed rows and columns. The example in figure 3 costs 6 units when we pay per cell (first measure), but only 2.5 units when selections are priced by coordinate specifications as in (2). We conclude that rule (1) corresponds to Gestalt coding and that, in conjunction with cost minimization, rule (2) forces the system to go combinatorial.

Fig. 3. Selecting a sequence of n matrix cells in accordance with a cost criterion.

Self-Segmentation and the Emergence of Articulatory ‘Reuse’

To explore what this exercise might tell us about speech, let us interpret the matrix as a crude articulatory space and replace rows and columns by continuous parameters, say the phase and amplitude of elementary oscillatory movement. Along a third dimension we specify the articulator performing the movement. A given point in this three-dimensional space represents a Gestalt motor score. Further suppose that a given child consistently uses forms sounding like [didi], [mWmW] and [bÆbÆ]. In the articulatory space these forms are represented by three points whose coordinates specify the movement parameters: e.g. three amplitude values for the open-close movement of the jaw, two positions (front and back) for the rest/target alternation of the tongue, etc. In standard notation (but without implying any segmental organization), the jaw-tongue parameters form the matrix shown in figure 4a. Each of these specifications is linked to its own type of anatomically distinct oscillatory closure movement: d_d_, m_m_, and b_b_. The nepotism principle (NEP) literally states that a recombination of all these hidden ‘component’ movements is favored by the memory constraint. If NEP were consistently and mechanically implemented, it would yield the additional potential reuse patterns for jaw-tongue movement shown in figure 4b.
Fig. 4. a Oscillatory jaw and tongue movements in [didi], [mWmW] and [bÆbÆ]. (Standard notation without implying segmental organization.) b Jaw and tongue movements complementary to those of a, made available by the mechanism of reuse.
Moreover, it would put a number of forms in a state of ‘readiness’, e.g. [d£d£], [dædæ], [d!d!], [dWdW], [dÆdÆ], [mimi], [m£m£], [mæmæ], [m!m!], [mÆmÆ], [bibi], [b£b£], [bæbæ], [b!b!], [bWbW]. Again no segmental organization is implied.

How does this reuse come about? How are the ‘component movements’ identified? The quotation marks around ‘component’ are important, since so far we have little reason to treat phonetic forms as anything but Gestalts. As a first step, we note that the vocal tract consists of several independently controllable structures. In other words, although early vocalizations do not arise from phoneme-like control signals, the system producing them is in fact anatomically ‘segmented’. Second, we observe that, in many cases, neural representations are somatotopically organized [Kandel et al., 1991], which means that the brain stores individual motor and sensory activities in specific locations with anatomical identity preserved (cf. the notion of the homunculus). Both of these circumstances play a crucial role in the proposed self-segmentation process. Faced with the task of producing ambient forms not yet acquired, the child must solve the problem of assembling new motor programs. NEP predicts that the speed and accuracy of imitation, spontaneous use and recall will depend significantly on whether or not the new form shares ‘component’ movements with old forms. Assembling a new motor score is assisted by overlap with previously encoded patterns even if those patterns are part of unanalyzed wholes and have not yet been ‘defined’ as separate motor entities. For developmental and typological evidence supporting this suggestion, see the review in Lindblom [1998].
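A toy sketch of this NEP prediction, with invented component labels standing in for articulator-specific parameters: the incremental storage cost of a new form is counted over the ‘component’ movements it does not share with forms already in memory:

```python
# Each form is modeled as a set of component movements (a closure type and
# a vowel gesture); the labels are invented stand-ins, not the paper's
# notation. Cost accrues only for components not yet committed to memory,
# mirroring rule (2) of the nepotism game.

def incremental_cost(new_form, memory, unit_cost=0.5):
    known = {part for form in memory for part in form}
    return unit_cost * len(set(new_form) - known)

memory = [{"d_d_", "V1"}, {"m_m_", "V2"}, {"b_b_", "V3"}]

print(incremental_cost({"d_d_", "V2"}, memory))   # 0.0: pure recombination
print(incremental_cost({"b_b_", "V4"}, memory))   # 0.5: one new component
print(incremental_cost({"g_g_", "V5"}, memory))   # 1.0: wholly novel form
```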
We propose that, in part, the NEP bias makes the child engage in spontaneous articulatory reuse and that, in part, the native language favors forms that match the output of NEP. Learners can thus use NEP to find ‘hidden’ structure. Behavioral conditions make certain patterns more functional than others. Languages are molded by those functional constraints. They adapt to them, incorporating fossils of naturalness in their architecture, and by so doing become more learnable and easier to use.
Summary

How do children find the ‘hidden’ structure of speech? This question presupposes that ‘structure’ is something disembodied. In other words, it is seen as embedded in an incomplete, degraded, noisy and infinitely variable signal. That is the traditional but, in our view, not necessarily correct view. Instead the following approach is advocated. Phonetic variations are far from random. They are patterned in principled ways because of perceptual distinctiveness, articulatory dynamics and vocal tract (VT) acoustics [Fant, 1960; Stevens, 1998]. A cumulatively growing, exemplar-based phonetic memory should go a long way towards revealing that patterning to the child. In such a model, ‘categories’ do not resemble the neat, operationally defined units of classical phonemic analysis, since their correlates are likely to be strongly contextually embedded, in a sense ‘hidden’. However, over time, variability would get sorted and disambiguated by context and by the cues providing semantic and situational labeling. ‘Mapping simple, representation complex!’

One source of information for perceptual labeling is articulatory. Research on nonspeech offers the phonetician valuable clues as to how motor processes operate. The role of metabolic cost in solving the DOF problem is a case in point. We have made the parsimonious assumption that speech movements are organized like other movements. Therefore energetics should be relevant. From that conclusion we were led to propose a two-part hypothesis: Easy-way-sounds-OK! It says (1) that children initially explore their vocal resources in an energetically low-cost mode and (2) that sound patterns have adapted to reward that behavior. This is a kind of ‘conspiracy’ that makes children stumble on motorically motivated phenomena in the ambient language, such as syllabic organization. It also establishes motor links to perceptual forms (together with imitation).

A related scenario was sketched for the development of the phonemically coded lexicon. We suggested that a linguistic system with featural and phonemic recombination humors learners whose memories charge a metabolic fee for storage. If that fee increases with the number of bits (amount of information) to be stored, it follows that patterns that do not share materials (Gestalts) are costly, whereas patterns with overlap are cheaper. Somatotopic organization and VT anatomy were found to impose an unsupervised segmentation of this overlap into articulator-specific parameters. This is the process that leads the child to the ‘phonetic gesture’ [Studdert-Kennedy, this vol.; Carré and Divenyi, this vol.]. Metabolically controlled reuse is thus launched and paves the way for cognitively driven and combinatorial vocabulary growth. These considerations favor the view that phonemic coding is an adaptive emergent rather than a formal idiosyncrasy of our genetic endowment for language.

Emergent phonology is proposed to promote a new vision of the relationship between phonetics and phonology. By substituting it for the traditional division of labor, we would get away from Chomsky’s ‘inescapable dogma’. The distinctions between form/substance and competence/performance should be abandoned, having served their historical purpose. There is no split between phonetics and phonology because, from the developmental point of view, phonology remains behavior. Phonology differs qualitatively from phonetics in that it represents a new, more complex and higher level of organization of that behavior. For the child, phonology is not abstract.
Its foundation is an emergent patterning of phonetic content. The starting point is the behavior; ‘structure’ unfolds from it. Therefore the issue of ‘psychological reality’ does not arise. Similarly, explanations need not be limited to post hoc experimental justifications for postulated formal phenomena but are integrated into the theory’s predictions. Behavioral realism and explanatory adequacy are given free rein.
Acknowledgments This research is supported by grant number BCS-9901021 from the National Science Foundation, Washington, D.C.
References

Anderson, S.R.: Why phonology isn’t ‘natural’. Linguistic Inquiry 12: 493–539 (1981).
Anderson, S.R.: Phonology in the twentieth century (Chicago University Press, Chicago 1985).
Anderson, F.C.; Pandy, M.G.: A dynamic optimization solution for vertical jumping in three dimensions. Computer Methods in Biomechanics and Biomedical Engineering, pp. 1–31 (1999).
Carré, R.; Divenyi, P.L.: Modeling and perception of ‘gesture reduction’. Phonetica, this vol.
Chomsky, N.: Current trends in linguistic theory; in Fodor, Katz, The structure of language, pp. 50–118 (Prentice-Hall, New York 1964).
Chomsky, N.; Halle, M.: The sound pattern of English (Harper & Row, New York 1968).
Davis, B.L.; Lindblom, B.: Prototype formation in speech development and phonetic variability in baby talk; in Lacerda, von Hofsten, Heiman, Emerging cognitive abilities in early infancy (Erlbaum, Hillsdale 2000).
Diehl, R.L.; Lindblom, B.: Explaining the structure of feature and phoneme inventories; in Greenberg, Ainsworth, Speech processing in the auditory system, Springer Handbook of Auditory Research (SHAR) (in press).
Estes, W.K.: Concepts, categories, and psychological science. Psychol. Sci. 4: 143–153 (1993).
Fant, G.: The acoustic theory of speech production (Mouton, The Hague 1960).
Fernald, A.: The perceptual and affective salience of mothers’ speech to infants; in Feagans, Garvey, Golinkoff, The origins and growth of communication, pp. 5–29 (Ablex, New Brunswick 1984).
Fischer-Jørgensen, E.: Trends in phonological theory: a historical introduction (Akademisk forlag, Copenhagen 1975).
Fónagy, I.: La vive voix (Payot, Paris 1983).
Fowler, C.A.: An event approach to the study of speech perception from a direct-realist perspective. J. Phonet. 14: 3–28 (1986).
Fowler, C.A.: Speech perception: direct realist theory; in Asher, Encyclopedia of language and linguistics, pp. 4199–4203 (Pergamon, New York 1994).
Fromkin, V.: Speech errors as linguistic evidence (Mouton, The Hague 1973).
Gay, T.; Lindblom, B.; Lubker, J.: Production of bite-block vowels: acoustic equivalence by selective compensation. J. acoust. Soc. Am. 69: 802–810 (1981).
Gonzales-Lima, F.: Brain imaging of auditory learning functions in rats: studies with fluorodeoxyglucose autoradiography and cytochrome oxidase histochemistry; in Gonzales-Lima, Finkenstädt, Sheich, Advances in metabolic mapping techniques for brain imaging of behavioral and learning functions. NATO ASI Series D:68 (Kluwer, Dordrecht 1992).
Halle, M.: On the bases of phonology; in Fodor, Katz, The structure of language, pp. 604–612 (Prentice-Hall, New York 1964).
Halle, M.; Stevens, K.N.: Some reflections on the theoretical bases of phonetics; in Lindblom, Öhman, Frontiers of speech communication research, pp. 335–353 (Academic Press, London 1979).
Halle, M.; Stevens, K.N.: Knowledge of language and the sounds of speech; in Sundberg, Nord, Carlson, Music, language, speech and brain (Macmillan, Basingstoke 1991).
Hinton, G.E.; Sejnowski, T.J.: Unsupervised learning: foundations of neural computation (MIT Press, Cambridge 1999).
Hoyt, D.F.; Taylor, C.R.: Gait and the energetics of locomotion in horses. Nature 292: 239 (1981).
Jakobson, R.; Halle, M.: Phonology in relation to phonetics; in Malmberg, Manual of phonetics, pp. 411–449 (North-Holland, Amsterdam 1968).
Johnson, K.: Adaptive dispersion in vowel perception. Phonetica, this vol.
Johnson, K.; Mullenix, J.: Complex representations used in speech processing: overview of the book; in Johnson, Mullenix, Talker variability in speech processing, pp. 1–8 (Academic Press, San Diego 1997).
Kandel, E.; Schwartz, J.; Jessel, T.: Principles of neural science; 3rd ed. (Elsevier, New York 1991).
Kluender, K.R.; Diehl, R.; Killeen, P.: Japanese quail can learn phonetic categories. Science 237: 1195–1197 (1987).
Kohler, K.: Segmental reduction in connected speech in German: phonological facts and phonetic explanations; in Hardcastle, Marchal, Speech production and speech modeling, pp. 69–92 (Kluwer, Dordrecht 1990).
Kohler, K.: Investigating unscripted speech: implications for phonetics and phonology. Phonetica, this vol.
Kuhl, P.K.; Andruski, J.E.; Chistovich, I.A.; Chistovich, L.A.; Koshevnikova, E.V.; Ryskina, V.L.; Stolyarova, E.I.; Sundberg, U.; Lacerda, F.: Cross-language analysis of phonetic units in language addressed to infants. Science 277: 684–686 (1997).
Lacerda, F.: The perceptual magnet effect: an emergent consequence of exemplar-based phonetic memory. Proc. ICPhS Stockholm, vol. 2, pp. 140–147, 1995.
Ladefoged, P.: ‘Out of chaos comes order’: physical, biological and structural patterns in phonetics; in Van den Broecke, Cohen, Proc. 10th Int. Congr. Phonet. Sci., vol. IIB, pp. 83–95, 1984.
Ladefoged, P.: Some reflections on the IPA. J. Phonet. 18: 335–346 (1990).
Lee, D.D.; Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401: 788–791 (1999).
Liberman, A.; Mattingly, I.: The motor theory of speech perception revised. Cognition 21: 1–36 (1985).
Lindblom, B.: The goals of phonetics, its unification and application. Phonetica 37: 7–26 (1980).
Lindblom, B.: Economy of speech gestures; in MacNeilage, Speech production, pp. 217–246 (Springer, New York 1983).
Lindblom, B.: Explaining phonetic variation: a sketch of the H&H theory; in Hardcastle, Marchal, Speech production and speech modeling, pp. 403–439 (Kluwer, Dordrecht 1990).
Lindblom, B.: Role of articulation in speech perception: clues from production. J. acoust. Soc. Am. 99: 1683–1692 (1996).
Lindblom, B.: Systemic constraints and adaptive change in the formation of sound structure; in Hurford, Studdert-Kennedy, Knight, Approaches to the evolution of language, pp. 242–264 (Cambridge University Press, Cambridge 1998).
Lindblom, B.; Davis, J.; Brownlee, S.; Moon, S.-J.; Simpson, Z.: Energetics in phonetics and phonology; in Fujimura, et al., Linguistics and phonetics (Ohio State University, Columbus, in press).
Lindblom, B.; Guion, S.; Hura, S.; Moon, S.-J.; Willerman, R.: Is sound change adaptive? Revta ling. 7: 5–37 (1995).
Lindblom, B.; Lubker, J.: The speech homunculus and a problem of phonetic linguistics; in Fromkin, Phonetic linguistics, pp. 169–192 (Academic Press, London 1985).
Lindblom, B.; Lubker, J.; Gay, T.: Formant frequencies of some fixed-mandible vowels and a model of speech programming by predictive simulation. J. Phonet. 7: 147–162 (1979).
Lindblom, B.; Sundberg, J.: Acoustical consequences of lip, tongue, jaw and larynx movement. J. acoust. Soc. Am. 50: 1166–1179 (1971).
Logothetis, N.K.; Sheinberg, D.J.: Visual object recognition. Annu. Rev. Neurosci. 19: 577–621 (1996).
McArdle, W.D.; Katch, F.I.; Katch, V.L.: Exercise physiology; 4th ed. (Williams & Wilkins, Baltimore 1996).
MacNeilage, P.F.: The frame/content theory of evolution of speech production. Behav. Brain Sci. 21: 499–546 (1998).
MacNeilage, P.F.; Davis, B.L.: Deriving speech from non-speech: a view from ontogeny. Phonetica, this vol.
McNeill Alexander, R.: The human machine (Columbia University Press, New York 1992).
Maeda, S.: On articulatory and acoustic variabilities. J. Phonet. 19: 321–331 (1991).
Malmberg, B.: Manual of phonetics (North-Holland, Amsterdam 1968).
Mel, B.W.: Think positive to find parts. Nature 401: 759–760 (1999).
Miller, G.A.: Spontaneous apprentices (Seabury Press, New York 1977).
Ohala, J.J.: The phonetics and phonology of aspects of assimilation; in Kingston, Beckman, Papers in laboratory phonology, vol. 1: Between grammar and the physics of speech, pp. 258–275 (Cambridge University Press, Cambridge 1990).
Pandy, M.G.; Anderson, F.C.: Dynamic simulation of human movement using large-scale models of the body. Phonetica, this vol.
Perkell, J.; Klatt, D.: Invariance and variability of speech processes (Erlbaum, Hillsdale 1986).
Ralston, H.J.: Energetics of human walking; in Herman, Grillner, Stein, Stuart, Neural control of locomotion, pp. 77–98 (Plenum Press, New York 1976).
Saussure, F. de: Cours de linguistique générale (Payot, Paris 1916).
Stevens, K.N.: Acoustic phonetics (MIT Press, Cambridge 1998).
Studdert-Kennedy, M.: Evolutionary implications of the particulate principle: imitation and the dissociation of phonetic form from semantic function; in Knight, Studdert-Kennedy, Hurford, The emergence of language: social function and the origins of linguistic form (Cambridge University Press, Cambridge 1998).
Studdert-Kennedy, M.: Imitation and the emergence of segments. Phonetica, this vol.
Sundberg, J.: Emotive transforms. Phonetica, this vol.
Sundberg, U.: Mother tongue: phonetic aspects of infant-directed speech; PhD diss. Stockholm University (1998).
Sweet, H.: Handbook of phonetics (Frowde, Oxford 1877).
Wachsmuth, E.; Oram, M.W.; Perrett, D.J.: Recognition of objects and their components: responses of single units in the temporal cortex of the macaque. Cereb. Cortex 4: 509–522 (1994).
Wong-Riley, M.T.T.: Cytochrome oxidase: an endogenous metabolic marker for neuronal activity. Trends Neurosci. 12: 94–101 (1989).
Young, S.J.; Woodland, P.C.: The use of tying in continuous speech recognition. Proc. Eurospeech 93, pp. 2203–2206, 1993.
Publications
Björn Lindblom
Phonetica 2000;57:315–321

Theses

Lindblom, B.: On vowel reduction; Fil. lic. thesis University of Uppsala, Rep. No. 29, Speech Transmission Laboratory, Royal Institute of Technology, Stockholm (1963).
Lindblom, B.: On the production and recognition of vowels; doct. diss. Lund University (1968).
Books and Monographs Edited

Lindblom, B.; Murray, T.; Spens, K.-E.: Övningsmaterial i akustisk fonetik. PILUS 1 (Institute of Linguistics, Stockholm University, 1969).
Jonasson, J.; Lindblom, B.; Serpa-Leitão, A.: Spektrografiska illustrationer av några IPA-tecken. PILUS 5 (Institute of Linguistics, Stockholm University, 1970).
Lindblom, B.: Vad är fonetik? (Gleerup, Lund 1972).
Lindblom, B.: På väg till laboratoriet (Gleerup, Lund 1974).
Lindblom, B.; Lubker, J.; Fritzell, B. (eds): Experimentalfonetiska studier av dysartri. PILUS 27 (Institute of Linguistics, Stockholm University, 1974).
Lindblom, B.; Nordström, P.-E. (eds): Fonetik och uttalspedagogik. Papers from a symposium (Institute of Linguistics, Stockholm University, 1975).
Lindblom, B.; Lubker, J. (eds): Experiments in speech perception. PERILUS I (Institute of Linguistics, Stockholm University, 1979).
Lindblom, B.; Öhman, S. (eds): Frontiers of speech communication research (Academic Press, London 1979).
Grillner, S.; Lindblom, B.; Lubker, J.; Person, A. (eds): Speech motor control (Pergamon Press, Oxford 1982).
Lindblom, B. (ed): Speech processes in the light of action theory and event perception, theme issue with open peer commentary. J. Phonet. 14: 1–196 (1986).
Lindblom, B.; Zetterström, R. (eds): Precursors of early speech (Macmillan Press, Basingstoke 1986).
Articles and Research Notes

Cederlund, C.; Lindblom, B.F.; Martony, J.; Møller, A.; Öhman, S.: Automatic identification of sound features. Speech Transm. Lab., Q. Prog. Status Rep. No. 2, p. 10 (1960).
Lindblom, B.; Öhman, S.; Risberg, A.: Evaluation of spectrographic data sampling techniques. Speech Transm. Lab., Q. Prog. Status Rep. No. 1, pp. 11–13 (1960).
Lindblom, B.: Spectrographic measurements. Speech Transm. Lab., Q. Prog. Status Rep. No. 2, pp. 5–6 (1960).
Lisker, L.; Martony, J.; Lindblom, B.; Öhman, S.: F-pattern approximations of voiced stops and fricatives. Speech Transm. Lab., Q. Prog. Status Rep. No. 1, pp. 20–22 (1960).
Lindblom, B.: Sona-Graph measurements. Speech Transm. Lab., Q. Prog. Status Rep. No. 3, pp. 3–5 (1961).
Fant, G.; Lindblom, B.: Studies of minimal sound units. Speech Transm. Lab., Q. Prog. Status Rep. No. 1, pp. 1–11 (1961).
Pickett, J.M.; Lindblom, B.; Martony, J.; Öhman, S.: F-pattern approximations of voiced stops and fricatives. Speech Transm. Lab., Q. Prog. Status Rep. No. 1, pp. 20–22 (1961).
Lindblom, B.: Accuracy and limitations of Sona-Graph measurements. Proc. 4th Int. Congr. Phonet. Sci., Helsinki 1961, pp. 188–202 (Mouton, ’s-Gravenhage 1962).
Fintoft, K.; Lindblom, B.; Martony, J.: Formant amplitude measurements. Speech Transm. Lab., Q. Prog. Status Rep. No. 2, pp. 9–17 (1962).
Martony, J.; Cederlund, C.; Liljencrants, J.; Lindblom, B.: On the analysis and synthesis of vowels and fricatives. Proc. 4th Int. Congr. Phonet. Sci., Helsinki 1961.
Lindblom, B.: Spectrographic study of the dynamics of vowel articulation. Speech Transm. Lab., Q. Prog. Status Rep. No. 1, pp. 5–9 (1963).
Lindblom, B.: Spectrographic study of vowel reduction. J. Acoust. Soc. Am. 35: 1773–1781 (1963/1991).
Fant, G.; Fintoft, K.; Liljencrants, J.; Lindblom, B.; Martony, J.: Formant amplitude measurements. Proc. Speech Commun. Semin., Stockholm 1962, vol. 1, pp. 1–13 (1963).
Fant, G.; Fintoft, K.; Liljencrants, J.; Lindblom, B.; Martony, J.: Formant amplitude measurements. J. Acoust. Soc. Am. 35: 1753–1761 (1963).
Fant, G.; Lindblom, B.; Martony, J.: Spectrograms of Swedish stops. Speech Transm. Lab., Q. Prog. Status Rep. No. 3 (1963).
Lindblom, B.: A note on segment duration in Swedish polysyllables. Speech Transm. Lab., Q. Prog. Status Rep. No. 2, pp. 1–5 (1964).
Lindblom, B.: Articulatory activity in vowels. Speech Transm. Lab., Q. Prog. Status Rep. No. 1, pp. 1–15 (1964).
Lindblom, B.: Analysis of labial movement. Speech Transm. Lab., Q. Prog. Status Rep. No. 2, pp. 20–22 (1965).
Lindblom, B.: Jaw-dependence of labial parameters and a measure of labialization. Speech Transm. Lab., Q. Prog. Status Rep. No. 3, pp. 12–15 (1965).
Lindblom, B.: Studies of labial articulation. Speech Transm. Lab., Q. Prog. Status Rep. No. 4, pp. 7–9 (1965).
Lindblom, B.: Dynamic aspects of vowel articulation; in Proc. 5th Int. Congr. Phonet. Sci., Münster 1964, pp. 387–388 (Karger, Basel 1965).
Lindblom, B.; Florén, Å.: Estimating short-term context-dependence of formant pattern perception. Speech Transm. Lab., Q. Prog. Status Rep. No. 2, pp. 24–26 (1965).
Lindblom, B.; Soron, H.: Analysis of labial movement. J. Acoust. Soc. Am. 38: 935 (1965).
Lindblom, B.; Bivner, P.: A method for continuous recording of articulatory movement. Speech Transm. Lab., Q. Prog. Status Rep. No. 1, pp. 14–16 (1966).
Fant, G.; Lindblom, B.; Serpa-Leitão, A.: Consonant confusions in English and Swedish: A pilot study. Speech Transm. Lab., Q. Prog. Status Rep. No. 4, pp. 31–34 (1966).
Lindblom, B.: Vowel duration and a model of lip-mandible coordination. Speech Transm. Lab., Q. Prog. Status Rep. No. 4, pp. 1–29 (1967).
Lindblom, B.; Studdert-Kennedy, M.: Estimating short-term context-dependence of formant pattern perception. II. Results. Speech Transm. Lab., Q. Prog. Status Rep. No. 1, pp. 21–24 (1967).
Lindblom, B.; Studdert-Kennedy, M.: On the role of formant transitions in vowel recognition. J. Acoust. Soc. Am. 42: 830–843 (1967).
Heinz, J.; Lindblom, B.; Lindqvist, J.: Patterns of residual masking for sounds with speech-like characteristics. Speech Transm. Lab., Q. Prog. Status Rep. No. 2–3 (1967).
Heinz, J.; Lindblom, B.; Lindqvist, J.: Patterns of residual masking for sounds with speech-like characteristics. IEEE Trans. Audio Electroacoust. 16: 107–111 (1968).
Lindblom, B.: Temporal organization of syllable production. Speech Transm. Lab., Q. Prog. Status Rep. No. 2–3, pp. 1–5 (1968).
Lindblom, B.: Studies of labial articulation. Z. Phonet., Sprachw. KommunForsch. 21: 171–172 (1968).
Lindblom, B.; Sundberg, J.: En fonetisk beskrivning av svenska vokaler; in Cleve, Svenskans beskrivning 5; also as PILUS 2, pp. 80–111 (Institute of Linguistics, Stockholm University 1969).
Lindblom, B.; Sundberg, J.: Towards a generative theory of melody. Swed. J. Musicol. 52: 71–88 (1970).
Lindblom, B.: Commentary on Eguchi, S.; Hirsh, I.J.: Development of speech sounds in children; in Fant, Speech communication ability and profound deafness, pp. 157–162 (Alexander Graham Bell Association for the Deaf, Washington 1970).
Lindblom, B.: Neurophysiological representation of speech sounds. PILUS 7, pp. 1–15 (Institute of Linguistics, Stockholm University, 1971).
Lindblom, B.; Rapp, K.: Reinterpreting the compensatory adjustments of vowel and consonant segments in Swedish words. Speech Transm. Lab., Q. Prog. Status Rep. No. 4, pp. 19–25 (1971).
Lindblom, B.; Sundberg, J.: A quantitative theory of cardinal vowels and the teaching of pronunciation; in Perren, Trim, Applications of linguistics (Cambridge University Press, Cambridge 1971). Lindblom, B.; Sundberg, J.: Acoustical consequences of lip, tongue, jaw and larynx movement. J. Acoust. Soc. Am. 50: 1166–1179 (1971/1991). Lindblom, B.: Phonetics and the description of language; in Rigault, Charbonneau, Proc. 7th Int. Congr. Phonet. Sci., Montreal (Mouton, The Hague 1972). Lindblom, B.; Rapp, K.: Some determinants of vowel duration in Swedish words. J. Acoust. Soc. Am. 52: 133 (1972). Lindblom, B.; Sundberg, J.: Observations on tongue contour length. Speech Transm. Lab., Q. Prog. Status Rep. No. 4, pp. 1–5 (1972). Lindblom, B.; Sundberg, J.: Music composed by a computer program. Speech Transm. Lab., Q. Prog. Status Rep. No. 4, pp. 20–28 (1972). Lindblom, B.; Sundberg, J.: Approaches to articulatory modeling. GALF 1972: 3–45. Lindblom, B.; Svensson S.-G.: Interaction entre facteurs segmentaux et non-segmentaux dans la reconnaissance de la parole. GALF 1972: 119–213. Carlson, R.; Granström, B.; Lindblom, B.; Rapp, K.: Some timing and fundamental frequency characteristics of Swedish sentences. Speech Transm. Lab., Q. Prog. Status Rep. No. 4, pp. 11–19 (1972). Leanderson, R.; Lindblom, B.: Muscle activation for labial speech gestures: an electromyographic and acoustic study of normal speakers. Acta oto-lar. 73: 363–373 (1972). Liljencrants, J.; Lindblom, B.: Numerical simulation of vowel quality systems: the role of perceptual contrast. Language 48: 839–862 (1972).
Lindblom, B.; Rapp, K.: Some temporal regularities of Swedish. PILUS 21 (Institute of Linguistics, Stockholm University, 1973).
Lindblom, B.; Sundberg, J.; Rapp, K.: On the control mechanisms underlying normal speech production. Electroenceph. clin. Neurophysiol. 34: 774 (1973).
Lindblom, B.; Svensson, S.-G.: Interaction between segmental and non-segmental factors in the recognition of speech. IEEE Trans. Audio Electroacoust. 21: 536–545 (1973).
Lindblom, B.: Talet – ett fönster mot hjärnan [Speech – a window on the brain]; in Teleman, Hultman, Språket i bruk, pp. 13–49 (Liber Läromedel/Gleerup, Lund 1974).
Lindblom, B.: Les mécanismes des contrôles moteurs [The mechanisms of motor control]. Bull. Institut Phonét. Grenoble 3: 1–21 (1974).
Lindblom, B.: Experiments in sound structure; 8th Int. Congr. Phonet. Sci., Leeds. Revue Phonét. appl. 51: 155–189 (1975).
Lindblom, B.: Some temporal regularities of spoken Swedish; in Fant, Tatham, Auditory analysis and perception of speech (Academic Press, New York 1975).
Lindblom, B.; Pauli, S.; Sundberg, J.: Modeling coarticulation in apical stops; in Fant, Proc. SCS-74, Speech Communication, pp. 87–94 (1975).
Carlson, R.; Erikson, Y.; Granström, B.; Lindblom, B.; Rapp, K.: Neutral and emphatic stress patterns in Swedish; in Fant, Proc. SCS-74, Speech Communication, pp. 209–217 (1975).
Lindblom, B.: Instrumentell analys av tal i uttalsundervisningen [Instrumental analysis of speech in pronunciation teaching]; in Sigurd, af Trampe, Lingvistik och språkpedagogik. Symposium, pp. 81–98 (Institute of Linguistics, Stockholm University, 1976).
Lindblom, B.; Lyberg, B.; Holmgren, K.: Durational patterns of Swedish phonology: do they reflect short-term memory processes? (Department of Linguistics, Stockholm University, 1976/1981).
Sundberg, J.; Lindblom, B.: Generative theories in language and music descriptions. Cognition 4: 99–122 (1976).
Lindblom, B.: Some uses of articulatory models in phonetic theory; in Carré, Descout, Wajskop, Articulatory modeling and phonetics, GALF, pp. 99–104 (1977).
Lindblom, B.; Lubker, J.; Pauli, S.: An acoustic-perceptual method for the quantitative evaluation of hypernasality. J. Speech Hear. Res. 20: 485–496 (1977).
Lindblom, B.; Lubker, J.; McAllister, R.: Compensatory articulation and the modeling of normal speech behavior; in Carré, Descout, Wajskop, Articulatory modeling and phonetics, GALF, pp. 147–161 (1977).
Lubker, J.; McAllister, R.; Lindblom, B.: On the notion of inter-articulatory programming. J. Phonet. 5: 213–226 (1977).
Lubker, J.; McAllister, R.; Lindblom, B.: Some determinants of vowel duration in Swedish words. J. Acoust. Soc. Am. 62: S16 (1977).
Lundström, E.; Carlson, J.; Lindblom, B.: On the representation of vowel formants in pulsation-threshold measurements. J. Acoust. Soc. Am. 62: S59 (1977).
Lindblom, B.: Final lengthening in speech and music; in Gårding, Bannert, Nordic prosody, pp. 85–102 (1978).
Lindblom, B.: Phonetic aspects of linguistic explanation; in Sign and Sound, Studies presented to Bertil Malmberg on the occasion of his sixty-fifth birthday. Stud. ling. 32: 137–153 (1978).
Lindblom, B.: Les aspects phonétiques de l'explication en linguistique [The phonetic aspects of linguistic explanation]. Bull. Institut Phonét. Grenoble 2: 1–23 (1978).
Lindblom, B.: Some phonetic null hypotheses for a biological theory of language. Proc. 9th Int. Congr. Phonet. Sci., vol. I, pp. 3–10 (1979).
Lindblom, B.; Lubker, J.; Gay, T.: Formant frequencies of some fixed-mandible vowels and a model of motor programming by predictive simulation. J. Phonet. 7: 147–161 (1979).
Bladon, R.A.W.; Lindblom, B.: Spectral and temporal-domain questions for an auditory model of vowel perception. Proc. Inst. Acoust., autumn conf., pp. 79–83 (1979).
Lindblom, B.: Att kunna tala och förstå tal [To be able to speak and understand speech]; in Stigmark, Wengelin, Kommunikation – trots handikapp, pp. 53–59 (Riksbankens Jubileumsfond, 1980).
Lindblom, B.: The goal of phonetics, its unification and application. Phonetica 37: 7–26 (1980).
Welin, C.-W.; Lindblom, B.: The identification of prosodic information from acoustic records by phoneticians. Abstr. 99th ASA Meet., Atlanta. J. Acoust. Soc. Am. 67: S65 (1980).
Bladon, R.A.; Lindblom, B.: Modeling the judgement of vowel quality differences. J. Acoust. Soc. Am. 69: 1414–1422 (1981).
Gay, T.; Lindblom, B.; Lubker, J.: Production of bite-block vowels: acoustic equivalence by selective compensation. J. Acoust. Soc. Am. 69: 802–810 (1981).
Lindblom, B.: A deductive account of vowel features; in House, Acoustic Phonetics and Speech Modeling, Project Scamp, part 3 (Institute for Defense Analyses, Princeton 1981).
Lindblom, B.: The interdisciplinary challenge of speech motor control; in Grillner, Lindblom, Lubker, Person, Speech motor control, pp. 3–18 (Pergamon Press, Oxford 1982).
Lindblom, B.; Schulman, R.: The target theory of speech production in the light of mandibular dynamics. Proc. Inst. Acoust., England, A2 1–A2 5 (1982).
Tyler, R.S.; Lindblom, B.: Preliminary study of simultaneous masking and pulsation threshold patterns of vowels. J. Acoust. Soc. Am. 71: 220–224 (1982).
Lindblom, B.: On the teleological nature of speech processes. 11th Int. Congr. on Acoustics, Toulouse, Speech Communication 2, pp. 155–158 (1983).
Lindblom, B.: On the notion of self-organizing systems in physics, chemistry, biology – and linguistics? in Karlsson, 7th Scand. Conf. Ling., Helsinki 1983.
Lindblom, B.: Förstå och underförstå, något om de processer som formar talrörelserna [To understand and to imply: a note on the processes that shape speech movements]; in Teleman, Tal och tanke, pp. 147–178 (Liber Läromedel/Gleerup, Lund 1983).
Lindblom, B.: Economy of speech gestures; in MacNeilage, The production of speech (Springer, New York 1983).
Lindgren, R.; Lindblom, B.: Speech perception processes. Scand. Audiol. 18: suppl., pp. 57–70 (1983).
Lindblom, B.; MacNeilage, P.; Studdert-Kennedy, M.: Self-organizing processes and the explanation of phonological universals; in Butterworth, Comrie, Dahl, Explanations for language universals (Mouton, Berlin 1983).
Lindblom, B.: Can the models of evolutionary biology be applied to phonetic problems? in Cohen, van den Broecke, Proc. 10th Int. Congr. Phonet. Sci. (Foris, Dordrecht 1984). (Cf. companion discussion paper by Ladefoged, P.: Out of chaos comes order, pp. 67–82, 1984.)
Lacerda, F.; Lindblom, B.: How do stimulus onset characteristics influence formant frequency difference limens? in Cohen, van den Broecke, Abstr. 10th Int. Congr. Phonet. Sci., p. 491 (Foris, Dordrecht 1984).
MacNeilage, P.; Studdert-Kennedy, M.; Lindblom, B.: Functional precursors to language and its lateralization. Am. J. Physiol. 246: R912–R914 (1984).
Lindblom, B.; Lubker, J.: The speech homunculus and a problem of phonetic linguistics; in Fromkin, Phonetic linguistics: essays in honor of Peter Ladefoged, pp. 169–192 (Academic Press, Orlando 1985).
MacNeilage, P.; Studdert-Kennedy, M.; Lindblom, B.: Planning and production of speech: an overview; in Lauter, Proc. Conf. on Planning and Production of Speech by Normally Hearing and Deaf People. Am. Speech Hear. Ass. Rep. (1985).
Lindblom, B.: Phonetic universals in vowel systems; in Ohala, Jaeger, Experimental phonology, pp. 13–44 (Academic Press, Orlando 1986).
Lindblom, B.; MacNeilage, P.: Action theory: problems and alternative approaches. J. Phonet. 14: 117–132 (1986).
Lindblom, B.: On the origin and purpose of discreteness and invariance in sound patterns; in Perkell, Klatt, Invariance and variability of speech processes, pp. 493–523 (Erlbaum, Hillsdale 1986). (See companion discussion papers by Bromberger, Halle, Ferguson, Port, pp. 510–523, 1986.)
Holmgren, K.; Lindblom, B.; Aurelius, G.; Jalling, B.; Zetterström, R.: On the phonetics of infant vocalization; in Lindblom, Zetterström, Precursors of early speech, pp. 51–63 (Macmillan Press, Basingstoke 1986).
Bickley, C.; Lindblom, B.; Roug, L.: Acoustic measures of rhythm in infants’ babbling, or ‘all god’s children got rhythm’. Proc. Int. Congr. Acoust. 12, Canada, A6–4 (1986).
Lindblom, B.: A typological study of consonant systems: role of inventory size. RUUL 17, pp. 1–9 (Department of Linguistics, Uppsala University, 1987).
Lindblom, B.: Absolute constancy and adaptive variability: two themes in the quest for phonetic invariance. Proc. 11th Int. Congr. Phonet. Sci., Tallinn 1987, pp. 5–18.
Lindblom, B.; Lubker, J.; Gay, T.; Lyberg, B.; Branderud, P.; Holmgren, K.: The concept of target and speech timing; in Channon, Shockey, In honor of Ilse Lehiste, pp. 161–181 (Foris, Dordrecht 1987).
House, D.; Bruce, G.; Lacerda, F.; Lindblom, B.: Automatic prosodic analysis for Swedish speech recognition; in Laver, Jack, Eur. Conf. on Speech Technol., Edinburgh 1987, vol. 1, pp. 215–218.
House, D.; Bruce, G.; Lacerda, F.; Lindblom, B.: Automatic prosodic analysis for Swedish speech recognition. Dept. Ling. Phonet., Lund Univ., Working Papers 31: 87–101 (1987).
House, D.; Bruce, G.; Lacerda, F.; Lindblom, B.: Prosodisk parsning för igenkänning av svenska [Prosodic parsing for the recognition of Swedish]; in Löfqvist, Tal-Ljud-Hörsel 87, Lund University, pp. 20–21 (1987).
MacNeilage, P.; Studdert-Kennedy, M.; Lindblom, B.: Primate handedness reconsidered. Behav. Brain Sci. 10: 247–303 (1987).
Lindblom, B.: Phonetic invariance and the adaptive nature of speech; in Elsendoorn, Bouma, Working models of human perception, pp. 139–173 (Academic Press, London 1988). (Cf. companion discussion paper by Ohala, pp. 175–183, 1988.)
Lindblom, B.; Maddieson, I.: Phonetic universals in consonant systems; in Hyman, Li, Language, speech and mind. Studies in honor of Victoria Fromkin, pp. 62–78 (Routledge, London 1988).
Lindblom, B.: Some remarks on the origin of the phonetic code; in von Euler, Lundberg, Lennerstrand, Brain and reading, pp. 27–44 (Stockton Press, New York 1989).
Lindblom, B.: Role of input in children's early vocabulary; in von Euler, Forssberg, Lagercrantz, Neurobiology of early infant behavior, pp. 303–307 (Stockton Press, New York 1989).
Lindblom, B.; Engstrand, O.: In what sense is speech quantal? J. Phonet. 17: 3–45 (1989).
Lindblom, B.: On the notion of ‘possible speech sound’. J. Phonet. 18: 135–152 (1990).
Lindblom, B.: Explaining phonetic variation: a sketch of the H&H theory; in Hardcastle, Marchal, Speech production and speech modeling, pp. 403–439 (Kluwer, Dordrecht 1990).
Lindblom, B.: Adaptive complexity in sound patterns. Behav. Brain Sci. 13: 743–744 (1990). Commentary on target article by Pinker, Bloom, Natural language and natural selection, Behav. Brain Sci. 13: 707–784 (1990).
Lindblom, B.: On the communicative process: speaker-listener interaction and the development of speech. AAC Augmentative and Alternative Communication. Proc. 4th Biennial Int. ISAAC Conf. on Augmentative and Alternative Commun., pp. 220–230 (1990).
Keating, P.A.; Lindblom, B.; Lubker, J.; Kreiman, J.: Jaw positions in English and Swedish VCVs. UCLA Working Papers 74: 77–95 (1990).
Sundberg, J.; Lindblom, B.: Acoustic estimations of the front cavity in apical stops. J. Acoust. Soc. Am. 88: 1313–1317 (1990).
Lindblom, B.: The status of phonetic gestures; in Mattingly, Studdert-Kennedy, Modularity and the motor theory of speech perception. Proc. Conf. to Honor Alvin M. Liberman, pp. 7–24 (Erlbaum, Hillsdale 1991).
Lindblom, B.: The relations between speech production and levels of representation; in Proc. 12th Int. Congr. Phonet. Sci., Aix-en-Provence 1991.
Lindblom, B.: Summary and discussion of cognitive and perceptual aspects; in Sundberg, Nord, Carlson, Music, language, speech and brain. Wenner-Gren Int. Symp. Ser. 59, pp. 304–310 (Macmillan, Basingstoke 1991).
Davis, B.L.; Lindblom, B.: Prototypical vowel information in baby talk; in Engstrand, Kylander, Symp. on Curr. Phonet. Res. Paradigms. PERILUS XIV (1991).
Sundberg, J.; Lindblom, B.: Generative theories for describing musical structure; in Howell, West, Cross, Representing musical structure, pp. 245–272 (Academic Press, London 1991).
Willerman, R.; Lindblom, B.: The phonetics of pronouns; in Engstrand, Kylander, Dufberg, PERILUS XIII, Working papers in phonetics, pp. 19–23 (Department of Linguistics, Stockholm University, 1991).
Lindblom, B.: Phonological units as adaptive emergents of lexical development; in Ferguson, Menn, Stoel-Gammon, Phonological development: models, research, implications, pp. 131–163 (York Press, Parkton 1992).
Lindblom, B.: Role of phonetic content in phonology; in Dressler, Luschützky, Pfeiffer, Rennison, Phonologica 1988, 6th Int. Phonol. Meet., Cambridge 1992, pp. 181–196.
Lindblom, B.; Brownlee, S.; Davis, B.; Moon, S.-J.: Speech transforms. Speech Commun. 11: 357–368 (1992).
Lindblom, B.; Krull, D.; Stark, J.: Use of place and manner dimensions in the Superb-UPSID database: some patterns of in(ter)dependence. Fonetik ’92, pp. 39–42 (University of Gothenburg 1992).
Hura, S.; Lindblom, B.; Diehl, R.: Some evidence that perceptual factors shape assimilations. Fonetik ’92, pp. 71–74 (University of Gothenburg 1992).
Hura, S.L.; Lindblom, B.; Diehl, R.: On the role of perception in shaping phonological assimilation rules. Lang. Speech 35: 59–72 (1992).
Krull, D.; Lindblom, B.: Comparing vowel formant data cross-linguistically. Fonetik ’92, pp. 51–55 (University of Gothenburg 1992).
Kuhl, P.K.; Williams, K.A.; Lacerda, F.; Stevens, K.N.; Lindblom, B.: Linguistic experience alters phonetic perception in infants by 6 months of age. Science 255: 606–608 (1992).
Sundberg, J.; Lindblom, B.; Liljencrants, J.: Formant frequency estimates for abruptly changing area functions: a comparison between calculations and measurements. J. Acoust. Soc. Am. 91: 3478–3482 (1992).
MacNeilage, P.F.; Studdert-Kennedy, M.; Lindblom, B.: Hand signals. Sciences 1993: 26–37.
Lindblom, B.; Krull, D.; Stark, J.: Phonetic systems and phonological development; in de Boysson-Bardies, de Schonen, Jusczyk, MacNeilage, Morton, Developmental neurocognition: speech and face processing in the first year of life, pp. 399–409 (Kluwer, Dordrecht 1993).
Carré, R.; Lindblom, B.; MacNeilage, P.: Acoustic contrast and the origin of the human vowel space. J. Acoust. Soc. Am. 95: 2924 (1994).
Davis, B.L.; Lindblom, B.: Some acoustic properties of baby talk and the prototype effect in infant speech perception; in Rolf, Jonker, Wind, Studies in language origins, vol. 3, pp. 45–53 (Benjamins, Philadelphia 1994).
Keating, P.A.; Lindblom, B.; Lubker, J.; Kreiman, J.: Variability in jaw height for segments in English and Swedish VCVs. J. Phonet. 22: 407–422 (1994).
Löfqvist, A.; Lindblom, B.: Speech motor control. Curr. Opin. Neurobiol. 4: 823–826 (1994).
Molis, M.R.; Lindblom, B.; Castleman, W.; Carré, R.: Cross-language analysis of VCV coarticulation. J. Acoust. Soc. Am. 95: 2925 (1994).
Moon, S.-J.; Lindblom, B.: Interaction between duration, context and speaking style in English stressed vowels. J. Acoust. Soc. Am. 96: 40–55 (1994).
Lindblom, B.: A view of the future of phonetics; in Bloothooft, Hazan, Huber, Llisterri, Eur. Stud. Phonet. Speech Commun. (CIP-Gegevens Koninklijke Bibliotheek, den Haag 1995).
Lindblom, B.; Guion, S.; Hura, S.; Moon, S.-J.; Willerman, R.: Is sound change adaptive? Rivista di Linguistica 7: 5–37 (1995).
Carré, R.; Lindblom, B.; Divenyi, P.: The role of transition velocity in the perception of V1V2 complexes; in Elenius, Branderud, Proc. ICPhS 95, Stockholm 1995, vol. 2, pp. 258–261.
Carré, R.; Lindblom, B.; MacNeilage, P.: Rôle de l’acoustique dans l’évolution du conduit vocal humain [The role of acoustics in the evolution of the human vocal tract]. C.r. Acad. Sci. Paris, tome 320, série IIb (1995).
Krull, D.; Lindblom, B.; Shia, B.-E.; Fruchter, D.: Cross-linguistic aspects of coarticulation: an acoustic and electropalatographic study of dental and retroflex consonants; in Elenius, Branderud, Proc. ICPhS 95, Stockholm 1995, vol. 3, pp. 436–439.
Moon, S.-J.; Lindblom, B.; Lame, J.: A perceptual study of reduced vowels in clear and casual speech; in Elenius, Branderud, Proc. ICPhS 95, Stockholm 1995, vol. 2, pp. 670–677.
Lindblom, B.: Role of articulation in speech perception: clues from production. J. Acoust. Soc. Am. 99: 1683–1692 (1996).
Lindblom, B.: Vowel reduction as formant undershoot. Proc. Forum Acusticum 1996, ACUSTICA. Acta acust. 82: suppl. 1, p. 128 (1996).
Lindblom, B.: Approche intégrée de la production et de la perception [An integrated approach to production and perception]; in Méloni, Fondements et perspectives en traitement automatique de la parole, Universités Francophones, AUPELF UREF, pp. 9–17 (Hachette, Paris 1996).
Lindblom, B.; Brownlee, S.A.; Lindgren, R.: Formant undershoot and speaking styles: an attempt to resolve some controversial issues; in Kohler, Sound patterns of connected speech: description, models and explanation. Symp. Kiel 1996, pp. 119–129.
Engstrand, O.; McAllister, R.; Lindblom, B.: Investigating the ‘trough’: vowel dynamics and aerodynamics. J. Acoust. Soc. Am. 100: 2659–2660 (1996).
Krull, D.; Lindblom, B.: Coarticulation in apical consonants: acoustic and articulatory analyses of Hindi, Swedish and Tamil. Fonetik 96, Swed. Phonet. Conf. Speech Transm. Lab., Q. Prog. Status Rep. No. 2, pp. 73–76 (1996).
Diehl, R.L.; Lindblom, B.; Hoemeke, K.A.; Fahey, R.: On explaining certain male-female differences in the phonetic realization of vowel categories. J. Phonet. 24: 187–208 (1996).
Stark, J.; Lindblom, B.; Sundberg, J.: APEX – an articulatory synthesis model for experimental and computational studies of speech production. Fonetik 96, Swed. Phonet. Conf. Speech Transm. Lab., Q. Prog. Status Rep. No. 2, pp. 45–48 (1996).
Lindblom, B.: Talet tar form [Speech takes shape]; in Söderbergh, Från joller till läsning och skrivning, pp. 11–32 (1997).
Lindblom, B.; Stark, J.; Sundberg, J.: From sound to vocal gesture: learning to (co)articulate with APEX. Fonetik ’97, Phonum 4, pp. 37–40 (Umeå University 1997).
Lindblom, B.; Lacerda, F.: Hur barn lär sig tala [How children learn to speak]. KS lär ut, lecture series (Karolinska Institutet, Solna 1997).
Chennoukh, S.; Carré, R.; Lindblom, B.: Locus equations in the light of articulatory modeling. J. Acoust. Soc. Am. 102: 2380–2389 (1997).
Engstrand, O.; Lindblom, B.: The locus line: does aspiration affect its steepness? Fonetik ’97, Phonum 4, pp. 101–104 (Umeå University 1997).
Lacerda, F.; Lindblom, B.: Modeling the early stages of language acquisition; in Olofsson, Strömqvist, Cross-linguistic studies of dyslexia and early language development. Eur. Commission, Social Sci., COST A8: 14–21 (1997).
Lindblom, B.: Making sense of the infinite variety of natural speech patterns. SPOSS, ESCA Workshop on Spontaneous Speech, organized by Danielle Duez and the Laboratoire «Parole et Langage», La Baume-lès-Aix 1998.
Lindblom, B.: Systemic constraints and adaptive change in the formation of sound structure; in Hurford, Aitchison, Knight, Evolution of human language, social and cognitive bases, pp. 242–264 (Cambridge University Press, Cambridge 1998).
Lindblom, B.: An articulatory perspective on the ‘locus equation’; commentary on target article by Sussman, Fruchter, Hilbert, Sirosh, Linear correlates in the speech signal: the orderly output constraint. Behav. Brain Sci. 21: 241–299 (1998).
Lindblom, B.: A curiously ubiquitous articulatory movement; commentary on target article by MacNeilage, The frame/content theory of evolution of speech production. Behav. Brain Sci. 21: 499–546 (1998).
Lindblom, B.; Davis, J.: Calculating and measuring the energy costs of speech movements. Fonetik 98, Swed. Phonet. Conf., Stockholm 1998.
Branderud, P.; Lundberg, H.-J.; Lander, J.; Djamshidpey, H.; Wäneland, I.; Krull, D.; Lindblom, B.: X-ray analyses of speech: methodological aspects. Fonetik 98, Swed. Phonet. Conf., Stockholm 1998.
Ericsdotter, C.; Lindblom, B.; Stark, J.; Sundberg, J.: The coarticulation of coronal stops – a modeling study. J. Acoust. Soc. Am. 104: 1804 (1998).
Lacerda, F.; Lindblom, B.: Some remarks on the ‘Tallal transform’ in the light of emergent phonology; in von Euler, Lundberg, Llinás, Basic neural mechanisms in cognition and language with special reference to phonological problems in dyslexia, pp. 197–222 (Wenner-Gren Foundation/Rodin Remediation Academy/Elsevier, Amsterdam 1998).
Stark, J.; Ericsdotter, C.; Lindblom, B.; Sundberg, J.: Using X-ray data to calibrate the APEX synthesis. Fonetik 98, Swed. Phonet. Conf., Stockholm 1998.
Stark, J.; Ericsdotter, C.; Lindblom, B.; Sundberg, J.: The APEX model: from articulatory positions to sound. J. Acoust. Soc. Am. 104: 1820 (1998).
Lindblom, B.: Emergent phonology. Proc. 25th Annu. Meet. Berkeley Ling. Soc., University of California, Berkeley 1999.
Ericsdotter, C.; Lindblom, B.; Stark, J.: Articulatory coordination in coronal stops: implications for theories of coarticulation. Proc. 14th ICPhS, San Francisco 1999.
Davis, B.L.; Lindblom, B.: Prototype formation in speech development and phonetic variability in baby talk; in Lacerda, von Hofsten, Heiman, Emerging cognitive abilities in early infancy (Erlbaum, Hillsdale 2000).
Diehl, R.L.; Lindblom, B.: Explaining the structure of feature and phoneme inventories; in Greenberg, Ainsworth, Popper, Fay, Speech processing in the auditory system (Springer, New York 2000).
Lindblom, B.; Davis, J.H.; Brownlee, S.A.; Moon, S.-J.; Simpson, Z.: Energetics in phonetics: a preliminary look; in Fujimura, Honda, Palek, Linguistics and phonetics (Ohio State University, Columbus 2000).
Lindblom, B.: Developmental origins of adult phonology: the interplay between phonetic emergents and evolutionary adaptations; in Kohler, Diehl, Engstrand, Kingston (eds): Studies in speech communication and language development, dedicated to Björn Lindblom on his 65th birthday. Phonetica 57 (2000).
Reviews and Miscellanea
Lindblom, B.: Articulatory and acoustic studies of human speech production. Report to the Wenner-Gren Foundation for Anthropological Research (New York 1964).
Lindblom, B.: Review of Form and substance: papers presented to Eli Fischer-Jørgensen. J. Acoust. Soc. Am. 53: 1459 (1973).
Lindblom, B.: Kan musiken liknas vid ett språk? [Can music be likened to a language?] (Review of Bernstein, L.: The unanswered question, Harvard University Press, Cambridge 1973.) Dagens Nyheter, Oct. 16 (1977).
Lindblom, B.: A teleological look at the biology of speech. Report to the summer school in psycholinguistics (Mullsjö 1979).
Lindblom, B.: Språkliga informationsprocesser – en forskningsöversikt [Linguistic information processes – a research overview]. FRN Rep. (Stockholm 1980).
Lindblom, B.: Genetiskt program bakom individens språk [A genetic program behind the individual's language] (Review of Bickerton, D.: Roots of language, Karoma, Ann Arbor 1981). Svenska Dagbladet, Febr. 2 (1983).
Lindblom, B.: Tvärvetenskap – på gott och ont [Interdisciplinarity – for better or worse]; in Ahlsén, Allwood, Hjelmqvist, Tal, Ljud och Hörsel 2, pp. 134–154 (Gothenburg University, 1985).
Lindblom, B.; Perkell, J.S.; Klatt, D.H.: Preface to Perkell, Klatt, Invariance and variability of speech processes (Erlbaum, Hillsdale 1986).
Lindblom, B.: Fonetik [Phonetics]. Nationalencyklopedin, Bra Böcker (Svenska Akademin, 1987).
Lindblom, B.: Models of phonetic variation and selection: prepublication draft. Conf. on Lang. Change and Biol. Evolution, Institute for Scientific Interchange, Torino, 1990; PERILUS XI, pp. 65–100.
Lindblom, B.: A model of phonetic variation and selection and the evolution of vowel systems; in Wang, Language transmission and change, Conf. Center for Advanced Study in the Behav. Sci., Stanford (unpublished).
Lindblom, B.: Fonetik och talteknologi [Phonetics and speech technology]. Speech Technol. Semin. (Department of Speech Communication and Music Acoustics, KTH, Stockholm 1994).
Lindblom, B.: Speech: biological bases; in Asher, Encyclopedia of language and linguistics, pp. 4161–4168 (Pergamon, New York 1994).
Lindblom, B.: Evolutionsteorins frätande syra når språkvetenskapen [The corrosive acid of evolutionary theory reaches linguistics]. Svenska Dagbladet, May 17 (1996).
Welin, C.-W.; Lindblom, B.: SAGT – Studier av Aktiv Grammatikanvändning i Talförståelse [Studies of active grammar use in speech comprehension]. Report on project sponsored by Riksbanksfonden (1978).
Index autorum Vol. 57, 2000
(L) = Libri, (R) = Recensores
Absalom, M. 73 (R, issue 1)
Anderson, F.C. 219
Byrd, D. 3 (issue 1)
Carré, R. 152
Davis, B.L. 229, 284
Di Cristo, A. 70 (L, issue 1)
Diehl, R.L. 267
Divenyi, P.L. 152
Ericsdotter Bresin, C. 83
Fant, G. 113
Fernald, A. 242
Fitch, W.T. 205
Fujimura, O. 128
Hajek, J. 73 (L, issue 1)
Hirst, D. 70 (L, issue 1)
Holt, L.L. 170
Jacobsen, B. 40 (issue 1)
Jassem, W. 70 (R, issue 1)
Johnson, K. 181
Kingston, J. 1 (issue 1)
Kluender, K.R. 170
Kohler, K.J. 1 (issue 1), 85
Kruckenberg, A. 113
Lacerda, F. 83
Liberman, A.M. 68
Liljencrants, J. 113
Lindblom, B. 297
Lotto, A.J. 189
MacNeilage, P.F. 229, 284
Pandy, M.G. 219
Peeters, W.J.M. 17 (issue 1)
Schouten, M.E.H. 17 (issue 1)
Sjölander, S. 197
Stevens, K.N. 139
Studdert-Kennedy, M. 275
Sundberg, J. 95
Velleman, S.L. 255
Vihman, M.M. 255
Contents Vol. 57, 2000
No. 1

001 Editorial Kohler, K.J. (Kiel); Kingston, J. (Amherst, Mass.)

Original Papers
003 Articulatory Vowel Lengthening and Coordination at Phrasal Junctures Byrd, D. (Los Angeles, Calif./New Haven, Conn.)
017 Searching for an Explanation for Diphthong Perception: Dynamic Tones and Dynamic Spectral Profiles Schouten, M.E.H.; Peeters, W.J.M. (Utrecht)
040 The Question of ‘Stress’ in West Greenlandic. An Acoustic Investigation of Rhythmicization, Intonation, and Syllable Weight Jacobsen, B. (Copenhagen)

068 Obituary
070 Libri
075 Publications Received for Review

No. 2–4

083 Foreword

Acoustic Patterning of Speech. Its Linguistic and Physiological Bases
085 Investigating Unscripted Speech: Implications for Phonetics and Phonology Kohler, K.J. (Kiel)
095 Emotive Transforms Sundberg, J. (Stockholm)
113 The Source-Filter Frame of Prominence Fant, G.; Kruckenberg, A.; Liljencrants, J. (Stockholm)
128 The C/D Model and Prosodic Control of Articulatory Behavior Fujimura, O. (Columbus, Ohio)
139 Diverse Acoustic Cues at Consonantal Landmarks Stevens, K.N. (Cambridge, Mass.)

Perceptual Processing
152 Modeling and Perception of ‘Gesture Reduction’ Carré, R. (Paris); Divenyi, P.L. (Martinez, Calif.)
170 General Auditory Processes Contribute to Perceptual Accommodation of Coarticulation Holt, L.L. (Pittsburgh, Pa.); Kluender, K.R. (Madison, Wisc.)
181 Adaptive Dispersion in Vowel Perception Johnson, K. (Columbus, Ohio)
189 Language Acquisition as Complex Category Formation Lotto, A.J. (Chicago, Ill.)

Biology of Communication and Motor Processes
197 Singing Birds, Playing Cats, and Babbling Babies: Why Do They Do It? Sjölander, S. (Linköping)
205 The Phonetic Potential of Nonhuman Vocal Tracts: Comparative Cineradiographic Observations of Vocalizing Animals Fitch, W.T. (Cambridge, Mass.)
219 Dynamic Simulation of Human Movement Using Large-Scale Models of the Body Pandy, M.G.; Anderson, F.C. (Austin, Tex.)

En Route to Adult Spoken Language. Language Development
229 An Embodiment Perspective on the Acquisition of Speech Perception Davis, B.L.; MacNeilage, P.F. (Austin, Tex.)
242 Speech to Infants as Hyperspeech: Knowledge-Driven Processes in Early Word Recognition Fernald, A. (Stanford, Calif.)
255 The Construction of a First Phonology Vihman, M.M. (Bangor); Velleman, S.L. (Amherst, Mass.)

Auditory Constraints on Sound Structures
267 Searching for an Auditory Description of Vowel Categories Diehl, R.L. (Austin, Tex.)

Commentaries
275 Imitation and the Emergence of Segments Studdert-Kennedy, M. (New Haven, Conn.)
284 Deriving Speech from Nonspeech: A View from Ontogeny MacNeilage, P.F.; Davis, B.L. (Austin, Tex.)
297 Developmental Origins of Adult Phonology: The Interplay between Phonetic Emergents and the Evolutionary Adaptations of Sound Patterns Lindblom, B. (Stockholm)

315 Publications Björn Lindblom
322 Author Index Vol. 57, 2000