Table of Contents

Preface
Acknowledgments
Ch. 1  Statistical Methods and Linguistics
Ch. 2  Qualitative and Quantitative Models of Speech Translation
Ch. 3  Study and Implementation of Combined Techniques for Automatic Extraction of Terminology
Ch. 4  Do We Need Linguistics When We Have Statistics? A Comparative Analysis of the Contributions of Linguistic Cues to a Statistical Word Grouping System
Ch. 5  The Automatic Construction of a Symbolic Parser via Statistical Techniques
Ch. 6  Combining Linguistic with Statistical Methods in Automatic Speech Understanding
Ch. 7  Exploring the Nature of Transformation-Based Learning
Ch. 8  Recovering from Parser Failures: A Hybrid Statistical and Symbolic Approach
Contributors
Index
Preface
The chapters in this book come out of a workshop held at the 32nd Annual Meeting of the Association for Computational Linguistics, at New Mexico State University in Las Cruces, New Mexico, on 1 July 1994. The purpose of the workshop was to provide a forum in which to explore combined symbolic and statistical approaches in computational linguistics.

To many researchers, the mere notion of combining approaches to the study of language seems anathema. Indeed, in the past it has appeared necessary to choose between two radically different research agendas, studying two essentially different kinds of data. On the one hand, we find cognitively motivated theories of language in the tradition of generative linguistics, with introspective data as primary evidence. On the other, we find approaches motivated by empirical coverage, with collections of naturally occurring data as primary evidence. Each approach has its own kinds of theory, methodology, and criteria for success.

Although underlying philosophical differences go back much further, the genesis of generative grammar in the late 1950s and early 1960s drew attention to the issues of concern in this book. At that time, there was a thriving quantitative linguistics community, in both the United States and Europe, that had originated following World War II in the surge of development of sophisticated quantitative approaches to scientific problems [Shannon and Weaver, 1949]. These quantitative approaches were built on the foundation of observable data as the primary source of evidence. The appearance of generative grammar [Chomsky, 1957], with its emphasis on intuitive grammaticality judgments, led to confrontation with the existing quantitative approach, and the rift between the two communities, arising from firmly held opinions on both sides, prevented productive interaction. Computational approaches to language grew up during this feud, with much of computational linguistics dominated by the theoretical perspective of generative grammar, hostile to quantitative methods,
and much of the speech community dominated by statistical information theory, hostile to theoretical linguistics. Although a few natural language processing (NLP) groups persisted in taking a probabilistic approach in the 1970s and 1980s, the rule-governed, theory-driven approach dominated the field, even among the many industrial teams working on NLP (e.g. [Woods and Kaplan, 1971; Petrick, 1971; Grosz, 1983]). The influence of the linguists' generative revolution on NLP projects was overwhelming. Statistical or even simply quantitative notions survived in this environment only as secondary considerations, included for the purpose of optimization but rarely thought of as an integral part of a system's core design. At the same time, speech processing grew more mature, building on an information-theoretic tradition that emphasized the induction of statistical models from training data (e.g. [Bahl et al., 1983; Flanagan, 1972]).

For quite some time, the two communities continued with little to say to each other. However, in the late 1980s and early 1990s, the field of NLP underwent a radical shift. Fueled partly by the agenda of the Defense Advanced Research Projects Agency (DARPA), a major source of American funding for both speech and natural language processing, and partly by the dramatic increase worldwide in the availability of electronic texts, the two communities found themselves in close contact. The result for computational linguists was that long-standing problems in their domain (for example, identifying the syntactic category of the words in a sentence, or resolving prepositional phrase ambiguity in parsing) were tackled using the same sorts of statistical methods prevalent in speech recognition, often with some success. The specific techniques varied, but all were founded upon the idea of inducing the knowledge necessary to solve a problem by statistically analyzing large corpora of naturally occurring text, rather than building in such knowledge in the form of symbolic rules.

Initially, the interest in corpus-based statistical methods rekindled all the old controversies: rationalist vs. empiricist philosophies, theory-driven vs. data-driven methodologies, symbolic vs. statistical techniques (e.g., see discussion in [Church and Mercer, 1993]). The Balancing Act workshop we held in 1994 was planned when the rhetoric was at its height, at a time when it seemed to us that, even if some people were working on common ground, not enough people were talking about it. The field of computational linguistics is now settling down somewhat: for the most part, researchers have become less unwaveringly adversarial over ideological questions, and have instead begun to focus on the search for a coherent combination of approaches.

Why have things changed? First, there is an increasing realization, within each community, that achieving core goals may require expertise possessed by
the other. Quantitative approaches add robustness and coverage to traditionally brittle and narrow symbolic natural language systems, permitting, for example, the automated or semiautomated acquisition of lexical knowledge (e.g., terminology, names, translation equivalents). At the same time, quantitative approaches are critically dependent on underlying assumptions about the nature of the data, and more people are concluding that pushing applications to the next level of performance will require quantitative models that are linguistically better informed; inductive statistical methods perform better in the face of limited data when they are biased with accurate prior knowledge.

A second source of change is the availability of critical computational resources that were not widely available when quantitative methods were last in vogue. Fast computers, cheap disk space, CD-ROMs for distributing data, and funded data-collection initiatives have become the rule rather than the exception. The Brown Corpus of American English, Francis and Kucera's landmark project of the 1960s [Kucera and Francis, 1967], now has companions that are larger, that are annotated in more linguistic detail, and that consist of data from multiple languages (e.g. [ICAME, 1996; LDC, 1996]).

Third, there is a general push toward applications that work with language in a broad, real-world context, rather than within the narrow domains of traditional symbolic NLP systems. With the advent of such broad-coverage applications, language technology is positioned to help satisfy some real demands in the marketplace: large-vocabulary speech recognition has become a daily part of life for many people unable to use a computer keyboard [Wilpon, 1994], rough automatic translation of unrestricted text is finding its way into on-line services, and locating full-text information on the World Wide Web has become a priority [Foley and Pitkow, 1994]. Applications of this kind are faced with unpredictable input from users who are unfamiliar with the technology and its limitations, which makes their tasks harder; on the other hand, users are adjusting to less than perfect results. All these considerations (coverage, robustness, acceptability of graded performance) call out for systems that take advantage of large-scale quantitative methods.

Finally, the resurgence of interest in methods grounded in empiricism is partly the result of an intellectual pendulum swinging back in that direction. Thus, even independent of applications, we see more of a focus, on the scientific side of computational linguistics, on the properties of naturally occurring data, the objective and quantitative evaluation of hypotheses against such data, and the construction of models that explicitly take variability and uncertainty into account. Developments of this kind also have parallels in related areas such as sociolinguistics [Sankoff, 1978] and psycholinguistics; in the latter
field, for example, probabilistic models of on-line sentence processing treat frequency effects, and more generally, weighted probabilistic interactions, as fundamental to the description of on-line performance, in much the same way that empiricists see the probabilistic nature of language as fundamental to its description (e.g. [Tabossi et al., 1992]; also see [Ferstl, 1993]). This book focuses on the trend toward empirical methods as it bears on the engineering side of NLP, but we believe that trend will also continue to have important implications for the study of language as a whole.

The premise of this book is that there is no necessity for a polar division. Indeed, one of our goals in this book is to change that perception. We hold that there is in fact no contradiction or defection involved in combining approaches. Rather, combining "symbolic" and "statistical" approaches to language is a kind of balancing act in which the symbolic and the statistical are properly thought of as parts, both essential, of a unified whole.

The complementary nature of the contribution of these seemingly discrepant approaches is not as contradictory as it seems. An obvious fact that is often forgotten is that every use of statistics is based upon a symbolic model. No matter what the application, statistics are founded upon an underlying probability model, and that model is, at its core, symbolic and algebraic rather than continuous and quantitative. For language, in particular, the natural units of manipulation in any statistical model are discrete constructs such as phoneme, morpheme, word, and so forth, as well as discrete relationships among these constructs such as surface adjacency or predicate-argument relationships. Regardless of the details of the model, the numerical probabilities are simply meaningless except in the context of the model's symbolic underpinnings. On this view, there is no such thing as a "purely statistical" method. Even hidden Markov models, the exemplar of statistical methods inherited from the speech community, are based upon an algebraic description of language that amounts to an assumption of finite-state generative capacity. Conversely, symbolic underpinnings alone are not enough to capture the variability inherent in naturally occurring linguistic data, its resistance to inflexible, terse characterizations. In short, the essence of the balancing act can be found in the opening chapters of any elementary text on probability theory: the core, symbolic underpinnings of a probability model reflect those constraints and assumptions that must be built in, and form the basis for a quantitative side that reflects uncertainty, variability, and gradedness of preferences.

The aim of this book is to explore the balancing act that must take place when symbolic and statistical approaches are brought together. It contains foreshadowings of powerful partnerships in the making between the more
linguistically motivated approaches within the tradition of generative grammar and the more empirically driven approaches from the tradition of information theory. Research of this kind requires basic choices: What knowledge will be represented symbolically and how will it be obtained? What assumptions underlie the statistical model? What principles motivate the symbolic model? What is the researcher gaining by combining approaches? These questions, and the metaphor of the balancing act, provide a unifying theme to the contributions in this volume.

References
L. R. Bahl, F. Jelinek, and R. L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5:179-190.

Noam Chomsky. 1957. Syntactic Structures. The Hague, Mouton.

Kenneth W. Church and Robert Mercer. 1993. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1-24.

James L. Flanagan. 1972. Speech Analysis, Synthesis and Perception, 2nd edition. New York, Springer-Verlag.

Evelyn Ferstl. 1993. The role of lexical information and discourse context in syntactic processing: a review of psycholinguistic studies. Cognitive Science Technical Report 93-03, University of Colorado at Boulder.

Jim Foley and James Pitkow, editors. 1994. Research Priorities for the World-Wide Web: Report of the NSF workshop sponsored by the Information, Robotics, and Intelligent Systems Division. National Science Foundation, October 1994.

Barbara Grosz. 1983. TEAM: a transportable natural language interface system. In Proceedings of the Conference on Applied Natural Language Processing. Association for Computational Linguistics, Morristown, N.J., February 1983.

ICAME. 1996. ICAME corpus collection. World Wide Web page. http://nora.hd.uib.no/corpora.html.

H. Kucera and W. Francis. 1967. Computational Analysis of Present-Day American English. Providence, R.I., Brown University Press.

LDC. 1996. Linguistic Data Consortium (LDC) home page. World Wide Web page, June 1996. http://www.cis.upenn.edu/~ldc/.

Stanley R. Petrick. 1971. Transformational analysis. In Randall Rustin, editor, Natural Language Processing. New York, Algorithmics Press.

David Sankoff. 1978. Linguistic Variation: Models and Methods. New York, Academic Press.

Claude E. Shannon and Warren Weaver. 1949. The Mathematical Theory of Communication. Urbana, University of Illinois Press.
Patrizia Tabossi, Michael Spivey-Knowlton, Ken McRae, and Michael Tanenhaus. 1992. Semantic effects on syntactic ambiguity resolution: evidence for a constraint-based resolution process. Attention and Performance, 15:598-615.

Jay G. Wilpon. 1994. Applications of voice-processing technology in telecommunications. In David B. Roe and Jay G. Wilpon, editors, Voice Communications Between Humans and Machines. Washington, D.C., National Academy of Sciences, National Academy Press.

W. A. Woods and R. Kaplan. 1971. The lunar sciences natural language information system. Technical Report 2265. Cambridge, Mass., Bolt, Beranek, and Newman.
Acknowledgments
Since time is the one immaterial object which we cannot influence (neither speed up nor slow down, add to nor diminish), it is an imponderably valuable gift.
   - Maya Angelou, Wouldn't Take Nothing for My Journey Now

Many people have given of their time in the preparation of this book, from its first stages as a workshop to its final stages of publication. The first set of people to thank are those who submitted papers to the original workshop. We had an overwhelming number of very high-quality papers submitted, and regretted not being able to run a longer workshop on the topic. The issue of combining approaches is clearly part of the research agenda of many computational linguists, as demonstrated by these submissions. Those authors who have chapters in this book have revised them several times in response to two rounds of reviews, and we are grateful to them for their efforts. Our anonymous reviewers for these chapters gave generously of their time (each article was reviewed several times, and each reviewer did a careful and thorough job), and although they are anonymous to the outside world, their generosity is known to themselves, and we thank them quietly. We also deeply acknowledge the time Amy Pierce of The MIT Press gave to us at many stages along the way. Her vision and insight have been a gift, and she has contributed to ensuring the excellent quality of the final set of chapters in the book.

We are grateful to the Association for Computational Linguistics (ACL) for its support and supportiveness, especially its financial commitment to the Balancing Act workshop, and doubly pleased that the attendance at the workshop helped us give something tangible back to the organization. The ACL 1994 conference and post-conference workshops were held at New Mexico State University, with local arrangements handled efficiently and graciously by Janyce Wiebe; we thank her for her time and generosity of spirit. We would also like to thank Sun Microsystems Laboratories for its supportiveness
throughout the process of organizing the workshop and putting together this book, especially Cookie Callahan of Sun Labs. Finally, we thank Richard Sproat and Evelyne Tzoukermann for their valuable discussions.

We have enjoyed the experience of working on this topic, since the challenge of combining ways of looking at the world intrigued each of us independently before we joined forces to run the workshop and subsequently edit this book. One of us has training in theoretical linguistics, and has slowly been converted to an understanding of the role of performance data even in linguistic theory. The other had formal education in computer science, with healthy doses of linguistics and psychology, and so arrived in the field with both points of view. We are finding that more and more of our colleagues are coming to believe that maintaining a balance among several approaches to language analysis and understanding is an act worth pursuing.
Chapter 1
Statistical Methods and Linguistics
Steven Abney
In general, the view of the linguist toward the use of statistical methods was shaped by the division that took place in the late 1950s and early 1960s between the language engineering community (e.g. [Yngve, 1954]) and the linguistics community (e.g. [Chomsky, 1964]). When Chomsky outlined the three levels of adequacy (observational, descriptive, and explanatory), much of what was in progress in the computational community of the time was labeled as either observational or descriptive, with relatively little or no impact on the goal of producing an explanatorily adequate theory of language. The computational linguist was said to deal just with performance, while the goal of linguistics is to understand competence. This point of view was highly influential then and persists to this day as a set of a priori assumptions about the nature of computational work on language.

Abney's chapter revisits and challenges these assumptions, with the goal of illustrating to the linguist what the rationale might be for the computational linguist in pursuing statistical analyses. He reviews several key areas of linguistics, specifically language acquisition, language change, and language variation, showing how statistical models reveal essential data for theory building and testing. Although these areas have typically used statistical modeling, Abney goes further by addressing the central figure of generative grammar: the adult monolingual speaker. He argues that statistical methods are of great interest even to the theoretical linguist, because the issues they bear on are in fact linguistic issues, basic to an understanding of human language. Finally, Abney defends the provocative position that a weighted grammar is the correct model for explanation of several central questions in linguistics, such as the nature of parameter setting and degrees of grammaticality. (Eds.)
In the space of the last 10 years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. In 1996, no one can profess to be a computational linguist without a passing knowledge of statistical methods. HMMs are as de rigueur as LR tables, and anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL banquet.

More seriously, statistical techniques have brought significant advances in broad-coverage language processing. Statistical methods have made real progress possible on a number of issues that had previously stymied attempts to liberate systems from toy domains: issues that include disambiguation, error correction, and the induction of the sheer volume of information requisite for handling unrestricted text. And the sense of progress has generated a great deal of enthusiasm for statistical methods in computational linguistics.

However, this enthusiasm has not been catching in linguistics proper. It is always dangerous to generalize about linguists, but I think it is fair to say that most linguists are either unaware of (and unconcerned about) trends in computational linguistics, or hostile to current developments. The gulf in basic assumptions is simply too wide, with the result that research on the other side can only seem naive, ill-conceived, and a complete waste of time and money.

In part the difference is a difference of goals. A large part of computational linguistics focuses on practical applications, and is little concerned with human language processing. Nonetheless, at least some computational linguists aim to advance our scientific understanding of the human language faculty by better understanding the computational properties of language. One of the most interesting and challenging questions about human language computation is just how people are able to deal so effortlessly with the very issues that make processing unrestricted text so difficult. Statistical methods provide the most promising current answers, and as a result the excitement about statistical methods is also shared by those in the cognitive reaches of computational linguistics.

In this chapter, I would like to communicate some of that excitement to fellow linguists, or at least, perhaps, to make it comprehensible. There is no denying that there is a culture clash between theoretical and computational linguistics that serves to reinforce mutual prejudices. In caricature, computational linguists believe that by throwing more cycles and more raw text into their statistical black box, they can dispense with linguists altogether, along with their fanciful Rube Goldberg theories about exotic linguistic phenomena. The linguist objects that, even if those black boxes make you oodles of money on speech recognizers and machine-translation programs (which they do not), they fail to advance our understanding. I will try to explain how statistical methods just might contribute to understanding of the sort that linguists are after.
This paper, then, is essentially an apology, in the old sense of apology. I wish to explain why we would do such a thing as to use statistical methods, and why they are not really such a bad thing, maybe not even for linguistics proper.
I think the most compelling, though least well-developed, arguments for statistical methods in linguistics come from the areas of language acquisition, language variation, and language change.

Language Acquisition   Under standard assumptions about the grammar, we would expect the course of language development to be characterized by abrupt changes, each time the child learns or alters a rule or parameter of the grammar. If, as seems to be the case, changes in child grammar are actually reflected in changes in relative frequencies of structures that extend over months or more, it is hard to avoid the conclusion that the child has a probabilistic or weighted grammar in some form. The form that would perhaps be least offensive to mainstream sensibilities is a grammar in which the child "tries out" rules for a time. During the trial period, both the new and old versions of a rule coexist, and the probability of using one or the other changes with time, until the probability of using the old rule finally drops to zero. At any given point, in this picture, a child's grammar is a stochastic (i.e., probabilistic) grammar.

An aspect of this little illustration that bears emphasizing is that the probabilities are added to a grammar of the usual sort. A large part of what is meant by "statistical methods" in computational linguistics is the study of stochastic grammars of this form: grammars obtained by adding probabilities in a fairly transparent way to "algebraic" (i.e., nonprobabilistic) grammars. Stochastic grammars of this sort do not constitute a rejection of the underlying algebraic grammars, but a supplementation. This is quite different from some uses to which statistical models (most prominently, neural networks) are put, in which attempts are made to model some approximation of linguistic behavior with an undifferentiated network, with the result that it is difficult or impossible to relate the network's behavior to a linguistic understanding of the sort embodied in an algebraic grammar. (It should, however, be pointed out that the problem with such applications does not lie with neural nets, but with the unenlightening way they are sometimes put to use.)
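To make the picture concrete, here is a minimal sketch (not from the original text) of what a grammar during such a trial period might look like in code: an ordinary rule with two competing variants, plus a single added probability that shifts gradually over time. The particular negation rules, the logistic curve, and the time scale are all invented for illustration.

```python
import math
import random

# An ordinary ("algebraic") grammar fragment: two competing variants of one rule.
OLD_RULE = "Neg -> V not"      # an earlier-acquired pattern (invented example)
NEW_RULE = "Neg -> do not V"   # the variant the child is "trying out" (invented example)

def p_new(t, midpoint=12.0, rate=0.6):
    """Probability of the new variant at time t (months into the trial period).

    A logistic curve: near 0 at the start, near 1 at the end, so the old
    variant's probability drops toward zero gradually rather than abruptly."""
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))

def choose_rule(t, rng=random):
    """Sample which variant the child uses at time t: a stochastic grammar in action."""
    return NEW_RULE if rng.random() < p_new(t) else OLD_RULE

if __name__ == "__main__":
    for month in (4, 8, 12, 16, 20):
        uses = [choose_rule(month) for _ in range(1000)]
        share_new = uses.count(NEW_RULE) / len(uses)
        print(f"month {month:2d}: P(new) = {p_new(month):.2f}, observed share = {share_new:.2f}")
```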
" " expectsomeoneto go down to the local pub oneevening, order Ale!, andbe servedan eel instead, becausethe Great Vowel Shift happenedto him a day too early.1 In fact, linguistic changesthat are attributed to rule changesor changes of parameter settings take place gradually, over considerable stretches of time measuredin decadesor centuries. It is more realistic to assumethat the languageof a speechcommunity is a stochasticcompositeof the languagesof the individual speakers , describedby a stochasticgrammar. " " In the stochastic community grammar, the probability of a given construction reflects the relative proportion of speakerswho use the constructionin question. Languagechangeconsistsin shifts in relative frequency of construction (rules, parametersettings, etc.) in the community. If we think of speechcommunities as populations of grammarsthat vary within certain bounds, and if we think of languagechangeas involving gradualshifts in the center of balanceof the grammarpopulation, then statisticalmodels are of immediateapplicability [Tabor, 1994] . In this picture, we might still continueto assumethat an adult monolingual es a particular algebraicgrammar, and that stochasticgrammars speakerpossess areonly relevantfor the descriptionof communitiesof varyinggrammars. However, we must at leastmake allowancefor the fact that individuals routinely comprehendthe languageof their community, with all its variance. This rather suggeststhat at leastthe grammarusedin languagecomprehensionis stochastic . I returnto this issuebelow.
Language Variation   There are two senses of language variation I have in mind here: dialectology, on the one hand, and typology, on the other. It has been suggested that some languages consist of a collection of dialects that blend smoothly one into the other, to the point that the dialects are more or less arbitrary points in a continuum. For example, Tait describes Inuit as "a fairly unbroken chain of dialects, with mutual intelligibility limited to proximity of contact, the furthest extremes of the continuum being unintelligible to each other" [Tait, 1994, p. 3]. To describe the distribution of Latin-American native languages, Kaufman defines a "language complex" as "a geographically continuous zone that contains linguistic diversity greater than that found within a single language ..., but where internal linguistic boundaries similar to those that separate clearly discrete languages are lacking" [Kaufman, 1994, p. 31]. The continuousness of change with geographic distance is consistent with the picture of a speech community with grammatical variance, as sketched above. With geographic distance,
the mix of frequency of usage of various constructions changes, and a stochastic grammar of some sort is an appropriate model [Kessler, 1995].

Similar comments apply in the area of typology, with a twist. Many of the universals of language that have been identified are statistical rather than absolute, including rough statements about the probability distribution of language features ("head-initial and head-final languages are about equally frequent") or conditional probability distributions ("postpositions in verb-initial languages are more common than prepositions in verb-final languages") [Hawkins, 1983, 1990]. There is as yet no model of how this probability distribution comes about, that is, how it arises from the statistical properties of language change. Which aspects of the distribution are stable, and which would be different if we took a sample of the world's languages 10,000 years ago or 10,000 years hence? There is now a vast body of mathematical work on stochastic processes and the dynamics of complex systems (which includes, but is not exhausted by, work on neural nets), much of which is of immediate relevance to these questions.

In short, it is plausible to think of all of these issues (language acquisition, language change, and language variation) in terms of populations of grammars, whether those populations consist of grammars of different speakers or sets of hypotheses a language learner entertains. When we examine populations of grammars varying within bounds, it is natural to expect statistical models to provide useful tools.
2   Adult Monolingual Speakers

But what about an adult monolingual speaker? Ever since Chomsky, linguistics has been firmly committed to the idealization to an adult monolingual speaker in a homogeneous speech community. Do statistical models have anything to say about language under that idealization?

In a narrow sense, I think the answer is probably not. Statistical methods bear mostly on all the issues that are outside the scope of interest of current mainstream linguistics. In a broader sense, though, I think that says more about the narrowness of the current scope of interest than about the linguistic importance of statistical methods. Statistical methods are of great linguistic interest because the issues they bear on are linguistic issues, and essential to an understanding of what human language is and what makes it tick. We must not forget that the idealizations that Chomsky made were an expedient, a way of managing the vastness of our ignorance. One aspect of language is its algebraic properties, but that is only one aspect of language, and certainly not the only important
aspect. Also important are the statistical properties of language communities. And stochastic models are also essential for understanding language production and comprehension, particularly in the presence of variation and noise. (I focus here on comprehension, though considerations of language production have also provided an important impetus for statistical methods in computational linguistics [Smadja, 1989, 1991].)

To a significant degree, I think linguistics has lost sight of its original goal, and turned Chomsky's expedient into an end in itself. Current theoretical syntax gives a systematic account of a very narrow class of data, judgments about the well-formedness of sentences for which the intended structure is specified, where the judgments are adjusted to eliminate gradations of goodness and other complications. Linguistic data other than structure judgments are classified as "performance" data, and the adjustments that are performed on structure-judgment data are deemed to be corrections for "performance effects." Performance is considered the domain of psychologists, or at least, not of concern to linguists.

The term performance suggests that the things that the standard theory abstracts away from or ignores are a natural class; they are data that bear on language processing but not language structure. But in fact a good deal that is labeled "performance" is not computational in any essential way. It is more accurate to consider performance to be negatively defined: it is whatever the grammar does not account for. It includes genuinely computational issues, but a good deal more that is not. One issue I would like to discuss in some detail is the issue of grammaticality and ambiguity judgments about sentences as opposed to structures. These judgments are no more or less computational than judgments about structures, but it is difficult to give a good account of them with grammars of the usual sort; they seem to call for stochastic, or at least weighted, grammars.

2.1   Grammaticality and Ambiguity

Consider the following:

(1) a. The a are of I
    b. The cows are grazing in the meadow
    c. John saw Mary

The question is the status of these examples with respect to grammaticality and ambiguity. The judgments here, I think, are crystal clear: (1a) is word salad, and (1b) and (c) are unambiguous sentences.

In point of fact, (1a) is a grammatical noun phrase, and (1b) and (c) are ambiguous, the nonobvious reading being as a noun phrase. Consider: an are is a measure of area, as in a hectare is a hundred ares, and letters of the alphabet
" may be usedasnounsin English( Writtenon the sheetwasa singlelowercase " " a, As describedin section2, paragraph b . . ." ). Thus ( 1a) hasa structurein which are andI are headnouns, anda is a modifier of are. This analysiseven becomesperfectlynaturalin the following scenario.Imaginewe aresurveyors, and that we havemappedout a pieceof land into large segments , designated with capitalletters, andsubdividedinto one-are subsegments , designatedwith lowercaseletters. ThenThea are of I is a perfectlynaturaldescriptionfor a particular parcelon our map. As for ( 1b), are is againthe headnoun, cowsis a premodifier, andgrazingin the meadowis a postmodifier. It might be objectedthat plural nounscannotbe nominalpremodifiers, but in fact they often are: considerthe bondsmarket, a securitiesexchange , he is vicepresidentandmediadirector, an in-homehealth care servicesprovider, Hartford ' s claims division, the financial -services industry, its line of systemsmanagementsoftware. (Severalof theseexamples areextractedfrom the Wall StreetJournal.) It mayseemthatexamples( la ) and(b) areillustrativeonly of a trivial andartificial problemthat arisesbecauseof a rare usageof a commonword. But the " " problemis not trivial: withoutanaccountof rareusage, we haveno way of distinguishing betweengenuineambiguitiesandthesespuriousambiguities.Alternatively one , might objectthat if onedoesnot know that are hasa readingasa noun, thenare is actuallyunambiguousin one' s idiolect, and ( la ) is genuinely . But in thatcasethequestionbecomeswhy a hectareis a hundred ungrammatical aresis notjudgedequallyungrammaticalby speakersof theidiolectin question. Further, ( lc ) illustratesthattherareusageis not anessentialfeatureof examples (a) and (b). Sawhasa readingasa noun, which may be lessfrequentthan the verb reading, but is hardly a rareusage. Propernounscanmodify (Gatling gun) and be modified by (TyphoidMary) commonnouns. Hence, John saw Mary hasa readingas a noun phrase, referring to the Mary who is associated with a kind of sawcalleda Johnsaw. It may be objectedthat constructionslike Gatling gun and TyphoidMary belongto the lexicon, not thegrammar, but howeverthat maybe, they arecompletely productive. I may not know whatCohenequations,theRussiahouse, or ' Abney sentencesare, but if not, then the denotataof Cohens equations, the ' Russianhouse, or thosesentencesof Abney s are surely equally unfamiliar.2 2. There are also syntactic groundsfor doubt about the assumptionthat noun-noun modificationbelongsto the lexicon. Namely. adjectivescanintervenebetweenthemodifying noun and the headnoun. (Examplesare given later in this section.) If adjective modification belongsto the syntax, and if thereare no discontinuouswords or movement of piecesof lexical items, thenat leastsomemodificationof nounsby nounsmust tllke placein the syntax.
Likewise I may not know who pegleg Pete refers to, or riverboat Sally, but that does not make the constructions any less grammatical or productive.

The problem is epidemic, and it snowballs as sentences grow longer. One often hears in computational linguistics about completely unremarkable sentences with hundreds of parses, and that is in fact no exaggeration. Nor is it merely the undesired consequence of having a poor grammar. If one examines the analyses, one finds that they are generally extremely implausible, and often do considerable violence to "soft" constraints like heaviness constraints or the number and sequence of modifiers, but no one piece of the structure is outright ungrammatical.

To illustrate, consider this sentence, drawn more or less at random from a book (Quine's Word and Object) drawn more or less at random from my shelf:

(2) In a general way such speculation is epistemologically relevant, as suggesting how organisms maturing and evolving in the physical environment we know might conceivably end up discoursing of abstract objects as we do [Quine, 1960, p. 123].

One of the many spurious structures this sentence might receive is the following:
(3) [Parse tree diagram, not reproduced here: a spurious analysis of sentence (2) as a noun phrase headed by the noun might, preceded by a PP, an AdjP, and an absolute construction as sentential premodifiers, and followed by multiple stacked postmodifiers.]
There are any number of criticisms one can direct at this structure, but I believe none of them are fatal. It might be objected that the PP-AdjP-Absolute sequence of sentential premodifiers is illegitimate, but each is individually fine, and there is no hard limit on stacking them. One can even come up with relatively good examples with all three modifiers, e.g. [PP on the beach] [AdjP naked
as jaybirds] [Absolute waves lapping against the shore] the wild boys carried out their bizarre rituals.

Another point of potential criticism is the question of licensing the elided sentence after how. In fact its content could either be provided from preceding context or from the rest of the sentence, as in though as yet unable to explain how, astronomers now know that stars develop from specks of grit in giant oysters. Might is taken here as a noun, as in might and right. The AP conceivably end up may be a bit mysterious: end up is here an adjectival, as in we turned the box end up. Abstract is unusual as a mass noun, but can in fact be used as one, as, for example, in the article consisted of three pages of abstract and only two pages of actual text.

One might object that the NP headed by might is bad because of the multiple postmodifiers, but in fact there is no absolute constraint against stacking nominal postmodifiers, and good examples can be constructed with the same structure: marlinespikes, business end up, sprinkled with tabasco sauce, can be a powerful deterrent against pigeons. Even the commas are not absolutely required. The strength of preference for them depends on how heavy the modifiers are: cf. strength judiciously applied increases the effectiveness of diplomacy, a cup of peanuts unshelled in the stock adds character.3

In short, the structure (3) seems to be best characterized as grammatical, though it violates any number of parsing preferences and is completely absurd. One might think that one could eliminate ambiguities by turning some of the dispreferences into absolute constraints. But attempting to eliminate unwanted readings that way is like squeezing a balloon: every dispreference that is turned into an absolute constraint to eliminate undesired structures has the unfortunate side effect of eliminating the desired structure for some other sentence. No matter how difficult it is to think up a plausible example that violates the constraint, some writer has probably already thought one up by accident, and we will improperly reject his sentence as ungrammatical if we turn the dispreference into an absolute constraint. To illustrate: if a noun is premodified by both an adjective and another noun, standard grammars require the adjective to come first, inasmuch as the noun adjoins to N but the adjective adjoins to N-bar. It is not easy to think up good examples that violate this constraint. Perhaps the reader would care to try before reading the examples in the footnote.4

3. Cf. this passage from Tolkien: "Their clothes were mended as well as their bruises, their tempers and their hopes. Their bags were filled with food and provisions light to carry but strong to bring them over the mountain passes." [Tolkien, 1966, p. 61]

4. Maunder climatic cycles, ice-core climatological records, a Kleene-star transitive closure, Precambrian era solar activity, highland igneous formations.
Not only is my absurd analysis (3) arguably grammatical, there are many, many equally absurd analyses to be found. For example, general could be a noun (the army officer) instead of an adjective, or evolving in could be analyzed as a particle verb, or the physical could be a noun phrase (a physical exam), not to mention various attachment ambiguities for coordination and modifiers, giving a multiplicative effect. The consequence is considerable ambiguity for a sentence that is perceived to be completely unambiguous.

Now perhaps it seems that I am being perverse, and I suppose I am. But it is a perversity that is implicit in grammatical descriptions of the usual sort, and it emerges unavoidably as soon as we systematically examine the structures that the grammar assigns to sentences. Either the grammar assigns too many structures to sentences like (2), or it incorrectly predicts that examples like three pages of abstract or a cup of peanuts unshelled in the stock have no well-formed structure.

To sum up, there is a problem with grammars of the usual sort: their predictions about grammaticality and ambiguity are simply not in accord with human perceptions. The problem of how to identify the correct structure from among the in-principle possible structures provides one of the central motivations for the use of weighted grammars in computational linguistics. A weight is assigned to each aspect of structure permitted by the grammar, and the weight of a particular analysis is the combined weight of the structural features that make it up. The analysis with the greatest weight is predicted to be the perceived analysis for a given sentence.
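As an illustration of the mechanism just described, the following sketch (mine, not the author's) assigns a weight to each structural feature and predicts the highest-weighted analysis as the perceived one. The particular features and weight values are invented and stand in for whatever a real weighted grammar would provide.

```python
# Hypothetical structural features and weights; a real weighted grammar would
# supply these, and the weights would be estimated rather than hand-set.
FEATURE_WEIGHTS = {
    "sentential_clause": 0.0,          # unremarkable structure: no penalty
    "np_utterance": -1.0,              # bare-NP utterance: mildly dispreferred
    "plural_noun_premodifier": -2.0,   # e.g. "cows" modifying "are"
    "rare_noun_reading": -3.0,         # e.g. "are" as a unit of area
    "stacked_postmodifiers": -1.5,
}

def score(features):
    """Combined weight of an analysis = sum of the weights of its features."""
    return sum(FEATURE_WEIGHTS[f] for f in features)

def perceived(analyses):
    """Return the highest-weighted analysis: the one predicted to be perceived."""
    return max(analyses, key=lambda a: score(a["features"]))

# Two candidate structures for "The cows are grazing in the meadow".
candidates = [
    {"label": "S: cows graze in the meadow", "features": ["sentential_clause"]},
    {"label": "NP: an 'are' (unit of area) modified by 'cows'",
     "features": ["np_utterance", "plural_noun_premodifier", "rare_noun_reading"]},
]
best = perceived(candidates)
print(best["label"], score(best["features"]))  # the sentential reading wins
```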
Before describing in more detail how weighted grammars contribute to a solution to the problem, though, let me address an even more urgent issue: is this even a linguistic problem?

2.2   Is This Linguistics?

Under the usual assumptions, the fact that the grammar predicts grammaticality and ambiguity where none is perceived is not a linguistic problem. The usual opinion is that perception is a matter of performance, and that grammaticality alone does not predict performance; we must also include nonlinguistic factors like plausibility and parsing preferences and maybe even probabilities.

Grammaticality and Acceptability   The implication is that perceptions of grammaticality and ambiguity are not linguistic data, but performance data. This stance is a bit odd: are not grammaticality judgments perceptions? And what do we mean by "performance data"? It would be one thing if we were
talking about data that clearly have to do with the course of linguistic computation, data like response times and reading times, or regressive eye movement frequencies, or even more outlandish things like positron emission tomographic scans or early receptor potential traces. But human perceptions (judgments, intuitions) about grammaticality and ambiguity are classic linguistic data. What makes the judgments concerning examples (1a-c) performance data?

All linguistic data are the result of little informal psycholinguistic experiments that linguists perform on themselves, and the experimental materials are questions of the form "Can you say this?" "Does this mean this?" "Is this ambiguous?" "Are these synonymous?"

Part of the answer is that the judgments about examples (1a-c) are judgments about sentences alone rather than about sentences with specified structures. The usual sort of linguistic judgment is a judgment about the goodness of a particular structure, and example sentences are only significant as bearers of the structure in question. If any choice of words and any choice of context can be found that makes for a good sentence, the structure is deemed to be good. The basic data are judgments about structured sentences in context, that is, sentences plus a specification of the intended structure and intended context, but these basic data are used only grouped in sets of structured contextualized sentences having the same (possibly partial) structure. Such a set is defined to be good just in case any structured contextualized sentence it contains is judged to be good. Hence a great deal of linguists' time is spent in trying to find some choice of words and some context to get a clear positive judgment, to show that a structure of interest is good.

As a result, there is actually no intent that the grammar predict (that is, generate) individual structured-sentence judgments. For a given structured sentence, the grammar only predicts whether there is some sentence with the same structure that is judged to be good. For the examples (1), then, we should say that the structure

[NP the [N a] [N are] [PP of [N I]]]

is indeed grammatical in the technical sense, since it is acceptable in at least one context, and since every piece of the structure is attested in acceptable sentences.

The grouping of data by structure is not the only way that standard grammars fail to predict acceptability and ambiguity judgments. Judgments are rather smoothly graded, but goodness according to the grammar is all-or-nothing. Discrepancies between grammar and data are ignored if they involve sentences containing center embedding, parsing preference violations,
" garden-patheffects, or in generalif their badnesscanbe ascribedto processing " S complexity. Grammar and Computation The differencebetweensttucturejudgments " data" in somesense andsbingjudgmentsis not thatthe formerare competence " " andthe latterare performancedata. Rather, the distinctionrestson a working assumptionabouthow the dataareto be explained, namely, that the dataarea result of the interactionof grammaticalconstraintswith computationalconstraints . Certainaspectsof thedataareassumedto bereflectionsof grammatical constraints , andeverythingelseis ascribedto failuresof the processorto translate grammaticalconstraintstransparentlyinto behavior, whetherbecauseof memorylimits or heuristicparsingstrategiesor whateverobscuremechanisms of judgments.We arejustified in ignoringthoseaspectsof the creategradedness . to the idiosyncraciesof the processor we ascribe that data . But this distinctiondoesnot hold up underscrutiny Dividing the humanlanguage capacityinto grammarand processoris only a mannerof speaking, a . It is naiveto expectthe of way dividing things up for theoreticalconvenience to to division correspond anymeaningfulphysiologlogical grammar/processor , onefunctioning ical division say, two physically separateneuronalassemblies es asa storeof grammarrules and the other as an activedevicethat access the grammar-rule store in the course of its operation. And even if we did , we have believein a physiologicaldivision betweengrammarand processor with not a distinction no evidenceat all to supportthat belief; it is any empirical content. A coupleof examplesmight clarify why I say that the grammar/processor . Grammarsand syntacticsttucdistinction is only for theoreticalconvenience , but turesareusedto describecomputerlanguagesaswell ashumanlanguages typical compilersdo not accessgrammarrules or consttuctparsetrees. At the level of descriptionof the operationof the compiler, grammar-rules andparse" treesexist only " virtually asabstractdescriptionsof the courseof the compu5. In addition, therearepropertiesof grammaticalityjudgmentsof a different sort that arenot beingmodeled, propertiesthat arepoorly understoodandsomewhatworrisome. arisenot infrequentlyamongjudges - it is moreoften the casethan not Disagreements that I disagreewith at leastsomeof thejudgmentsreportedin syntaxpapers,andI think seemto changewith changingtheoretical my experienceis not unusual. Judgments " : a sentencethat sounds" not too good whenoneexpectsit to be bad may assumptions ' " " . Andjudg in the if a sound not too bad grammarchangesone s expectations change mentschangewith exposure.Someconstructionsthat soundterrible on a first exposure improveconsiderably with time.
What is separately characterized as, say, grammar vs. parsing strategy at the logical level is completely intermingled at the level of compiler operation.

At the other extreme, the constraints that probably have the strongest computational flavor are the parsing strategies that are considered to underlie garden-path effects. But it is equally possible to characterize parsing preferences in grammatical terms. For example, the low attachment strategy can be characterized by assigning a cost to structures of the form [Xi+1 Xi Y Z] proportional to the depth of the subtree Y. The optimal structure is the one with the least cost. Nothing depends on how trees are actually computed: the characterization is only in terms of the shapes of trees.

If we wish to make a distinction between competence and computation, an appropriate distinction is between what is computed and how it is computed. By this measure, most "performance" issues are not computational issues at all. Characterizing the perceptions of grammaticality and ambiguity described in the previous section does not necessarily involve any assumptions about the computations done during sentence perception. It only involves characterizing the set of structures that are perceived as belonging to a given sentence. That can be done, for example, by defining a weighted grammar that assigns costs to trees, and specifying a constant C such that only structures whose cost is within distance C of the best structure are predicted to be perceived. How the set thus defined is actually computed during perception is left completely open.

We may think of competence vs. performance in terms of knowledge vs. computation, but that is merely a manner of speaking. What is really at issue is an idealization of linguistic data for the sake of simplicity.
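The point that this characterization is declarative rather than procedural can be made concrete with a small sketch, assuming invented cost values: trees are just shapes, a cost function inspects those shapes (charging configurations of the form [Xi+1 Xi Y Z] in proportion to the depth of Y), and the perceived structures are defined as those within a constant C of the best cost, with nothing said about how they are computed.

```python
# A schematic rendering (mine, not the author's) of the declarative cost-based
# characterization above. Trees are tuples (label, children); the particular
# cost constant and example trees are invented for illustration.

def depth(tree):
    label, children = tree
    return 1 + max((depth(c) for c in children), default=0)

def cost(tree):
    """For each schematic configuration [X_{i+1} -> X_i Y Z] (three children),
    charge an amount proportional to the depth of the middle subtree Y."""
    label, children = tree
    total = 0.0
    if len(children) == 3:
        total += 0.5 * depth(children[1])   # the deeper Y is, the worse
    return total + sum(cost(c) for c in children)

def perceived_set(trees, C=0.75):
    """All trees whose cost is within distance C of the best tree's cost."""
    best = min(cost(t) for t in trees)
    return [t for t in trees if cost(t) - best <= C]

if __name__ == "__main__":
    shallow = ("XP", [("X", []), ("Y", []), ("Z", [])])
    deep    = ("XP", [("X", []), ("Y", [("Y1", [("Y2", [])])]), ("Z", [])])
    print(sorted(cost(t) for t in (shallow, deep)))   # the deeper Y costs more
    print(len(perceived_set([shallow, deep])))        # only the cheap tree survives
```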
The Frictionless Plane, Autonomy and Isolation   Appeal is often made to an analogy between competence and frictionless planes in mechanics. Syntacticians focus on the data that they believe to contain the fewest complicating factors, and "clean up" the data to remove what they believe to be remaining complications that obscure simple, general principles of language. That is proper and laudable, but it is important not to lose sight of the original problem, and not to mistake complexity for irrelevancy. The test of whether the simple principles we think we have found actually have explanatory power is how well they fare in making sense of the larger picture. There is always the danger that the simple principles we arrive at are artifacts of our data selection and data adjustment. For example, it is sometimes remarked how marvelous it is that a biological system like language should be so discrete and clean, but in fact there is abundant gradedness and variability in the original data; the evidence for the discreteness and cleanness of language seems to be mostly evidence we ourselves have planted.

It has long been emphasized that syntax is autonomous. The doctrine is older than Chomsky;6 for example, Tesnière [Tesnière, 1959, p. 42] writes "... la syntaxe n'a à chercher sa propre loi qu'en elle-même. Elle est autonome" (syntax need look for its law only within itself; it is autonomous). To illustrate that structure cannot be equated with meaning, he presents the sentence pair:

le signal vert indique la voie libre
le silence vertébral indispose la voile licite

The similarity to Chomsky's later but more famous pair

revolutionary new ideas appear infrequently
colorless green ideas sleep furiously

is striking. But autonomy is not the same as isolation. Syntax is autonomous in the sense that it cannot be reduced to semantics; well-formedness is not identical to meaningfulness. But syntax in the sense of an algebraic grammar is only one piece in an account of language, and it stands or falls on how well it fits into the larger picture.

6. The cited work was completed before Tesnière's death in 1954, though it was not published until 1959.
The Holy Grail   The larger picture, and the ultimate goal of linguistics, is to describe language in the sense of that which is produced in language production, comprehended in language comprehension, acquired in language acquisition, and which, in the aggregate, varies in language variation and changes in language change. I have always taken the Holy Grail of generative linguistics to be to characterize a class of models, each of which represents a particular (potential or actual) human language L, and characterizes a speaker of L by defining the class of sentences a speaker of L produces, the structures that a speaker of L perceives for sentences; in short, by predicting the linguistic data that characterize a speaker of L.

A "Turing test" for a generative model would be something like the following. If we use the model to generate sentences at random, the sentences produced are judged by humans to be clearly sentences of the language, to "sound natural."
And in the other direction, if humans judge a sentence (or nonsentence) to have a particular structure, the model should also assign precisely that structure to the sentence.
Natural languages are such that these tests cannot be passed by an unweighted grammar. An unweighted grammar distinguishes only between grammatical and ungrammatical structures, and that is not enough. "Sounding natural" is a matter of degree. What we must mean by "randomly generating natural-sounding sentences" is that sentences are weighted by the degree to which they sound natural, and we sample sentences with a probability that accords with their weight. Moreover, the structure that people assign to a sentence is the structure they judge to have been intended by the speaker, and that judgment is also a matter of degree. It is not enough for the grammar to define the set of structures that could possibly belong to the sentence; the grammar should predict which structures humans actually perceive, and what the relative weights are in cases where humans are uncertain about which structure the speaker intended.

The long and little of it is, weighted grammars (and other species of statistical methods) characterize language in such a way as to make sense of language production, comprehension, acquisition, variation, and change. These are linguistic issues, and not computational issues, a fact that is obscured by labeling everything "performance" that is not accounted for by algebraic grammars. What is really at stake with "competence" is a provisional simplifying assumption, or an expression of interest in certain subproblems of linguistics. There is certainly no indicting an expression of interest, but it is important not to lose sight of the larger picture.
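The generation half of such a "Turing test" is easy to render schematically. The toy fragment below (an illustrative sketch, not the author's proposal) samples expansions from a weighted grammar in proportion to their weights, so that higher-weighted expansions are produced more often; the grammar, the words, and the weights are all invented.

```python
import random

# A toy weighted grammar: each symbol maps to (right-hand side, weight) pairs.
# Words not listed as symbols are terminals. Everything here is invented.
WEIGHTED_GRAMMAR = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["the", "N"], 0.7), (["NP", "PP"], 0.3)],
    "VP": [(["V", "NP"], 0.6), (["VP", "PP"], 0.4)],
    "PP": [(["in", "NP"], 1.0)],
    "N":  [(["cow"], 0.5), (["meadow"], 0.5)],
    "V":  [(["saw"], 1.0)],
}

def generate(symbol="S", rng=random):
    """Expand a symbol by sampling productions in proportion to their weights."""
    if symbol not in WEIGHTED_GRAMMAR:          # terminal word
        return [symbol]
    rhss, weights = zip(*WEIGHTED_GRAMMAR[symbol])
    rhs = rng.choices(rhss, weights=weights, k=1)[0]
    return [word for s in rhs for word in generate(s, rng)]

if __name__ == "__main__":
    for _ in range(3):
        print(" ".join(generate()))
```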
3   How Statistics Helps
Chapter1
correctparse. A stochasticcontext-freegrammarprovidesa simpleillustration. Considerthe sentenceJohn walks, andthe grammar (4) 1. 2. 3. 4. 5. 6. 7.
S - NPV S - NP NP - N NP - NN N - John N - walks V - walks
.7 .3 .8 .2 .6 .4 1.0
, oneasa sentenceand Accordingto grammar(4), John walkshastwo analyses oneasa nounphrase. (The rule S NP representsan utteranceconsistingof a single noun phrase.) The nu. mbersin the rightmost column representthe weightsof rules. The weight of an analysisis the productof the weightsof the rules usedin its derivation. In the sententialanalysisof John walks, the derivation consistsof rules 1, 3, 5, 7, so the weight is (.7)(.8)(.6)( 1.0) = .336. In the noun-phraseanalysis, the rules 2, 4, 5, 6 are used, so the weight is (.3)(.2)(.6)(.4) = .0144. The weight for the sententialanalysisis muchgreater, predicting that it is the one perceived. More refined predictions can be obtainedby hypothesizingthat an utteranceis perceivedas ambiguousif the " " next-best caseis not too much worsethan the best. If not too much worse is interpretedasa ratio of , say, not more than 2:1, we predict that John walks is perceivedas unambiguous, as the ratio betweenthe weights of the parses is 23: 1: Gradations of acceptability are not accommodated Degrees of Grammaticality in algebraic grammars: a structure is either grammatical or not. The idea of degrees of grammaticality has been entertained from time to time , and some " " classes of ungrammatical structures are informally considered to be worse than others (most notably , Empty Category Principle (ECP) violations vs. sublacency violations ) . But such degrees of gramrnaticality as have been considered have not been accorded a formal place in the theory . Empirically , acceptability judgments vary widely across sentences with a given structure, depending on lexical choices and other factors. Factors that cannot be reduced 7. The hypothesisthat only the beststructure(or possibly, structures ) areperceptibleis esto syntaxin which grammaticalityis definedas somewhatsimilar to currentapproach optimal satisfactionof constraintsor maximal economyof derivation. But I will not . hazarda guesshereaboutwhetherthat similarity is significantor merehappenstance
to a binary grammaticality distinction are either poorly modeled or ignored in standard syntactic accounts.
Degrees of grammaticality arise as uncertainty in answering the question "Can you say X?" or perhaps more accurately, "If you said X, would you feel you had made an error?" As such, they reflect degrees of error in speech production. The null hypothesis is that the same measure of goodness is used in both speech production and speech comprehension, though it is actually an open question. At any rate, the measure of goodness that is important for speech comprehension is not degree of grammaticality alone, but a global measure that combines degree of grammaticality with at least naturalness and structural preference (i.e., "parsing strategies").
We must also distinguish degrees of grammaticality, and indeed global goodness, from the probability of producing a sentence. Measures of goodness and probability are mathematically similar enhancements to algebraic grammars, but goodness alone does not determine probability. For example, for an infinite language, probability must ultimately decrease with length, though arbitrarily long sentences may be perfectly good.
Perhaps one reason that degrees of grammaticality have not found a place in standard theory is the question of where the numbers come from, if we permit continuous degrees of grammaticality. The answer to where the numbers come from is parameter estimation. Parameter estimation is well understood for a number of models of interest, and can be seen psychologically as part of what goes on during language acquisition.

Naturalness
It is a bit difficult to say precisely what I mean by naturalness. A large component is plausibility, but not plausibility in the sense of world knowledge; rather, plausibility in the sense of selectional preferences, that is, semantic sortal preferences that predicates place on their arguments. Another important component of naturalness is not semantic, though, but simply "how you say it." This is what has been called collocational knowledge, like the fact that one says strong tea and powerful car, but not vice versa [Smadja, 1991], or that you say thick accent in English, but starker Akzent ("strong accent") in German.
Though it is difficult to define just what naturalness is, it is not difficult to recognize it. If one generates text at random from an explicit grammar plus lexicon, the shortcomings of the grammar are immediately obvious in the unnatural, even if not ungrammatical, sentences that are produced. It is also clear that naturalness is not at all the same thing as meaningfulness. For example, I think it is clear that differential structure is more natural than differential child,
even though I could not say what a differential structure might be. Or consider the following examples that were in fact generated at random from a grammar:

(5) a. matter-like, complete, alleged strips
       a stratigraphic, dubious scattering
       a far alternative shallow model
    b. indirect photographic-drill sources
       earlier stratigraphically precise minimums
       Europe's cyclic existence

All these examples are about on a par as concerns meaningfulness, but I think the (b) examples are rather more natural than the (a) examples.
Collocations and selectional restrictions have been two important areas of application of statistical methods in computational linguistics. Questions of interest have been both how to include them in a global measure of goodness, and how to induce them distributionally [Resnik, 1993], both as a tool for investigations and as a model of human learning.

Structural Preferences
Structural preferences, or parsing strategies, have already been mentioned. A "longest match" preference is one example. The example
(6) The emergency crews hate most is domestic violence
is a garden path because of a strong preference for the longest initial NP, The emergency crews, rather than the correct alternative, The emergency. (The correct interpretation is: The emergency [that crews hate most] is domestic violence.) The longest-match preference plays an important role in the dispreference for the structure (3) that we examined earlier.
As already mentioned, these preferences can be seen as structural preferences, rather than parsing preferences. They interact with the other factors we have been examining in a global measure of goodness. For example, in (6), an even longer match, The emergency crews hate, is actually possible, but it violates the dispreference for having plural nouns as nominal modifiers.

Error Tolerance
A remarkable property of human language comprehension is its error tolerance. Many sentences that an algebraic grammar would simply classify as ungrammatical are actually perceived to have a particular structure. A simple example is we sleeps, a sentence whose intended structure is obvious, albeit ungrammatical. In fact, an erroneous structure may actually be preferred to a grammatical analysis; consider
(7) Thanks for all you help.
which I believe is preferentially interpreted as an erroneous version of Thanks for all your help. However, there is a perfectly grammatical analysis: Thanks for all those who you help.
We can make sense of this phenomenon by supposing that a range of error-correction operations are available, though their application imposes a certain cost. This cost is combined with the other factors we have discussed to determine a global goodness, and the best analysis is chosen. In (7), the cost of error correction is apparently less than the cost of the alternative in unnaturalness or structural dispreference. Generally, error detection and correction are a major selling point for statistical methods. They were primary motivations for Shannon's noisy channel model [Shannon, 1948], which provides the foundation for many computational linguistic techniques.

Learning on the Fly
Not only is the language that one is exposed to full of errors, it is produced by others whose grammars and lexica vary from one's own. Frequently, sentences that one encounters can only be analyzed by adding new constructions or lexical entries. For example, when the average person hears a hectare is a hundred ares, they deduce that are is a noun, and succeed in parsing the sentence. But there are limits to learning on the fly, just as there are limits to error correction. Learning on the fly does not help one parse the a are of I.
Learning on the fly can be treated much like error correction. The simplest approach is to admit a space of learning operations, for example, assigning a new part of speech to a word, adding a new subcategorization frame to a verb, etc., and assign a cost to applications of the learning operations. In this way it is conceptually straightforward to include learning on the fly in a global optimization.
People are clearly capable of error correction and learning on the fly; they are highly desirable abilities given the noise and variance in the typical linguistic environment. They greatly exacerbate the problem of picking out the intended parse for a sentence, because they explode the candidate space even beyond the already large set of candidates that the grammar provides. To explain how it is nonetheless possible to identify the intended parse, there is no serious alternative to the use of weighted grammars.

Lexical Acquisition
A final factor that exacerbates the problem of identifying the correct parse is the sheer richness of natural language grammars and
lexica. A goal of earlier linguistic work, and one that is still a central goal of the linguistic work that goes on in computational linguistics, is to develop grammars that assign a reasonable syntactic structure to every sentence of English, or as nearly every sentence as possible. This is not a goal that is currently much in fashion in theoretical linguistics. Especially in Government-Binding theory (GB), the development of large fragments has long since been abandoned in favor of the pursuit of deep principles of grammar.
The scope of the problem of identifying the correct parse cannot be appreciated by examining behavior on small fragments, however deeply analyzed. Large fragments are not just small fragments several times over; there is a qualitative change when one begins studying large fragments. As the range of constructions that the grammar accommodates increases, the number of undesired parses for sentences increases dramatically.
In-breadth studies also give a different perspective on the problem of language acquisition. When one attempts to give a systematic account of phrase structure, it becomes clear just how many little facts there are that do not fall out from grand principles, but just have to be learned. The simple, general principles in these cases are not principles of syntax, but principles of acquisition. Examples are the complex constraints on sequencing of prenominal elements in English, or the syntax of date expressions (Monday June the 4th, Monday June 4, *Monday June the 4, *June 4 Monday), or the syntax of proper names (Greene County Sheriff's Deputy Jim Thurmond), or the syntax of numeral expressions.
The largest piece of what must be learned is the lexicon. If parameter-setting views of syntax acquisition are correct, then learning the syntax (which in this case does not include the low-level messy bits discussed in the previous paragraph) is actually almost trivial. The really hard job is learning the lexicon.
Acquisition of the lexicon is a primary area of application for distributional and statistical approaches to acquisition. Methods have been developed for the acquisition of parts of speech [Brill, 1993; Schütze, 1993], terminological noun compounds [Bourigault, 1992], collocations [Smadja, 1991], support verbs [Grefenstette, 1995], subcategorization frames [Brent, 1991; Manning, 1993], selectional restrictions [Resnik, 1993], and low-level phrase structure rules [Finch, 1993; Smith and Witten, 1993]. These distributional techniques do not so much compete with parameter setting as a model of acquisition as complement it, by addressing issues that parameter-setting accounts pass over in silence. Distributional techniques are also not adequate alone as models of human acquisition: whatever the outcome of the syntactic vs. semantic bootstrapping debate, children clearly do make use of situations and meaning to learn language. But the effectiveness of distributional techniques indicates at least that they might account for a component of human language learning.
4 Objections
There are a couple of general objections to statistical methods that may be lurking in the backs of readers' minds that I would like to address. First is the sentiment that, however relevant and effective statistical methods may be, they are no more than an engineer's approximation, not part of a proper scientific theory. Second is the nagging doubt: did not Chomsky debunk all this ages ago?

4.1 Are Stochastic Models Only for Engineers?
One might admit that one can account for parsing preferences by a probabilistic model, but insist that a probabilistic model is at best an approximation, suitable for engineering but not for science. On this view, we do not need to talk about degrees of grammaticality, or preferences, or degrees of plausibility. Granted, humans perceive only one of the many legal structures for a given sentence, but the perception is completely deterministic. We need only give a proper account of all the factors affecting the judgment. Consider the example:
Yesterday three shots were fired at Humberto Calvados, personal assistant to the famous tenor Enrique Felicidad, who was in Paris attending to unspecified personal matters.
Suppose for argument's sake that 60% of readers take the tenor to be in Paris, and 40% take the assistant to be in Paris. Or more to the point, suppose a particular informant, John Smith, chooses the low attachment 60% of the time when encountering sentences with precisely this structure (in the absence of an informative context), and the high attachment 40% of the time. One could still insist that no probabilistic decision is being made, but rather that there are lexical and semantic differences that we have inappropriately conflated across sentences with "precisely this structure," and if we take account of these other effects, we end up with a deterministic model after all. A probabilistic model is only a stopgap in the absence of an account of the missing factors: semantics, pragmatics, what topics I have been talking to other people about lately, how tired I am, whether I ate breakfast this morning.
By this species of argument, stochastic models are practically always a stopgap approximation. Take stochastic queue theory, for example, by which one can give a probabilistic model of how many trucks will be arriving at given depots in a transportation system. One could argue that if we could just model everything about the state of the trucks and the conditions of the roads, the location of every nail that might cause a flat, and every drunk driver that might
cause an accident, then we could in principle predict deterministically how many trucks will be arriving at any depot at any time, and there is no need of stochastic queue theory. Stochastic queue theory is only an approximation in lieu of information that it is impractical to collect.
But this argument is flawed. If we have a complex deterministic system, and if we have access to the initial conditions in complete detail, so that we can compute the state of the system unerringly at every point in time, a simpler stochastic description may still be more insightful. To use a dirty word, some properties of the system are genuinely emergent, and a stochastic account is not just an approximation, it provides more insight than identifying every deterministic factor. Or to use a different dirty word, it is a reductionist error to reject a successful stochastic account and insist that only a more complex, lower level, deterministic model advances scientific understanding.

4.2 Chomsky v. Shannon
In one's introductory linguistics course, one learns that Chomsky disabused the field once and for all of the notion that there was anything of interest to statistical models of language. But one usually comes away a little fuzzy on the question of what, precisely, he proved.
The arguments of Chomsky's that I know are from "Three Models for the Description of Language" [Chomsky, 1956] and Syntactic Structures [Chomsky, 1957] (essentially the same argument repeated in both places), and from the Handbook of Mathematical Psychology, chapter 13 [Miller and Chomsky, 1963]. I think the first argument in Syntactic Structures is the best known. It goes like this.

It is fair to assume that neither sentence (1) [colorless green ideas sleep furiously] nor (2) [furiously sleep ideas green colorless], (nor indeed any part of these sentences) has ever occurred in an English discourse. . . Yet (1), though nonsensical, is grammatical, while (2) is not. [Chomsky, 1957, p. 16]

This argument only goes through if we assume that if the frequency of a sentence or "part" is zero in a training sample, its probability is zero. But in fact, there is quite a literature on how to estimate the probabilities of events that do not occur in the sample, and in particular how to distinguish real zeros from zeros that just reflect something that is missing by chance.
Chomsky also gives a more general argument:

If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list; there appears to be no particular relation between order of approximation and grammaticalness. [Chomsky, 1957, p. 17]
Because, for any n, there are sentences with grammatical dependencies spanning more than n words, no nth-order statistical approximation can sort out the grammatical from the ungrammatical examples. In a word, you cannot define grammaticality in terms of probability.
It is clear from context that "statistical approximation to English" is a reference to nth-order Markov models, as discussed by Shannon. Chomsky is saying that there is no way to choose n and ε such that, for all sentences s,
grammatical(s) ↔ Pn(s) > ε
where Pn(s) is the probability of s according to the best nth-order approximation to English.
But Shannon himself was careful to call attention to precisely this point: that for any n, there will be some dependencies affecting the well-formedness of a sentence that an nth-order model does not capture. The point of Shannon's approximations is that, as n increases, the total mass of ungrammatical sentences that are erroneously assigned non-zero probability decreases. That is, we can in fact define grammaticality in terms of probability, as follows:
grammatical(s) ↔ lim_{n→∞} Pn(s) > 0
A third variant of the argument appears in the Handbook. There Chomsky states that parameter estimation is impractical for an nth-order Markov model where n is large enough to give a "reasonable fit to ordinary usage." He emphasizes that the problem is not just an inconvenience for statisticians, but renders the model untenable as a model of human language acquisition: "we cannot seriously propose that a child learns the values of 10^9 parameters in a childhood lasting only 10^8 seconds."
This argument is also only partially valid. If it takes at least a second to estimate each parameter, and parameters are estimated sequentially, the argument is correct. But if parameters are estimated in parallel, say, by a high-dimensional iterative or gradient-pursuit method, all bets are off. Nonetheless, I think even the most hardcore statistical types are willing to admit that Markov models represent a brute-force approach, and are not an adequate basis for psychological models of language processing. However, the inadequacy of Markov models is not that they are statistical, but that they are statistical versions of finite-state automata. Each of Chomsky's arguments turns on the fact that Markov models are finite-state, not on the fact that they are stochastic. None of his criticisms are applicable to stochastic models generally. More sophisticated stochastic models do exist: stochastic context-free grammars are well understood, and stochastic versions of Tree Adjoining
Grammar [Resnik, 1992], GB [Fordham and Crocker, 1994], and HPSG [Brew, 1995] have been proposed.
In fact, probabilities make Markov models more adequate, not less adequate, than their nonprobabilistic counterparts. Markov models are surprisingly effective, given their finite-state substrate. For example, they are the workhorse of speech recognition technology. Stochastic grammars can also be easier to learn than their nonstochastic counterparts. For example, though Gold [Gold, 1967] showed that the class of context-free grammars is not learnable, Horning [Horning, 1969] showed that the class of stochastic context-free grammars is learnable.
In short, Chomsky's arguments do not bear at all on the probabilistic nature of Markov models, only on the fact that they are finite-state. His arguments are not by any stretch of the imagination a sweeping condemnation of statistical methods.

5 Conclusion
In closing, let me repeat the main line of argument as concisely as I can.
Statistical methods, by which I mean primarily weighted grammars and distributional induction methods, are clearly relevant to language acquisition, language change, language variation, language generation, and language comprehension. Understanding language in this broad sense is the ultimate goal of linguistics.
The issues to which weighted grammars apply, particularly as concerns perception of grammaticality and ambiguity, one may be tempted to dismiss as "performance" issues. However, the set of issues labeled "performance" are not essentially computational, as one is often led to believe. Rather, "competence" represents a provisional narrowing and simplification of data in order to understand the algebraic properties of language. "Performance" is a misleading term for everything else. Algebraic methods are inadequate for understanding many important properties of human language, such as the measure of goodness that permits one to identify the correct parse out of a large candidate set in the face of considerable noise.
Many other properties of language, as well, that are mysterious given unweighted grammars, properties such as the gradualness of rule learning, the gradualness of language change, dialect continua, and statistical universals, make a great deal more sense if we assume weighted or stochastic grammars.
There is a huge body of mathematical techniques that computational linguists have begun to tap, yielding tremendous progress on previously intransigent problems. The focus in computational linguistics has admittedly been on
technology. But the same techniques promise progress on issues concerning the nature of language that have remained mysterious for so long. The time is ripe to apply them.

Acknowledgments
I thank Tilman Hoehle, Graham Katz, Marc Light, and Wolfgang Sternefeld for their comments on an earlier draft of this chapter. All errors and outrageous
opinions are, of course, my own.

References
Didier Bourigault. Surface grammatical analysis for the extraction of terminological noun phrases. In COLING-92, Vol. 3, pp. 977-981, 1992.
Michael R. Brent. Automatic acquisition of subcategorization frames from untagged, free-text corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 209-214, 1991.
Chris Brew. Stochastic HPSG. In Proceedings of EACL-95, 1995.
Eric Brill. Transformation-Based Learning. PhD thesis, University of Pennsylvania, Philadelphia, 1993.
Noam Chomsky. Three models for the description of language. IRE Transactions on Information Theory, IT-2(3): 113-124, 1956. New York, Institute of Radio Engineers.
Noam Chomsky. Syntactic Structures. The Hague, Mouton, 1957.
Noam Chomsky. The logical basis of linguistic theory. In Horace Lunt, editor, Proceedings of the Ninth International Congress of Linguists, pp. 914-978, The Hague, Mouton, 1964.
Steven Paul Finch. Finding Structure in Language. PhD thesis, University of Edinburgh, 1993.
Andrew Fordham and Matthew Crocker. Parsing with principles and probabilities. In The Balancing Act: Combining Symbolic and Statistical Approaches to Language, 1994.
E. Mark Gold. Language identification in the limit. Information and Control, 10(5): 447-474, 1967.
Gregory Grefenstette. Corpus-based method for automatic identification of support verbs for nominalizations. In EACL-95, 1995.
John A. Hawkins. Word Order Universals. New York, Academic Press, 1983.
John A. Hawkins. A parsing theory of word order universals. Linguistic Inquiry, 21(2): 223-262, 1990.
James Jay Horning. A Study of Grammatical Inference. PhD thesis, Stanford (Computer Science), 1969.
Terrence Kaufman. The native languages of Latin America: general remarks. In Christopher Moseley and R. E. Asher, editors, Atlas of the World's Languages, pp. 31-33, London, Routledge, 1994.
Brett Kessler. Computational dialectology in Irish Gaelic. In EACL-95, 1995.
Christopher D. Manning. Automatic acquisition of a large subcategorization dictionary from corpora. In 31st Annual Meeting of the Association for Computational Linguistics, pp. 235-242, 1993.
George A. Miller and Noam Chomsky. Finitary models of language users. In R. D. Luce, R. Bush, and E. Galanter, editors, Handbook of Mathematical Psychology, chapter 13. New York, Wiley, 1963.
Philip Resnik. Probabilistic Tree-Adjoining Grammar as a framework for statistical natural language processing. In COLING-92, pp. 418-424, 1992.
Philip Resnik. Selection and Information. PhD thesis, University of Pennsylvania, Philadelphia, 1993.
Hinrich Schütze. Part-of-speech induction from scratch. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 251-258, 1993.
Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3-4): 379-423, 623-656, 1948.
Frank Smadja. Microcoding the lexicon for language generation. In Uri Zernik, editor, Lexical Acquisition: Using On-Line Resources to Build a Lexicon. Cambridge, Mass.: The MIT Press, 1989.
Frank Smadja. Extracting Collocations from Text. An Application: Language Generation. PhD thesis, Columbia University, New York, 1991.
Tony C. Smith and Ian H. Witten. Language inference from function words. Manuscript, University of Calgary and University of Waikato, January 1993.
Whitney Tabor. Syntactic Innovation: A Connectionist Model. PhD thesis, Stanford University, 1994.
Mary Tait. North America. In Christopher Moseley and R. E. Asher, editors, Atlas of the World's Languages, pp. 3-30, London, Routledge, 1994.
Lucien Tesnière. Éléments de Syntaxe Structurale. 2nd edition. Paris: Klincksieck, 1959.
J. R. R. Tolkien. The Hobbit. Boston, Houghton Mifflin, 1966.
Willard van Orman Quine. Word and Object. Cambridge, Mass., The MIT Press, 1960.
Victor Yngve. Language as an error correcting code. Technical Report 33:XV, Quarterly Report of the Research Laboratory of Electronics, The Massachusetts Institute of Technology, April 15, 1954, pp. 73-74.
Chapter 2 Qualitative and Quantitative Models of Speech Translation
Hiyan Alshawi
Alshawi achieves two goals in this chapter. First, he challenges the notion that the identification of a statistical-symbolic distinction in language processing is an instance of the empirical vs. rational debate. Second, Alshawi proposes models for speech translation that retain aspects of qualitative design while moving toward incorporating quantitative aspects for structural dependency, lexical transfer, and linear order.
On the topic of the place of the statistical-symbolic distinction in natural language processing, Alshawi points to the fact that rule-based approaches are becoming increasingly probabilistic. However, at the same time, since language is symbolic by nature, the notion of building a "purely" statistical model may not be meaningful. Alshawi suggests that the basis for the contrast is in fact a distinction between qualitative systems dealing exclusively with combinatoric constraints, and quantitative systems dealing with the computation of numerical functions.
Of course, the problem still remains of how and where to introduce quantitative modeling into language processing. Alshawi proposes a model to do just this, specifically for the language translation task. The design reflects the conventional qualitative transfer approach, that is, starting with a logic-based grammar and lexicon to produce a set of logical forms that are then filtered, passed to a translation component, and then given to a generation component mapping logical forms to surface syntax, which is then fed to the speech synthesizer. Alshawi then methodically analyzes which of these steps would be improved by the introduction of quantitative modeling. His step-by-step analysis, considering specific ways to improve the basic qualitative model, illustrates the variety of possibilities still to be explored in achieving the optimal balance among types of models. - Eds.
1 Introduction
In recent years there has been a resurgence of interest in statistical approaches to natural language processing. Such approaches are not new, witness the statistical approach to machine translation suggested by Weaver [1955], but the current level of interest is largely due to the success of applying hidden Markov models and N-gram language models in speech recognition. This success was directly measurable in terms of word recognition error rates, prompting language processing researchers to seek corresponding improvements in performance and robustness. A speech translation system, which by necessity combines speech and language technology, is a natural place to consider combining the statistical and conventional approaches, and much of this chapter describes probabilistic models of structural language analysis and translation.
My aim is to provide an overall model for translation with the best of both worlds. Various factors lead us to conclude that a lexicalist statistical model with dependency relations is well suited to this goal. As well as this quantitative approach, we consider a constraint, logic-based approach and try to distinguish characteristics that we wish to preserve from those that are best replaced by statistical models. Although perhaps implicit in many conventional approaches to translation, a characterization in logical terms of what is being done is rarely given, so we attempt to make that explicit here, more or less from first principles.
Before proceeding, I first examine some fashionable distinctions in section 2 in order to clarify the issues involved in comparing these approaches. I argue that the important distinction is not so much a rational-empirical or symbolic-statistical distinction but rather a qualitative-quantitative one. This is followed in section 3 by discussion of the logic-based model, in section 4 by the overall quantitative model, in section 5 by monolingual models, in section 6 by translation models, and, in section 7, some conclusions. I concentrate throughout on what information about language and translation is coded and how it is expressed as logical constraints in one model or statistical parameters in the other.
At Bell Laboratories, we have built a speech translation system with the same underlying motivation as the quantitative model presented here. Although the quantitative model used in that system is different from the one presented here, they can both be viewed as statistical models of dependency grammar. In building the system, we had to address a number of issues that are beyond the scope of this chapter, including parameter estimation and the development of efficient search algorithms.
2 Qualitative and Quantitative Models
One contrast often taken for granted is the identification of a statistical-symbolic distinction in language processing as an instance of the empirical vs. rational debate. I believe this contrast has been exaggerated, though historically it has had some validity in terms of accepted practice. Rule-based approaches have become more empirical in a number of ways: First, a more empirical approach is being adopted to grammar development whereby the rule set is modified according to its performance against corpora of natural text (e.g. [Taylor et al., 1989]). Second, there is a class of techniques for learning rules from text, a recent example being [Brill, 1993]. Conversely, it is possible to imagine building a language model in which all probabilities are estimated according to intuition without reference to any real data, giving a probabilistic model that is not empirical.
Most language processing labeled as statistical involves associating real-number-valued parameters to configurations of symbols. This is not surprising given that natural language, at least in written form, is explicitly symbolic. Presumably, classifying a system as symbolic must refer to a different set of (internal) symbols, but even this does not rule out many statistical systems modeling events involving nonterminal categories and word senses. Given that the notion of a symbol, let alone an "internal symbol," is itself a slippery one, it may be unwise to build our theories of language, or even the way we classify different theories, on this notion.
Instead, it would seem that the real contrast driving the shift toward statistics in language processing is a contrast between qualitative systems dealing exclusively with combinatoric constraints, and quantitative systems that involve computing numerical functions. This bears directly on the problems of brittleness and complexity that discrete approaches to language processing share with, for example, reasoning systems based on traditional logical inference. It relates to the inadequacy of the dominant theories in linguistics to capture "shades of meaning" or degrees of acceptability which are often recognized by people outside the field as important inherent properties of natural language. The qualitative-quantitative distinction can also be seen as underlying the difference between classification systems based on feature specifications, as used in unification formalisms [Shieber, 1986], and clustering based on a variable degree of granularity (e.g. [Pereira et al., 1993]).
It seems unlikely that these continuously variable aspects of fluent natural language can be captured by a purely combinatoric model. This naturally leads
to the question of how best to introduce quantitative modeling into language processing. It is not, of course, necessary for the quantities of a quantitative model to be probabilities. For example, we may wish to define real-valued functions on parse trees that reflect the extent to which the trees conform to, say, minimal attachment and parallelism between conjuncts. Such functions have been used in tandem with statistical functions in experiments on disambiguation (e.g. [Alshawi and Carter, 1994]). Another example is connection strengths in neural network approaches to language processing, though it has been shown that certain networks are effectively computing probabilities [Richard and Lippmann, 1991].
Nevertheless, probability theory does offer a coherent and relatively well-understood framework for selecting between uncertain alternatives, making it a natural choice for quantitative language processing. The case for probability theory is strengthened by a well-developed empirical methodology in the form of statistical parameter estimation. There is also the strong connection between probability theory and the formal theory of information and communication, a connection that has been exploited in speech recognition, for example, using the concept of entropy to provide a motivated way of measuring the complexity of a recognition problem [Jelinek et al., 1992].
Even if probability theory remains, as it currently is, the method of choice in making language processing quantitative, this still leaves the field wide open in terms of carving up language processing into an appropriate set of events for probability theory to work with. For translation, a very direct approach using parameters based on surface positions of words in source and target sentences was adopted in the Candide system [Brown et al., 1990]. However, this does not capture important structural properties of natural language. Nor does it take into account generalizations about translation that are independent of the exact word order in source and target sentences. Such generalizations are, of course, central to qualitative structural approaches to translation (e.g. [Isabelle and Macklovitch, 1986; Alshawi et al., 1992]).
The aim of the quantitative language and translation models presented in sections 5 and 6 is to employ probabilistic parameters that reflect linguistic structure without discarding rich lexical information or making the models too complex to train automatically. In terms of a traditional classification, this would be seen as a "hybrid symbolic-statistical" system because it deals with linguistic structure. From our perspective, it can be seen as a quantitative version of the logic-based model, because both models attempt to capture similar information (about the organization of words into phrases and relations holding between these phrases or their referents), though the tools of modeling are substantially different.
3 Dissecting a Logic-Based System

We now consider a hypothetical speech translation system in which the language processing components follow a conventional qualitative transfer design. Although hypothetical, this design and its components are similar to those used in existing database query [Rayner and Alshawi, 1992] and translation systems [Alshawi et al., 1992]. More recent versions of these systems have been gradually taking on a more quantitative flavor, particularly with respect to choosing between alternative analyses, but our hypothetical system will be more purist in its qualitative approach.
The overall design is as follows. We assume that a speech recognition subsystem delivers a list of text strings corresponding to transcriptions of an input utterance. These recognition hypotheses are passed to a parser which applies a logic-based grammar and lexicon to produce a set of logical forms, specifically formulas in first order logic corresponding to possible interpretations of the utterance. The logical forms are filtered by contextual and word-sense constraints, and one of them is passed to the translation component. The translation relation is expressed by a set of first order axioms which are used by a theorem prover to derive a target language logical form that is equivalent (in some context) to the source logical form. A grammar for the target language is then applied to the target form, generating a syntax tree whose fringe is passed to a speech synthesizer. Taking the various components in turn, we make a note of undesirable properties that might be improved by quantitative modeling.

3.1 Analysis and Generation
A grammar, expressed as a set of syntactic rules (axioms) Gsyn and a set of semantic rules (axioms) Gsem, is used to support a relation form holding between strings s and logical forms φ expressed in first order logic:
Gsyn ∪ Gsem ⊨ form(s, φ)
The relation form is many-to-many, associating a string with linguistically possible logical form interpretations. In the analysis direction, we are given s and search for logical forms φ, while in generation we search for strings s given φ. For analysis and generation, we are treating strings s and logical forms φ as object-level entities. In interpretation and translation, we will move down from this meta-level reasoning to reasoning with the logical forms as propositions.
The list of text strings handed by the recognizer to the parser can be assumed to be ordered in accordance with some acoustic scoring scheme internal to the
recognizer. The magnitude of the scores is ignored by our qualitative language processor; it simply processes the hypotheses one at a time until it finds one for which it can produce a complete logical form interpretation that passes grammatical and interpretation constraints, at which point it discards the remaining hypotheses. Clearly, discarding the acoustic score and taking the first hypothesis that satisfies the constraints may lead to an interpretation that is less plausible than one derivable from a hypothesis further down in the recognition list. But there is no point in processing these later hypotheses since we will be forced to select one interpretation essentially at random.

Syntax
The syntactic rules in Gsyn relate category predicates c0, c1, c2 holding of a string and two spanning substrings (we limit the rules here to two daughters for simplicity):
c0(s0) ∧ daughters(s0, s1, s2) ← c1(s1) ∧ c2(s2) ∧ (s0 = concat(s1, s2))
(Here, and subsequently, variables like s0 and s1 are implicitly universally quantified.) Gsyn also includes lexical axioms for particular strings w consisting of single words:
c1(w), ..., cm(w)
For a feature-based grammar, these rules can include conjuncts constraining the values, a1, a2, ..., of discrete-valued functions f on the strings:
f(w) = a1, f(s0) = f(s1)
The main problem here is that such grammars have no notion of a degree of grammatical acceptability: a sentence is either grammatical or ungrammatical. For small grammars this means that perfectly acceptable strings are often rejected; for large grammars we get a vast number of alternative trees, so the chance of selecting the correct tree for simple sentences can get worse as the grammar coverage increases. There is also the problem of requiring increasingly complex feature sets to describe idiosyncrasies in the lexicon.
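As an illustration only (not Alshawi's system), the following is a minimal Python sketch of the purely combinatoric behaviour these axioms describe: binary rules relating a mother category to two daughter categories, plus lexical category axioms, applied by trying every split of the string. The toy rules and lexicon are hypothetical; the point is that the relation either holds or fails, with no degrees of acceptability.

    # Hypothetical toy grammar: each rule says c0 -> c1 c2.
    RULES = [("S", "NP", "VP"), ("NP", "DET", "N"), ("VP", "V", "NP")]
    LEXICON = {"the": {"DET"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

    def has_category(words, cat):
        """True iff the word string can be analysed as category cat."""
        if len(words) == 1:
            return cat in LEXICON.get(words[0], set())
        # Try every way of splitting the string into two daughters s1, s2.
        for i in range(1, len(words)):
            for c0, c1, c2 in RULES:
                if c0 == cat and has_category(words[:i], c1) and has_category(words[i:], c2):
                    return True
        return False

    print(has_category("the dog saw the cat".split(), "S"))   # True
    print(has_category("dog the saw cat the".split(), "S"))   # False: accepted or rejected, nothing in between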
Semantics
Semantic grammar axioms belonging to Gsem specify a "composition" function g for deriving a logical form for a phrase from those for its subphrases:
form(s0, g(φ1, φ2)) ← daughters(s0, s1, s2) ∧ c1(s1) ∧ c2(s2) ∧ c0(s0) ∧ form(s1, φ1) ∧ form(s2, φ2)
The interpretation rules for strings bottom out in a set of lexical semantic rules associating words with predicates (p1, p2, ...) corresponding to "word senses." For a particular word and syntactic category, there will be a (small, possibly empty) finite set of such word-sense predicates:
ci(w) → form(w, pi1)
...
ci(w) → form(w, pin)
First order logic was assumed as the semantic representation language because it comes with well-understood, if not very practical, inferential machinery for constraint solving. However, applying this machinery requires making logical forms fine-grained to a degree often not warranted by the information the speaker of an utterance intended to convey. An example of this is explicit scoping, which leads (again) to large numbers of alternatives which the qualitative model has difficulty choosing between. Also, many natural language sentences cannot be expressed in first order logic without resort to elaborate formulas requiring complex semantic composition rules. These rules can be simplified by using a higher order logic, but at the expense of even less practical inferential machinery.
In applying the grammar in generation we are faced with the problem of balancing over- and undergeneration by tweaking grammatical constraints, there being no way to prefer fully grammatical target sentences over more marginal ones. Qualitative approaches to grammar tend to emphasize the ability to capture generalizations as the main measure of success in linguistic modeling. This might explain why producing appropriate lexical collocations is rarely addressed seriously in these models, even though lexical collocations are important for fluent generation. The study of collocations for generation fits in more naturally with statistical techniques, as illustrated by Smadja and McKeown [1990].

3.2 Interpretation
In the logic-based model, interpretation is the process of identifying from the possible interpretations of s for which form(s,

the Φ² coefficient introduced by [Gale and Church, 1991]:
c;(w) - +form(w, pin) First order logic was assumedas the semanticrepresentationlanguage becauseit comes with well-understood , if not very practical, inferential for constraint . However , applying this machineryrequires machinery solving makinglogical forms fme-grainedto a degreeoften not warrantedby the information the speakerof an utterance.intendedto convey. An exampleof this is explicit scopingwhich leads(again) to largenumbersof alternativeswhich the qualitative model has difficulty choosingbetween. Also, many natural language cannotbe expressedin first orderlogic without resortto elaborate sentences formulasrequiringcomplexsemanticcompositionrules. Theserulescan be simplified by usinga higherorderlogic but at theexpenseof evenlesspractical inferentialmachinery. In applyingthe grammarin generationwe arefacedwith the problemof balancing over- and undergeneration by tweakinggrammaticalconstraints,there no to being way prefer fully grammaticaltargetsentencesover more marginal ones. Qualitativeapproach es to grammartendto emphasizethe ability to capture generalizationsasthemainmeasureof successin linguistic modeling. This might explain why producing appropriatelexical collocations is rarely addressed seriouslyin thesemodels, eventhoughlexical collocationsareimportant for fluent generation.The studyof collocationsfor generationfits in more , as illustratedby Smajdaand McKeown naturally with statisticaltechniques [ 1990] . 3.2 Interpretation In the logic-basedmodel, interpretationis the processof identifying from the possibleinterpretationsof s for which! orm(s, 2coefficientintroduced by [GaleandChurch,1991]:
or the Loglike coefficientintroducedby [Dunning, 1993] : Loglike = a log a + b log b + clog c + d log d - (a + b) log (a + .b) - (a + c) log (a + c) - (b + d) log (b + d) - (c + d) log (c + d) + (a + b + c + d) log (a + b + c + d)
(4)
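A minimal sketch (in Python, not from the chapter) of the Loglike computation as given in the formula above, assuming a, b, c, d are the four cells of a candidate pair's 2x2 contingency table; the 0 log 0 = 0 convention keeps empty cells from contributing.

    import math

    def xlogx(x):
        # Convention: 0 * log 0 = 0, so empty cells do not contribute.
        return x * math.log(x) if x > 0 else 0.0

    def loglike(a, b, c, d):
        """Loglike score from the four cells of a 2x2 contingency table."""
        return (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
                - xlogx(a + b) - xlogx(a + c) - xlogx(b + d) - xlogx(c + d)
                + xlogx(a + b + c + d))

    # Hypothetical counts for one candidate pair: the higher the score,
    # the stronger the bond between the two lemmas.
    print(loglike(223, 120, 45, 19000))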
A property of these scores is that their values increase with the strength of the bond of the lemmas. We have tried out several scores (more than 10) including IM, Φ², and Loglike, and we have sorted the pairs following the score value. Each score proposes a conceptual sort of the pairs. This sort, however, could put at the top of the list compounds that belong to general language rather than to the telecommunications domain. Since we want to obtain a list
of telecommunication terms, it is essential to evaluate the correlation between the score values and the pairs and to find out which scores are the best to extract terminology. Therefore, we compared the values obtained for each score to a reference list of the domain. We obtained a list of over 6,000 French terms from EURODICAUTOM, the terminology data bank of the EEC, telecommunications section, which was developed by experts. We carried out the evaluation on 2,200 French pairs³ of N1 de (DET) N2 structure, the most frequent and common French term structure, extracted from our corpus SCH (200,000 words). To limit the size of the reference list, we retained the intersection between our list of candidates and the EURODICAUTOM list, 1,200 pairs, thus getting rid of terms which we would not find in our corpus anyway, even if they belong to this technical domain. We assume that the reference list, 1,200 pairs of our list of 2,200 candidate pairs, is as complete as possible, so that base-terms that we might identify in our corpus are indeed found in the reference list. Each score yields a list where the candidates are sorted according to the decreasing score value. We have divided this list in equivalence classes which generally contain 50 successive pairs. The results of a score are represented graphically by a histogram in which the x-axis represents the different classes, and the y-axis the ratio of good pairs. If all the pairs in a class belong to the reference list, we obtain the maximum ratio of 1; if none of the pairs appears in the reference list, the minimum ratio of 0 is reached. The ideal score should assign high (low) values to good (bad) pairs, that is, candidates which belong (which do not belong) to the reference list; in other words, the histogram of the ideal score should assign to equivalence classes containing the high (low) values of the score a ratio close to 1 (0). We are not going to present here all the histograms obtained (see [Daille, 1994]). All of them show a general trend that confirms that the score values increase with the strength of the bond of the lemmas. However, the growth is more or less clear, with more or less sharp variations. The most beautiful histogram is the simple frequency of the pair (see figure 3.1). This histogram shows that the more frequent the pair is, the more likely the pair is a term. Frequency is the most significant score for detecting terms of a technical domain. This result contradicts numerous results of lexical resources, which claim that association criteria are more significant than frequency: for example, all the most frequent pairs whose terminological status is undoubted share low values of association ratio [equation (1)], as, for example, réseau à satellites (satellite network) IM = 2.57,

3. Only pairs which appear at least twice in the corpus have been retained.
Figure 3.1
Frequency histogram.
liaison par satellite (satellite link) IM = 2.72, circuit téléphonique (telephone circuit) IM = 3.32, station spatiale (space station) IM = 1.17, etc. The remaining problem with the sort proposed by frequency is that it very quickly integrates bad candidates, that is, pairs which are not terms. So, we have preferred to elect the Loglike coefficient [equation (3)] the best score. Indeed, the Loglike coefficient, which is a real statistical test, takes into account the pair frequency but accepts very little noise for high values. To give an element of comparison, the first bad candidate with frequency for the general pattern N1 (PREP (DET)) N2 is the pair (cas, transmission), which appears in 56th place; this pair, which is also the first bad candidate with Loglike, appears in 176th place. We give in table 3.2 the topmost 11 French pairs sorted by the Loglike coefficient (Logl) (Nbc is the number of the pair occurrences and IM the value of association ratio).

3.2 Diversity
Diversity, introduced by [Shannon, 1949], characterizes the marginal distribution of the lemmas of a pair through the range of pairs. Its computation uses a
Table 3.2
Topmost pairs

Pairs of N1 (PREP (DET)) N2 structure | The most frequent pair sequence | Logl | Nbc | IM
(largeur, bande) | largeur de bande (197) (bandwidth) | 1328 | 223 | 5.74
(température, bruit) | température de bruit (110) (noise temperature) | 777 | 126 | 6.18
(bande, base) | bande de base (142) (baseband) | 745 | 145 | 5.52
(amplificateur, puissance) | amplificateur(s) de puissance (137) (power amplifier) | 728 | 137 | 5.66
(temps, propagation) | temps de propagation (93) (propagation delay) | 612 | 94 | 6.69
(règlement, radiocommunication) | règlement des radiocommunications (60) (radio regulation) | 521 | 60 | 8.14
(produit, intermodulation) | produit(s) d'intermodulation (61) (intermodulation product) | 458 | 61 | 7.45
(taux, erreur) | taux d'erreur (70) (error ratio) | 420 | 70 | 6.35
(mise, oeuvre) | mise en oeuvre (47) (implementation) | 355 | 47 | 7.49
(télécommunication, satellite) | télécommunication(s) par satellite (88) (satellite communications) | 353 | 99 | 4.09
(bilan, liaison) | bilan de liaison (37) (link budget) | 344 | 55 | 6.42
contingency table of length n; we give below as an example the contingency table that is associated with the pairs of the N ADJ structure:

N_i \ Adj_j | progressif | porteur | ... | Total
onde | 19 | 4 | ... | nb(onde, .)
cornet | 9 | 0 | ... | nb(cornet, .)
... | ... | ... | ... | ...
Total | nb(., progressif) | nb(., porteur) | ... | nb(., .)
The line counts nb_i., which are found in the rightmost column, represent the distribution of the adjectives with regard to a given noun. The column counts nb_.j, which are found on the last line, represent the distribution of the nouns with regard to a given adjective. These distributions are called the "marginal distributions" of the nouns and the adjectives for the N ADJ structure. Diversity is computed for each lemma appearing in a pair, using the formula:

H_i = nb_i. log nb_i. − Σ_{j=1..s} nb_ij log nb_ij     (4)
H_j = nb_.j log nb_.j − Σ_{i=1..s} nb_ij log nb_ij

For example, using the contingency table of the N ADJ structure above, the diversity of the noun onde is equal to:

H(onde, .) = nb(onde, .) log nb(onde, .) − (nb(onde, progressif) log nb(onde, progressif) + nb(onde, porteur) log nb(onde, porteur) + ...)

We note H1, the diversity of the first lemma of a pair, and H2, the diversity of the second lemma. We take into account the diversity normalized by the number of occurrences of the pairs:
h_i = H_i / nb_i.
h_j = H_j / nb_.j

The normalized diversities h1 and h2 are defined from H1 and H2. The normalized diversity provides interesting information about the distribution of the pair lemmas in the set of pairs. A lemma with a high diversity means that it appears in several pairs in equal proportion; conversely, a lemma that appears only in one pair owns a zero diversity (minimal value), and this whatever the frequency of the pair. High values of h1 applied to the pairs of N ADJ structure characterize nouns that could be seen as keywords of the domain: réseau (network), signal, antenne (antenna), satellite. Conversely, high values of h2 applied to the pairs of N ADJ structure characterize adjectives that do not take part in base-terms, such as nécessaire (necessary), suivant (following), important, différent (various), tel (such), etc. The pairs with a zero diversity on one of their lemmas receive high values of association ratio and other association criteria and a nondefinite value of the Loglike coefficient. However, the diversity is more precise because it indicates if the two lemmas appear only together, as for (océan, indien) (Indian Ocean) (H1 = h1 = H2 = h2 = 0),
or if not, which of the two lemmas appears only with the other, as for (réseau, maillé) (mesh network) (H2 = h2 = 0), where the adjective maillé appears only with réseau, or for (codeur, idéal) (ideal coder) (H1 = h1 = 0), where the noun codeur appears only with the adjective idéal. Other examples are: (île, salomon) (Solomon Island), (hélium, gazeux) (helium gas), (suppresseur, écho) (echo suppressor). These pairs collect many frozen compounds and collocations of the current language. In future work, we will investigate how to incorporate the good results provided by diversity into an automatic extraction algorithm.
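A minimal sketch (Python, not from the chapter) of the diversity computation over pair counts nb_ij, assuming the normalizer in h is the lemma's total count nb_i. (respectively nb_.j), as the formulas above suggest; the counts reuse the onde/cornet example.

    import math
    from collections import defaultdict

    # Hypothetical pair counts nb_ij for the N ADJ structure: (noun, adjective) -> count.
    counts = {("onde", "progressif"): 19, ("onde", "porteur"): 4,
              ("cornet", "progressif"): 9, ("cornet", "porteur"): 0}

    def diversities(counts):
        """Return diversity H and normalized diversity h for nouns (1) and adjectives (2)."""
        def xlogx(x):
            return x * math.log(x) if x > 0 else 0.0
        row, col = defaultdict(int), defaultdict(int)
        for (n, a), nb in counts.items():
            row[n] += nb          # nb_i. : total for the noun
            col[a] += nb          # nb_.j : total for the adjective
        H1 = {n: xlogx(row[n]) - sum(xlogx(nb) for (n2, _), nb in counts.items() if n2 == n)
              for n in row}
        H2 = {a: xlogx(col[a]) - sum(xlogx(nb) for (_, a2), nb in counts.items() if a2 == a)
              for a in col}
        h1 = {n: H1[n] / row[n] for n in row if row[n]}
        h2 = {a: H2[a] / col[a] for a in col if col[a]}
        return H1, h1, H2, h2

    H1, h1, H2, h2 = diversities(counts)
    print(round(H1["onde"], 3), round(h1["onde"], 3))  # a noun used with several adjectives has H1 > 0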
3.3 Distance Measures
French base-terms often accept modifications of their internal structure, as has been demonstrated previously. Each time an occurrence of a pair is extracted and counted, two distances are computed: the number of items, Dist, and the number of main items, MDist, which occur between the two lemmas. Then, for each pair, the mean and the variance of the number of items and main items are computed. The variance formula is:
V(X) = (1/n) Σ_i (x_i − x̄)²     σ(X) = √V(X)

The distance measures bring interesting information concerning the morphosyntactic variations of the base-terms, but they do not allow making a decision on the status of term or non-term of a candidate. A pair that has no distance variation, whatever the distance, is or is not a term; we give now some examples of pairs which have no distance variations and which are not terms: paire de signal (a pair of signal), type d'antenne (a type of antenna), organigramme de la figure (diagram of the figure), etc. We illustrate below how the distance measures allow attributing to a pair its elementary type automatically, for example, either N1 N2, N1 PREP N2, N1 PREP DET N2, or N1 ADJ PREP (DET) N2 for the general N1 (PREP (DET)) N2 structure.

1. Pairs with no distance variation: V(X) = 0
(a) N1 N2: Dist = 2, MDist = 2
    - liaison sémaphore, liaisons sémaphores (common signaling link(s))
    - canal support, canaux support, canaux supports (bearer channel)
(b) N1 PREP N2: Dist = 3, MDist = 2
    - accusé(s) de réception (acknowledgement of receipt)
    - refroidissement à air, refroidissement par air (cooling by air)
(c) N1 PREP DET N2: Dist = 4, MDist = 2
    - sensibilité au bruit (susceptibility to noise)
    - reconnaissance des signaux (signal recognition)
(d) N1 ADJ PREP N2: Dist = 4, MDist = 3
    - réseau local de lignes, réseaux locaux de lignes (local line networks)
    - service fixe par satellite

Yj, the two pairs are discordant. In general, if the distributions of the two random variables X and Y across the various modified nouns are similar, we expect a large number of concordances, and consequently a small number of discordances, since the total number of pairs of observations is fixed for the two variables. This is justified by the fact that a large number of concordances indicates that when one of the variables takes a "large" value the other also takes a "large" value on the corresponding observation; of course, "large" is a term that is interpreted in a relative manner for each variable.
Kendall's τ is defined as Pc − Pd, where Pc and Pd are the probabilities of observing a concordance or discordance respectively. It ranges from −1 to +1, with +1 indicating complete concordance, −1 complete discordance, and 0 indicating no correlation between X and Y. We use an unbiased estimator τ̂ = (P̂c − P̂d) for τ, which incorporates a correction for ties, that is, pairs of observations where Xi = Xj or Yi = Yj [Kendall, 1975, p. 75]. The estimator is also made asymmetric by ignoring observations for which the noun frequency is zero for both X and Y. For more details on the similarity measure and its advantages for our task, see [Hatzivassiloglou, 1995a].
Using the computed similarity scores and, optionally, the established relationship of non-relatedness, a nonhierarchical clustering method [Späth, 1985] assigns the adjectives to groups in a way that maximizes the within-group similarity (and therefore also maximizes the between-group dissimilarity). The system is given the number of groups to form as an input parameter.¹ The clustering algorithm operates in an iterative manner, starting from a random partition of the adjectives. An objective function Φ is used to score the current clustering. Each adjective is considered in turn and all possible moves of that adjective to another cluster are considered. The move that leads to the largest improvement in the value of Φ is executed, and the cycle continues through the set of words until no more improvements to the value of Φ are possible. Note that in this way a word may be moved several times before its final group is determined. This is a hill-climbing method and therefore is guaranteed to converge in finite time, but it may lead to a local minimum of Φ, inferior to the global minimum that corresponds to the optimal solution. To alleviate this problem, the partitioning algorithm is called repeatedly² with different random starting partitions and the best solution produced from these runs is kept. Figure 4.1 shows an example clustering produced by our system for one of the adjective sets analyzed in this chapter.

2.3 Evaluation
In many natural language processing applications, evaluation is performed either by using internal evaluation measures (such as perplexity [Brown et al.,

1. Determining this number from the data is probably the hardest problem in cluster analysis in general; see [Kaufman and Rousseeuw, 1990]. However, a reasonably good value for this parameter can be selected for our problem using heuristic methods.
2. In the current implementation, 50 times for each value of the number-of-clusters parameter.
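Returning to the similarity measure above, the following is a minimal sketch (Python, not the authors' implementation) of the concordance-based estimate of τ from two adjectives' frequency vectors over the same modified nouns. Observations that are zero for both variables are dropped, and pairs tied on either variable are simply excluded from the count, one straightforward way to realize the tie handling mentioned in the text.

    def tau_hat(x, y):
        """Estimate Kendall's tau for two paired sequences of noun frequencies."""
        # Drop observations where the noun frequency is zero for both adjectives.
        obs = [(a, b) for a, b in zip(x, y) if a != 0 or b != 0]
        concordant = discordant = usable = 0
        for i in range(len(obs)):
            for j in range(i + 1, len(obs)):
                dx = obs[i][0] - obs[j][0]
                dy = obs[i][1] - obs[j][1]
                if dx == 0 or dy == 0:      # tied pair: excluded from the estimate
                    continue
                usable += 1
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
        return (concordant - discordant) / usable if usable else 0.0

    # Two hypothetical adjectives' frequencies across the same five modified nouns.
    print(tau_hat([3, 0, 7, 2, 1], [4, 1, 9, 2, 0]))   # close to +1 means similar distributions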
6. generous outrageous unreasonable
7. endless protracted
8. plain
9. hostile unfriendly
10. delicate fragile unstable
11. affluent impoverished prosperous
12. brilliant clever energetic smart stupid
13. communist leftist
14. astonishing meager vigorous
15. catastrophic disastrous harmful
16. dry exotic wet
17. chaotic turbulent
18. confusing misleading
19. dismal gloomy
20. dual multiple pleasant
21. fat slim
22. affordable inexpensive
23. abrupt gradual stunning
24. flexible lenient rigid strict stringent

Figure 4.1
Example clustering found by the system using all linguistic modules.
The two evaluation measures defined above rate complementary aspects of the correctness of the evaluated partition. In order to perform comparisons between different variants of the grouping system, corresponding to the use of different combinations of linguistic modules, we need to convert this pair of scores to a single number. For this purpose we use the F-measure score [Van Rijsbergen, 1979], which produces a number between precision and recall that is larger when the two measures are close together, and thus favors partitions that are balanced in the two types of errors (false positives and false negatives). Placing equal weight on precision and recall, the F-measure is defined as

F = 2 P R / (P + R)

where P is precision and R is recall.
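The pairwise evaluation just described can be sketched directly; in the illustration below each partition is assumed to be represented as a mapping from word to group label, and the function names are illustrative rather than the chapter's implementation.

from itertools import combinations

def pair_decisions(partition):
    # One yes/no decision for every unordered pair of words:
    # "yes" if the partition places the two words in the same group.
    words = sorted(partition)
    return {(a, b): partition[a] == partition[b] for a, b in combinations(words, 2)}

def precision_recall_f(tested, reference):
    # tested and reference map each word to a group label.
    yes_tested = {p for p, same in pair_decisions(tested).items() if same}
    yes_reference = {p for p, same in pair_decisions(reference).items() if same}
    correct = yes_tested & yes_reference
    precision = len(correct) / len(yes_tested) if yes_tested else 0.0
    recall = len(correct) / len(yes_reference) if yes_reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

system = {"dismal": 1, "gloomy": 1, "fat": 2, "slim": 2, "plain": 3}
model = {"dismal": 1, "gloomy": 1, "fat": 2, "slim": 3, "plain": 3}
print(precision_recall_f(system, model))  # (0.5, 0.5, 0.5)

This covers the single-model case; the generalized multi-model measures described next extend these pairwise decisions across several human models.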
Up to this point, we have considered comparisons of one partition of words against another such partition. However, given the considerable disagreement that exists between groupings of the same set of words produced by different humans, we decided to incorporate multiple models in the evaluation. Previously, multiple models have been used indirectly to construct a single "best" or most representative model, which is then used in the evaluation [Gale et al., 1992a; Passonneau and Litman, 1993]. Although this approach reduces the problems caused by relying on a single model, it does not allow the differences between the models to be reflected in the evaluation scores. Consequently, we developed an evaluation method that uses multiple models simultaneously, directly reflects the degree of disagreement between the models in the produced scores, and automatically weighs the importance of each decision point according to the homogeneity of the answers for it in the multiple models. We have extended the information retrieval measures of precision, recall, fallout, and F-measure for this purpose; the mathematical formulation of the generalized measures is given in [Hatzivassiloglou and McKeown, 1993] and [Hatzivassiloglou, 1995a]. In the experiments reported in this chapter, we employ eight or nine human-constructed models for each adjective set. We base our comparisons on and report the generalized F-measure scores. In addition, since the correct number of groupings is something that the system cannot yet determine (and, incidentally, something that human evaluators disagree about), we run the system for the five cases in the range −2 to +2 around the average number of clusters employed by the humans and average the results. This smoothing operation prevents an accidental high or low score being reported when a small variation in the number of clusters produces very different scores.
It should be noted here that the scores reported should not be interpreted as linear percentages. The problem of interpreting the scores is exacerbated in our context because of the structural constraints imposed by the clustering and the presence of multiple models. Even the best clustering that could be produced would not receive a score of 100, because of the disagreement among humans on what is the correct answer; applying the same evaluation method to score each model constructed by humans for the three adjective sets used in this comparative study against the other human-constructed models leads to an average score of 60.44 for the human evaluators. To clarify the meaning of the scores, we accompany them with lower and upper bounds for each adjective set we examine. These bounds are obtained by the performance of a system that creates random groupings (averaged over many runs) and by the average score of the human-produced partitions when evaluated against the other human-produced models, respectively.
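The lower bound can be estimated with the pairwise scorer sketched earlier; the following illustration builds on that precision_recall_f function, fixes the number of groups, and averages the F-measure of many random partitions against a single reference model. It simplifies the chapter's actual procedure, which uses multiple models and the generalized measures.

import random

def random_baseline(words, n_groups, reference, runs=20000, seed=0):
    # Average F-measure of random partitions evaluated against one reference model.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        partition = {w: rng.randrange(n_groups) for w in words}
        total += precision_recall_f(partition, reference)[2]
    return total / runs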
3 Motivation
3.1 Applications
The output of the word grouping system that we described in the previous section is used as the basis for the further processing of the retrieved groups: the classification of groups into scalar and nonscalar ones, the identification of synonyms and antonyms within each semantic group, the labeling of words as positive or negative within a scale, and the ordering of scalar terms according to semantic strength. In this way, the grouping system is used as the first part of a larger system for corpus-based computational lexicography, which in turn produces information useful for a variety of natural language processing applications. We briefly list below some of these applications:
• The organization of words into semantic groups can be exploited in statistical language modeling, by pooling together the estimates for the various words in each group [Sadler, 1989; Hindle, 1990; Brown et al., 1992]. This approach significantly reduces the sparseness of the data, especially for low-frequency words.
• A study of medical case histories and reports has shown that frequently physicians use multiple modifiers for the same term that are incompatible (e.g., they are synonyms, contradict each other, or one represents a specialized case of the other) [Moore, 1993]. Given the technical character of these words, it is quite hard for non-specialists to edit these incorrect expressions, or even to identify such problematic cases.
But the output of the word grouping system, which identifies semantically related words, can be used to flag occurrences of incompatible modifiers.
• Knowledge of the synonyms and antonyms of particular words can be used during both understanding and generation of text. Such knowledge can help with the handling of unknown words during understanding and increase the paraphrasing power of a generation system.
• Knowledge of semantic polarity (positive or negative status with respect to a norm) can be combined with corpus-based collocation extraction tools [Smadja, 1993] to automatically produce entries for the lexical functions used in Meaning-Text Theory for text generation [Mel'cuk and Pertsov, 1987]. For example, if the collocation extraction tool identifies the phrase hearty eater as a recurrent one, then knowing that hearty is a positive term enables the assignment of hearty to the lexical function MAGN (standing for magnify), that is, MAGN(eater) = hearty.
• The relative semantic strength of scalar adjectives directly correlates with the argumentative force of the adjectives in the text. Consequently, the relative semantic strength information can be used in language understanding to properly interpret the meaning of scalar words and in generation to select the appropriate word to lexicalize a semantic concept with the desired argumentative force [Elhadad, 1991].
• Scalar words obey pragmatic constraints, for example, scalar implicature [Levinson, 1983; Hirschberg, 1985]. If the position of the word on the scale is known, the system can draw the implied pragmatic inferences during text analysis, or use them for appropriate lexical choice decisions during generation. In particular, such information can be used for the proper analysis and generation of negative expressions. For example, not hot usually means warm, but not warm usually means cold.

3.2 The Need for Automatic Methods for Word Grouping
In recent years, the importance of lexical semantic knowledge for language processing has become recognized. Some of the latest dictionaries designed for human use include explicit lexical semantic links; for example, the COBUILD dictionary [Sinclair, 1987] explicitly lists synonyms, antonyms, and superordinates for many word entries. WordNet [Miller et al., 1990] is perhaps the best-known example of a large lexical database compiled by lexicographers specifically for computational applications, and it has been used in several natural language systems (e.g., [Resnik, 1993; Resnik and Hearst, 1993; Knight and Luk, 1994; Basili et al., 1994]).
Yet, WordNet and the machine-readable versions of dictionaries and thesauri still suffer from a number of disadvantages when compared with the alternative of an automatic, corpus-based approach:
• All entries must be encoded by hand, which represents significant manual effort.
• Changes to lexical entries may necessitate the careful examination and potential revision of other related entries to maintain the consistency of the database.
• Many types of lexical semantic knowledge are not present in current dictionaries or in WordNet. Most dictionaries emphasize the syntactic features of words, such as part of speech, number, and form of complement. Even when dictionary designers try to focus on the semantic component of lexical knowledge, the results have not yet been fully satisfactory. For example, neither COBUILD nor WordNet includes information about scalar semantic strength.
• The lexical information is not specific to any domain. Rather, the entries attempt to capture what applies to the language at large, or represent specialized senses in a disjunctive manner. Note that semantic lexical knowledge is most sensitive to domain changes. Unlike syntactic constraints, semantic features tend to change as the word is used in a different way in different domains. For example, our word grouping system identified preferred as the word most closely semantically related to common; this association may seem peculiar at first glance, but is indeed a correct one for the domain of stock market reports and financial information from which the training material was collected.
• Time-varying information, that is, the currency of words, compounds, and collocations, is not adjusted automatically.
• The validity of any particular entry depends on the assumptions made by the particular lexicographer(s) who compiled that entry. In contrast, an automatic system can be more thorough and impartial, since it bases its decisions on actual examples drawn from the corpus.
An automatic corpus-based system for lexical knowledge extraction such as our word grouping system offsets these disadvantages of static human-constructed knowledge bases by automatically adapting to the domain sublanguage. Its disadvantage is that while it offers potentially higher recall, it is generally less precise than knowledge bases carefully constructed by human lexicographers. This disadvantage can be alleviated if the output of the automatic system is modified by human experts in a post-editing phase.
4 Linguistic Features and Alternative Values for Them
We have identified several sources of symbolic, linguistic knowledge that can be incorporated in the word grouping system, augmenting the basic statistical component. Each such source represents a parameter of the system, that is, a feature that can be present or absent or, more generally, take a value from a predefined set. In this section we present first one of these parameters that can take several values, namely the method of extracting data from the corpus, and then several other binary-valued features.
4.1 Extracting Data from the Corpus
When the word-clustering system partitions adjectives in groups of semantically related ones, it determines the distribution of related (modified) nouns for each adjective and eventually the similarity between adjectives from pairs of the form (adjective, modified noun) that have been observed in the corpus. Direct information about semantically unrelated adjectives (in the form of appropriate adjective-adjective pairs) can also be collected from the corpus. Therefore, a first parameter of the system and a possible dimension for comparisons is the method employed to identify such pairs in free text.
There are several alternative models for this task of data collection, with different degrees of linguistic sophistication. A first model is to use no linguistic knowledge at all:3 we collect for each adjective of interest all words that fall within a window of some predetermined size. Naturally, no negative data (adjective-adjective pairs) can be collected with this method. However, the method can be implemented easily and does not require the identification of any linguistic constraints, so it is completely general. It has been used for diverse problems such as machine translation and sense disambiguation [Gale et al., 1992b; Schütze, 1992].
A second model is to restrict the words collected to the same sentence as the adjective of interest and to the word class(es) that we expect on linguistic grounds to be relevant to adjectives. For our application, we collect all nouns in the vicinity of an adjective without leaving the current sentence. We assume that these nouns have some relationship with the adjective and that semantically different adjectives will exhibit different collections of such nouns. This model requires only part-of-speech information (to identify nouns) and a method of detecting sentence boundaries. It uses a window of fixed length to define the neighborhood of each adjective.

3. Aside from the concept of a word, which is usually approximated by defining any string of characters separated by white space or punctuation marks as a word.
Again, negative knowledge such as pairs of semantically unrelated adjectives cannot be collected with this model. Nevertheless, it has also been widely used, for example, for collocation extraction [Smadja, 1993] and sense disambiguation [Liddy and Paik, 1992].
Since we are interested in nouns modified by adjectives, a third model is to collect a noun immediately following an adjective, assuming that this implies a modification relationship. Pairs of consecutive adjectives, which are necessarily semantically unrelated, can also be collected.
Up to this point we have successively restricted the collected pairs on linguistic grounds, so that less but more accurate data are collected. For the fourth model, we extend the simple rule given above, using linguistic information to catch more valid pairs without sacrificing accuracy. We employ a pattern matcher that retrieves any sequence of one or more adjectives followed by any sequence of zero or more nouns. These sequences are then analyzed with heuristics based on linguistics to obtain pairs.
The regular expression and pattern matching rules of the previous model can be extended further, forming a grammar for the constructs of interest. This approach can detect more pairs, and at the same time address known problematic cases not detected by the previous models.
We implemented the above five data extraction models, using typical window sizes for the first two methods (50 and 5 on each side of the window respectively) which have been found appropriate for other problems before. Unfortunately, the first model proved to be excessively demanding in resources4 for our comparative experiments, so we dropped it from further consideration and use the second model as the baseline of minimal linguistic knowledge. For the fifth model, we developed a finite-state grammar for noun phrases which is able to handle both predicative and attributive modification of nouns, conjunctions of adjectives, adverbial modification of adjectives, quantifiers, and apposition of adjectives to nouns or other adjectives.5 A detailed description of this grammar and its implementation can be found in [Hatzivassiloglou, 1995b].

4. For example, 12,287,320 word pairs in a 151 MB file were extracted for the 21 adjectives in our smallest test set. Other researchers have also reported similar problems of excessive resource demands with this "collect all neighbors" model [Gale et al., 1992b].
5. For efficiency reasons we did not consider a more powerful formalism.
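To make the contrast between the extraction models concrete, here is a minimal sketch of the second and third models over a part-of-speech-tagged sentence; the tag names, the simple window logic, and the example sentence are illustrative assumptions and do not reproduce the chapter's finite-state grammar.

def nouns_in_vicinity(tagged_sentence, window=5):
    # Model 2: for each adjective, collect nouns within a fixed window,
    # without crossing the sentence boundary.
    pairs = []
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag == "ADJ":
            lo, hi = max(0, i - window), min(len(tagged_sentence), i + window + 1)
            pairs.extend((word, w) for w, t in tagged_sentence[lo:hi] if t == "NOUN")
    return pairs

def adjacent_pairs(tagged_sentence):
    # Model 3: an adjective immediately followed by a noun gives positive
    # evidence; two consecutive adjectives give negative evidence.
    positive, negative = [], []
    for (w1, t1), (w2, t2) in zip(tagged_sentence, tagged_sentence[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            positive.append((w1, w2))
        elif t1 == "ADJ" and t2 == "ADJ":
            negative.append((w1, w2))
    return positive, negative

sentence = [("strong", "ADJ"), ("quarterly", "ADJ"), ("growth", "NOUN"),
            ("in", "ADP"), ("profits", "NOUN")]
print(adjacent_pairs(sentence))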
4.2 Other Linguistic Features
In addition to the data extraction method, we identified three other areas where linguistic knowledge can be introduced in our system. First, we can employ morphology to convert plural nouns to the corresponding singular ones and adjectives in comparative or superlative degree to their base form. This conversion combines counts of similar pairs, thus raising the expected and estimated frequencies of each pair in any statistical model.
Another potential application of symbolic knowledge is the use of a spell-checking procedure to eliminate typographical errors from the corpus. We implemented this component using the UNIX spell program and associated word list, with extensions for hyphenated compounds. Unfortunately, since a fixed and domain-independent word list is used for this process, some valid but overspecialized words may be discarded too.
Finally, we have identified several potential sources of additional knowledge that can be extracted from the corpus (e.g., conjunctions of adjectives) and can supplement the primary similarity relationships. In this comparison study we implemented and considered the significance of one of these knowledge sources, namely the negative examples offered by adjective-adjective pairs where the two adjectives have been observed in a syntactic relationship that strongly indicates semantic unrelatedness.
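The effect of the morphological conversion on the statistics can be illustrated with a small sketch; the lemma dictionaries below stand in for whatever morphological analyzer is used, and the words and counts are made up for illustration.

from collections import Counter

def merge_counts(pair_counts, noun_lemma, adj_lemma):
    # Combine counts of (adjective, noun) pairs that normalize to the same
    # base forms, raising the observed frequency of each normalized pair.
    merged = Counter()
    for (adj, noun), count in pair_counts.items():
        merged[(adj_lemma.get(adj, adj), noun_lemma.get(noun, noun))] += count
    return merged

counts = Counter({("strong", "profits"): 3, ("strong", "profit"): 2,
                  ("stronger", "profit"): 1})
print(merge_counts(counts, {"profits": "profit"}, {"stronger": "strong"}))
# Counter({('strong', 'profit'): 6})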
5 The Comparison Experiments
In the previous section we identified four parameters of the system, the effects of which we want to analyze. But in addition to these parameters, which can be directly varied and have predetermined possible values, several other variables can affect the performance of the system.
First, the performance of the system depends naturally on the adjective set that is to be clustered. Presumably, variations in the adjective set can be modeled by several parameters, such as the size of the set, the number of semantic groups in it and the strength of semantic relatedness among its members, plus several parameters describing the properties of the adjectives in the set in isolation, such as frequency, specificity, etc.
A second variable that affects the clustering is the corpus that is used as the main knowledge source, through the observed co-occurrence patterns. Again, the effects of different corpora can be separated into several factors, for example, the size of the corpus, its generality, the genre of the texts, etc.
Since in these experiments we are interested in quantifying the effect of the linguistic knowledge in the system, or more precisely of the linguistic knowledge that we can explicitly control through the four parameters discussed above, we did not attempt to model in detail the various factors entering the system as a result of the choice of adjective set and corpus. However, we are interested in measuring the effects of the linguistic parameters in a wide range of contexts.
Therefore, we included in our experiment model two additional parameters, representing the corpus and the adjective set used.
We used the 1987 Wall Street Journal articles from the ACL-DCI (Association for Computational Linguistics-Data Collection Initiative) as our corpus. We selected four subcorpora to study the relationship of corpus size with linguistic feature effects: subcorpora of 330,000 words, 1 million words, 7 million words, and 21 million words (the last consisting of the entire 1987 corpus) were selected as representative. Each selected subcorpus contained the selected subcorpora of smaller sizes, and was constructed by sampling across the whole range of the entire corpus at regular intervals. Since we use subsets of the same corpus, we are essentially modeling the corpus size parameter only.
For each corpus, we analyzed three different sets of adjectives, listed in figures 4.2, 4.3, and 4.4. The first of these adjective test sets was selected from a similar corpus, contains 21 words of varying frequencies that all associate strongly with a particular noun (problem), and was analyzed in [Hatzivassiloglou and McKeown, 1993]. The second set (43 adjectives) was selected with the constraint that it contain high-frequency adjectives (more than 1000 occurrences in the 21-million-word corpus). The third set (62 adjectives) satisfies the opposite constraint, containing adjectives of relatively low frequency (between 50 and 250). Figure 4.1 shows a typical grouping found by our system for the third set of adjectives, when the entire corpus and all linguistic modules were used.
These three sets of adjectives represent various characteristics of the adjective sets that the system may be called on to cluster. First, they explicitly represent increasing sizes of the grouping problem. The second and third sets also contrast the independent frequencies of their member adjectives. Furthermore, the less frequent adjectives of the third set tend to be more specific than the more frequent ones.
antitrust, big, economic, financial, foreign, global, international, legal, little, major, mechanical, new, old, political, potential, real, serious, severe, staggering, technical, unexpected

Figure 4.2
Test set 1: Adjectives strongly associated with the word problem.

annual, big, chief, commercial, current, daily, different, difficult, easy, final, future, hard, high, important, initial, international, likely, local, low, military, modest, national, negative, net, new, next, old, past, positive, possible, pre-tax, previous, private, public, quarterly, recent, regional, senior, significant, similar, small, strong, weak

Figure 4.3
Test set 2: High-frequency adjectives.

abrupt, affluent, affordable, astonishing, brilliant, capitalist, catastrophic, chaotic, clean, clever, communist, confusing, deadly, delicate, dirty, disastrous, dismal, dry, dual, dumb, endless, energetic, exotic, fat, fatal, flexible, fragile, generous, gloomy, gradual, harmful, hazardous, hostile, impoverished, inexpensive, insufficient, leftist, lenient, meager, misleading, multiple, outrageous, plain, pleasant, prosperous, protracted, rigid, scant, slim, smart, socialist, strict, stringent, stunning, stupid, toxic, turbulent, unfriendly, unreasonable, unstable, vigorous, wet

Figure 4.4
Test set 3: Low- to medium-frequency adjectives.
The human evaluators reported that the task of classification was easier for the third set, and their models exhibited about the same degree of agreement for the second and third sets, although the third set is significantly larger.
By including the parameters "corpus size" and "adjective set," we have six parameters that we can vary in the experiments. Any remaining factors affecting the performance of the system are modeled as random noise,6 so statistical methods are used to evaluate the effects of the selected parameters. The six chosen parameters are completely orthogonal, with the exception that parameter "negative knowledge" must have the value "not used" when parameter "extraction model" has the value "nouns in vicinity." In order to avoid introducing imbalance in the experiment, we constructed a complete designed experiment [Hicks, 1982] for all the (4 × 2 − 1) × 2 × 2 × 4 × 3 = 336 valid combinations.7
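The count of valid configurations can be reproduced directly; in the small sketch below the parameter value names are abbreviated for illustration.

from itertools import product

extraction = ["nouns in vicinity", "observed pairs", "pattern matching", "parsing"]
negative = ["used", "not used"]
morphology = [True, False]
spelling = [True, False]
corpus_size = [0.33, 1, 7, 21]          # millions of words
adjective_set = [1, 2, 3]

valid = [
    combo
    for combo in product(extraction, negative, morphology, spelling,
                         corpus_size, adjective_set)
    # Negative knowledge cannot be collected with the "nouns in vicinity" model.
    if not (combo[0] == "nouns in vicinity" and combo[1] == "used")
]
print(len(valid))  # 336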
6 Experimental Results
6.1 Average Effect of Each Linguistic Parameter
Presenting the scores obtained in each of the 336 individual experiments performed, which correspond to all valid combinations of the six modeled parameters, is both too demanding in space and not especially illuminating. Instead, we present several summary measures. We measured the effect of each particular setting of each linguistic parameter of section 4 by averaging the scores obtained in all experiments where that particular parameter had that particular value. In this way, table 4.1 summarizes the differences in the performance of the system caused by each parameter. Because of the complete design of the experiment, each value in table 4.1 is obtained in runs that are identical to the runs used for estimating the other values of the same parameter except for the difference in the parameter itself.8
Table 4.1 shows that there is indeed improvement with the introduction of any of the proposed linguistic features, or with the use of a linguistically more sophisticated extraction model. To assess the statistical significance of these differences, we compared each run for a particular value of a parameter with the corresponding identical (except for that parameter) run for a different value of the parameter.

6. Including some limited in extent but truly random effects from our nondeterministic clustering algorithm.
7. Recall that a designed experiment is complete when at least one trial, or run, is performed for every valid combination of the modeled predictors.
8. The slight asymmetry in parameters "extraction model" and "negative knowledge" is accounted for by leaving out non-matching runs.
Table 4.1
Average F-measure scores for each value of each linguistic feature

Parameter                     Value               Average score
Extraction model              Parsing             30.29
                              Pattern matching    28.88
                              Observed pairs      27.87
                              Nouns in vicinity   22.36
Morphology                    Yes                 28.60
                              No                  27.53
Spell-checking                Yes                 28.12
                              No                  28.00
Use of negative knowledge     Yes                 29.40
                              No                  28.63
Each pair of values for a parameter produces in this way a set of paired observations. On each of these sets, we performed a sign test [Gibbons and Chakraborti, 1992] of the null hypothesis that there is no real difference in the system's performance between the two values, that is, that any observed difference is due to chance. We counted the number of times that the first of the two compared values led to superior performance relative to the second, distributing ties equally between the two cases as is the standard practice in classifier induction and evaluation. Under the null hypothesis, the number of times that the first value performs better follows the binomial distribution with parameter p = 0.5. Table 4.2 gives the results of these tests along with the probabilities that the same or more extreme results would be encountered by chance. We can see from the table that all types of linguistic knowledge except spell-checking have a beneficial effect that is statistically significant at, or below, the 0.1% level.

6.2 Comparison Among Linguistic Features
In order to measure the significance of the contribution of each linguistic feature relative to the other linguistic features, we fitted a linear regression model [Draper and Smith, 1981] to the data. We use the six parameters of the experiments as the predictors, and the F-measure score of the corresponding clustering9 as the response variable.

9. Averaged over five adjacent values of the number of clusters parameter, as explained in section 2.3.
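The binomial computation behind the sign test of section 6.1 can be written down directly; a minimal sketch follows, in which ties are assumed to have already been split between the two sides and the counts in the usage line are hypothetical.

from math import comb

def sign_test_p(wins, n):
    # One-sided probability, under the null hypothesis p = 0.5, of observing
    # at least `wins` successes in `n` paired comparisons.
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2.0 ** n

# Hypothetical counts: one parameter value wins 40 of 48 paired runs.
print(sign_test_p(40, 48))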
Table 4.2
Statistical analysis of the difference in performance offered by each linguistic feature
In such a model the response R is assumed to be a linear function (weighted sum) of the predictors Vi, that is,
R = β_0 + Σ_{i=1}^{n} β_i V_i        (1)
where V_i is the i-th predictor and β_i is its corresponding weight. Table 4.3 shows the weights found by the fitting process for the experimental data collected for all valid combinations of the six parameters that we model. These weights indicate by their absolute magnitude and sign how important each predictor is and whether it contributes positively or negatively to the final result.
Table 4.3
Fitted coefficients for the linear regression model that contrasts the effects of various parameters in overall system performance

Variable                                                 Weight
Intercept                                                18.7997
Corpus size (in millions of words)                        0.9417
Extraction method (pairs vs. nouns in vicinity)           5.1307
Extraction method (sequences vs. nouns in vicinity)       6.1418
Extraction method (parser vs. nouns in vicinity)          7.5423
Morphology                                                0.5371
Spell-checking                                            0.0589
Adjective set (2 vs. 1)                                   2.5996
Adjective set (3 vs. 1)                                 -11.4882
Use of negative knowledge                                 0.3838
Numerical values such as the corpus size enter equation (1) directly as predictors, so table 4.3 indicates that each additional million words of training text increases the performance of the system by 0.9417 on average. For binary features, the weights in table 4.3 indicate the increase in the system's performance when the feature is present, so the introduction of morphology improves the system's performance by 0.5371 on average. The different possible values of the categorical variables "adjective set" and "extraction model" are encoded as contrasts with a base case; the weights associated with each such value show the change in score for the indicated value in contrast to the base case (adjective set 1 and the minimal linguistic knowledge represented by extraction model "nouns in vicinity," respectively). For example, using the finite-state parser instead of the "nouns in vicinity" model improves the score by 7.5423 on average, while going from adjective set 2 to adjective set 3 decreases the score by 2.5996 + 11.4882 = 14.0878 on average. Finally, the intercept β_0 gives a baseline performance of a minimal system that uses the base case for each parameter; the effects of corpus size are to be added to this system.
From table 4.3 we can see that the data extraction model has a significant effect on the quality of the produced clustering, and among the linguistic parameters is the most important one. Increasing the size of the corpus also significantly increases the score. The adjective set that is clustered also has a major influence on the score, with rarer adjectives leading to worse clusterings. Note, however, that these are average effects, taken over a wide range of different settings for the system.
In particular, while the system produces bad partitions for adjective set 3 when the corpus is small, when the largest corpus (21 million words) is used the partitions produced for test set 3 are equal in quality or better than the partitions produced for the other two sets with the same corpus. The two linguistic features "morphology" and "negative knowledge" have less pronounced although still significant effects, while spell-checking offers minimal improvement that probably does not justify the effort of implementing the module and the cost of activating it at run-time.

6.3 Overall Effect of Linguistic Knowledge
Up to this point we have described averages of scores or of score differences, taken over many combinations of features that are orthogonal to the one studied. These averages are good for establishing the existence of a performance difference caused by the different values of each feature, across all possible combinations of the other features. They are not, however, representative of the performance of the system in a particular setting of parameters, nor are they suitable for describing the difference between features quantitatively, since they are averages taken over widely differing settings of the system's parameters. In particular, the inclusion of very small corpora drives the average scores down, as we have confirmed by computing averages separately for each value of the corpus size parameter. To give a feeling of how important the introduction of linguistic knowledge is quantitatively, we compare in table 4.4 the results obtained with the full corpus of 21 million words for the two cases of having all or none of the linguistic components active.
Table 4.4
                                     Adjective set 1     Adjective set 2     Adjective set 3
Random partitions                    9.66 (17.90%)       6.21 (9.66%)        3.80 (6.03%)
No linguistic components active      24.51 (45.41%)      38.51 (59.92%)      33.21 (52.66%)
All linguistic components active     39.06 (72.36%)      44.73 (69.60%)      46.17 (73.20%)
Humans                               53.98               64.27               63.07

Two versions of the system (with all or none of the linguistic modules active) are contrasted with the performance of a random classifier and that of the humans. The scores were obtained on the 21-million-word corpus, using a smoothing window of three adjacent values of the number of clusters parameter centered at the average value for that parameter in the human-prepared models. We also show the percentage of the score of the humans that is attained by the random classifier and each version of the system.
The scores obtained by a random system that produces partitions of the adjectives with no knowledge except the number of groups are included as a lower bound. These estimates are obtained after averaging the scores of 20,000 such random partitions for each adjective set. The average scores that each human model receives when compared with all other human models are also included, as an estimate of the maximum score that can be achieved by the system. That maximum depends on the disagreement between models for each adjective set. For these measurements we use a smaller smoothing window of size 3 instead of 5, which is fairer to the system when its performance is compared with the humans. We also give in figure 4.5 the grouping produced by the system for adjective set 3 using the entire 21-million-word corpus but without any of the linguistic modules active. This partition is to be contrasted with the one given in figure 4.1, which was produced from the same corpus and with the same number of clusters, but with all the linguistic modules active.

1. catastrophic harmful
2. dry wet
3. lenient rigid strict stringent
4. communist leftist
5. clever
6. abrupt chaotic disastrous gradual turbulent vigorous
7. affluent affordable inexpensive prosperous
8. outrageous
9. capitalist socialist
10. dismal gloomy pleasant
11. generous insufficient meager scant slim
12. delicate fragile
13. brilliant energetic
14. dual multiple stupid
15. hazardous toxic unreasonable unstable
16. plain
17. confusing
18. flexible hostile protracted unfriendly
19. endless
20. clean dirty impoverished
21. deadly fatal
22. astonishing misleading stunning
23. dumb fat smart
24. exotic

Figure 4.5
Partition with 24 clusters produced by the system for the adjective test set 3 of figure 4.4 using the entire 21-million-word corpus and no linguistic modules.
7 Cost of Incorporating the Linguistic Knowledge in the System
The cost of incorporating the linguistics-based modules in the system is not prohibitive. The effort needed to implement all the linguistic modules was about 5 person-months, in contrast with 7 person-months needed to develop the basic statistical system. Most of this time was spent in designing and implementing the finite-state grammar that is used for extracting adjective-noun and adjective-adjective pairs [Hatzivassiloglou, 1995b].
Furthermore, the run-time overhead caused by the linguistic modules is not significant. Each linguistic module takes from 1 to 7 minutes on a Sun SparcStation 10 to process a million items (words or pairs of words, as appropriate for the module), and all except the negative knowledge module need to process a corpus only once, reusing the same information for different problem instances (word sets). This should be compared to the approximately 15 minutes needed by the statistical component for grouping about 40 adjectives.
8 Generalizing to Other Applications
In section 6 we showed that the introduction of linguistic knowledge in the word grouping system results in a performance difference that is not only statistically observable but also quantitatively significant (cf. table 4.4). We believe that these positive results should also apply to other corpus-based natural language processing systems that employ statistical methods.
Many statistical approaches share the same basic methodology with our system: a set of words is preselected, related words are identified in a corpus, the frequencies of words and of pairs of related words are estimated, and a statistical model is used to make predictions for the original words. Across applications, there are differences in what words are selected, how related words are defined, and what kinds of predictions are made. Nevertheless, the basic components stay the same. For example, in the adjective grouping application the original words are the adjectives and the predictions are their groups; in machine translation, the predictions are the translations of the words in the source language text; in sense disambiguation, the predictions are the senses assigned to the words of interest; in part-of-speech tagging or in classification, the predictions are the tags or classes assigned to each word. Because of this underlying similarity, the comparative analysis presented in this chapter is relevant to all these problems.
For a concrete example, consider the case of collocation extraction that has been addressed with statistical methods in the past. Smadja [1993] describes a system that initially uses the "nouns in vicinity" extraction model to collect co-occurrence information about words, and then identifies collocations on the basis of distributional criteria. A later component filters the retrieved collocations, removing the ones where the participating words are not used consistently in the same syntactic relationship. This post-processing stage doubles the precision of the system. We believe that using from the start a more sophisticated extraction model to collect these pairs of related words will have similar positive effects. Other linguistic components, such as a morphology module that combines frequency counts, should also improve the performance of that system. In this way, we can benefit from linguistic knowledge without having to use a separate filtering process after expending the effort to collect the collocations.
Similarly, the sense disambiguation problem is typically attacked by comparing the distribution of the neighbors of a word's occurrence to prototypical distributions associated with each of the word's senses [Gale et al., 1992b; Schütze, 1992]. Usually, no explicit linguistic knowledge is used in defining these neighbors, which are taken as all words appearing within a window of fixed width centered at the word being disambiguated.10 Many words unrelated to the word of interest are collected in this way. In contrast, identifying appropriate word classes that can be expected on linguistic grounds to convey significant information about the original word should increase the performance of the disambiguation system. Such classes might be modified nouns for adjectives, nouns in a subject or object position for verbs, etc. As we have shown in section 6, less but more accurate information increases the quality of the results.
An interesting topic is the identification of parallels of the linguistic modules that have been designed with the present system in mind for these applications, at least for those modules which, unlike morphology, are not ubiquitous. Negative knowledge, for example, improves the performance of our system, supplementing the positive information provided by adjective-noun pairs. It could be useful for other systems as well if an appropriate application-dependent method of extracting such information is identified.

10. Although some researchers have used limited linguistic knowledge in selecting, classifying, and processing these neighbors; see, for example, [Hearst, 1991] and [Yarowsky, 1994].
9 Conclusions and Future Work
We have shown that all linguistic features considered in this study had a positive contribution to the performance of the system. Except for spell-checking, all these contributions were both statistically significant and large enough to make a difference in practical situations. The cost of incorporating the linguistics-based modules in the system is not prohibitive, both in terms of development time and in terms of actual run-time overhead. Furthermore, the results can be expected to generalize to a wide variety of corpus-based systems for different applications.
We should note here that in our comparative experiments we have focused on analyzing the benefits of symbolic knowledge that is readily available and can be efficiently incorporated into the system. We have avoided using lexical semantic knowledge because it is not generally available and because its use would defeat the very purpose of the word grouping system. However, on the basis of the measurable performance difference offered by the shallow linguistic knowledge we studied, it is reasonable to conjecture that deeper linguistic knowledge, if it becomes readily accessible, would probably increase the performance of a hybrid system even more.
In the future, we plan to extend the results discussed in this chapter with an analysis of the dependence of the effects of each parameter on the values of the other parameters. We are currently stratifying the experimental data obtained to study trends in the magnitude of parameter effects as other parameters vary in a controlled manner, and we will examine the interactions with corpus size and specificity of clustered adjectives. Preliminary results indicate that the importance of linguistic knowledge remains high even with large corpora, showing that we cannot offset the advantages of linguistic knowledge just by increasing the corpus size. We plan to investigate these trends and interactions with extended experiments in the future.
Acknowledgments
This work was supported jointly by the Advanced Research Projects Agency and the Office of Naval Research under grant N00014-89-J-1782, by the Office of Naval Research under grant N00014-95-1-0745, and by the National Science Foundation under grant GER-90-24069. It was performed under the auspices of the Columbia University CAT in High Performance Computing and Communications in Health Care, a New York State Center for Advanced Technology supported by the New York State Science and Technology Foundation.
Any opinions, findings, conclusions, or recommendations expressed in this publication are mine and do not necessarily reflect the views of the New York State Science and Technology Foundation. I thank Kathy McKeown, Jacques Robin, the anonymous reviewers, and the Balancing Act workshop organizers and editors of this book for providing useful comments on earlier versions of the chapter.

References
Roberto Basili, Maria Teresa Pazienza, and Paola Velardi. The Noisy Channel and the Braying Donkey. In Proceedings of the ACL Workshop The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pp. 21-28, Las Cruces, New Mexico, July 1994. Association for Computational Linguistics.
Peter F. Brown, Vincent J. della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4): 467-479, 1992.
Kenneth W. Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136-143, Austin, Texas, February 1988.
Douglas R. Cutting, Julian M. Kupiec, Jan O. Pedersen, and Penelope Sibun. A Practical Part-of-Speech Tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133-140, Trento, Italy, April 1992.
Norman R. Draper and Harry Smith. Applied Regression Analysis, 2nd edition. New York, Wiley, 1981.
Michael Elhadad. Generating Adjectives to Express the Speaker's Argumentative Intent. In Proceedings of the 9th National Conference on Artificial Intelligence (AAAI-91), pp. 98-104, Anaheim, California, July 1991. American Association for Artificial Intelligence.
William B. Frakes and Ricardo Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, N.J., Prentice Hall, 1992.
William A. Gale, Kenneth W. Church, and David Yarowsky. Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs. In Proceedings of the 30th Annual Meeting of the ACL, pp. 249-256, Newark, Del., June 1992. Association for Computational Linguistics.
William A. Gale, Kenneth W. Church, and David Yarowsky. Work on Statistical Methods for Word Sense Disambiguation. In Probabilistic Approaches to Natural Language: Papers from the 1992 Fall Symposium, pp. 54-60, Cambridge, Massachusetts, October 1992. Menlo Park, Calif., American Association for Artificial Intelligence, AAAI Press, 1992.
Jean Dickinson Gibbons and Subhabrata Chakraborti. Nonparametric Statistical Inference, 3rd edition. New York, Marcel Dekker, 1992.
Vasileios Hatzivassiloglou. Automatic Retrieval of Semantic and Scalar Word Groups from Free Text. Technical Report CUCS-018-95, New York, Columbia University, 1995.
Vasileios Hatzivassiloglou. Retrieving Adjective-Noun, Adjective-Adjective, and Adjective-Adverb Syntagmatic Relationships from Corpora: Extraction via a Finite-State Grammar, Heuristic Selection, and Morphological Processing. Technical Report CUCS-019-95, New York, Columbia University, 1995.
Vasileios Hatzivassiloglou and Kathleen McKeown. Towards the Automatic Identification of Adjectival Scales: Clustering Adjectives According to Meaning. In Proceedings of the 31st Annual Meeting of the ACL, pp. 172-182, Columbus, Ohio, June 1993. Association for Computational Linguistics.
Marti A. Hearst. Noun Homograph Disambiguation Using Local Context in Large Text Corpora. In Proceedings of the 7th Annual Conference of the University of Waterloo Centre for the New OED and Text Research: Using Corpora, Oxford, 1991.
Charles R. Hicks. Fundamental Concepts in the Design of Experiments, 3rd edition. New York, Holt, Rinehart and Winston, 1982.
Donald Hindle. Noun Classification from Predicate-Argument Structures. In Proceedings of the 28th Annual Meeting of the ACL, pp. 268-275, Pittsburgh, June 1990. Association for Computational Linguistics.
Julia B. Hirschberg. A Theory of Scalar Implicature. Ph.D. thesis, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, 1985.
Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York, Wiley, 1990.
Maurice G. Kendall. A New Measure of Rank Correlation. Biometrika, 30: 81-93, 1938.
Maurice G. Kendall. Rank Correlation Methods, 4th edition. London, Griffin, 1975.
Kevin Knight and Steve K. Luk. Building a Large-Scale Knowledge Base for Machine Translation. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), vol. 1, pp. 773-778, Seattle, July-August 1994. American Association for Artificial Intelligence.
Julian M. Kupiec. Robust Part-of-Speech Tagging Using a Hidden Markov Model. Computer Speech and Language, 6: 225-242, 1992.
Adrienne Lehrer. Semantic Fields and Lexical Structure. Amsterdam, North Holland, 1974.
Stephen C. Levinson. Pragmatics. Cambridge, England, Cambridge University Press, 1983.
Elizabeth D. Liddy and Woojin Paik. Statistically-Guided Word Sense Disambiguation. In Probabilistic Approaches to Natural Language: Papers from the 1992 Fall Symposium, pp. 98-107, Cambridge, Massachusetts, October 1992. Menlo Park, Calif., American Association for Artificial Intelligence, AAAI Press, 1992.
John Lyons. Semantics, vol. 1. Cambridge, England, Cambridge University Press, 1977.
Igor A. Mel'cuk and Nikolaj V. Pertsov. Surface Syntax of English: A Formal Model within the Meaning-Text Framework. Amsterdam, Benjamins, 1987.
George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An On-Line Lexical Database. International Journal of Lexicography (special issue), 3(4): 235-312, 1990.
Johanna D. Moore. Personal communication. June 1993.
Rebecca J. Passonneau and Diane J. Litman. Intention-Based Segmentation: Human Reliability and Correlation with Linguistic Cues. In Proceedings of the 31st Annual Meeting of the ACL, pp. 148-155, Columbus, Ohio, June 1993. Association for Computational Linguistics.
Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional Clustering of English Words. In Proceedings of the 31st Annual Meeting of the ACL, pp. 183-190, Columbus, Ohio, June 1993. Association for Computational Linguistics.
Philip Resnik. Semantic Classes and Syntactic Ambiguity. In Proceedings of the ARPA Workshop on Human Language Technology, pp. 278-283, Plainsboro, N.J., March 1993. ARPA Software and Intelligent Systems Technology Office, San Francisco, Morgan Kaufmann, 1993.
Philip Resnik and Marti A. Hearst. Structural Ambiguity and Conceptual Relations. In Proceedings of the ACL Workshop on Very Large Corpora, pp. 58-64, Columbus, Ohio, June 1993. Association for Computational Linguistics.
Victor Sadler. Working with Analogical Semantics: Disambiguation Techniques in DLT. Dordrecht, The Netherlands, Foris Publications, 1989.
Hinrich Schütze. Word Sense Disambiguation With Sublexical Representations. In Proceedings of the AAAI-92 Workshop on Statistically-Based NLP Techniques, pp. 109-113, San Jose, Calif., July 1992. American Association for Artificial Intelligence.
John M. Sinclair (editor in chief). Collins COBUILD English Language Dictionary. London, Collins, 1987.
Frank Smadja. Retrieving Collocations from Text: Xtract. Computational Linguistics, 19(1): 143-177, March 1993.
Helmuth Späth. Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Chichester, West Sussex, England, Ellis Horwood, 1985.
Jost Trier. Das sprachliche Feld. Eine Auseinandersetzung. Neue Jahrbücher für Wissenschaft und Jugendbildung, 10: 428-449, 1934.
C. J. van Rijsbergen. Information Retrieval, 2nd edition. London, Butterworths, 1979.
Alex Waibel and Kai-Fu Lee, editors. Readings in Speech Recognition. San Mateo, Calif., Morgan Kaufmann, 1990.
David Yarowsky. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the ACL, pp. 88-95, Las Cruces, N.M., June 1994. Association for Computational Linguistics.
Chapter 5 The Automatic Construction of a Symbolic Parser via Statistical Techniques
Shyam Kapur and Robin Clark
At the core of contemporary generative syntax is the premise that all languages obey a set of universal principles, and syntactic variation among languages is confined to a finite number of parameters. On this model, a child's acquisition of the syntax of his or her native language depends on identifying the correct parameter settings for that language based on observation - for example, determining whether to form questions by placing question words at the beginning of the sentence (e.g., English: Who did Mary say John saw) or leaving them syntactically in situ (e.g., Chinese: Mary said John saw who). Prevalent work on parameter setting focuses on the way that observed events in the child's input might "trigger" settings for parameters (e.g. [Manzini and Wexler, 1987]), to the exclusion of inductive or distributional analyses.
In "The Automatic Construction of a Symbolic Parser via Statistical Techniques," Kapur and Clark build on their previous work on proving learnability results in a stochastic setting [Kapur, 1991] and exploring the complexity of parameter setting in the face of realistic assumptions about how parameters interact [Clark, 1992]. Here they combine forces to present a learning model in which distributional evidence plays a critical role - while still adhering to an orthodox, symbolic view of language acquisition consistent with the Chomskian paradigm. Notably, they validate their approach by means of an implemented model, tested on naturally occurring data of the kind available to child language learners. - Eds.
1 Motivation
We report on the progress we have made toward developing a robust "self-constructing" parsing device that uses indirect negative evidence [Kapur and Bilardi, 1991] to set its parameters. Generally, by parameter we mean any point of variation between languages; that is, a property on which two languages may differ.
Thus, the relative placement of an object with respect to the verb, a determiner with respect to a noun, the difference between prepositional and postpositional languages, and the presence of long distance anaphors like Japanese "zibun" and Icelandic "sig" are all parameters. A self-constructing parsing device would be exposed to an input text consisting of simple unpreprocessed sentences. On the basis of this text, the device would induce indirect negative evidence in support of some one parsing device located in the parameter space.
The development of a self-constructing parsing system would have a number of practical and theoretical benefits. First, such a parsing device would reduce the development costs of new parsers. At the moment, grammars must be developed by hand, a technique requiring a significant investment in money and man-hours. If a basic parser could be developed automatically, costs would be reduced significantly, even if the parser required some fine-tuning after the initial automatic learning procedure. Second, a parser capable of self-modification is potentially more robust when confronted with novel or semigrammatical input. This type of parser would have applications in information retrieval as well as language instruction and grammar correction. As far as linguistic theory is concerned, the development of a parser capable of self-modification would give us considerable insight into the formal properties of complex systems as well as the twin problems of language learnability and language acquisition, the research problems that have provided the foundation of generative grammar.
Given a linguistic parameter space, the problem of locating a target language somewhere in the space on the basis of a text consisting of only grammatical sentences is far from trivial. Clark [1990, 1992] has shown that the complexity of the problem is potentially exponential because the relationship between the points of variation and the actual data can be quite indirect and tangled. Since, given n parameters, there are 2^n possible parsing devices, enumerative search through the space is clearly impossible. Because each datum may be successfully parsed by a number of different parsing devices within the space and because the surface properties of grammatical strings underdetermine the properties of the parsing device which must be fixed by the learning algorithm, standard deductive machine learning techniques are as complex as a brute enumerative search [Clark, 1992, 1994a]. In order to solve this problem, robust techniques that can rapidly eliminate inferior hypotheses must be developed.
We propose a learning procedure that unites symbolic computation with statistical tools. Historically, symbolic techniques have proved to be a versatile tool in natural language processing. These techniques have the disadvantage of being both brittle (easily broken by new input or user error) and costly (as grammars are extended to handle new constructions, development becomes more difficult owing to the complexity of rule interactions within the grammar).
Statistical techniques have the advantage of robustness, although the resulting grammars may lack the intuitive clarity found in symbolic systems. We propose to fuse the symbolic and statistical techniques, a development we view both as inevitable and welcome; the resulting system will use statistical learning techniques to output a symbolic parsing device. We view this development to provide a nice middle ground between the problems of overtraining vs. undertraining. That is, statistical approaches to learning often tend to overfit the training set of data. Symbolic approaches, on the other hand, tend to behave as though they were undertrained (breaking down on novel input) since the grammar tends to be compact. Combining statistical techniques with symbolic parsing would give the advantage of obtaining relatively compact descriptions (symbolic processing) with robustness (statistical learning) that is not overtuned to the training set.
We believe that our approach not only provides a new technique of obtaining robust parsers in natural language systems but also provides partial explanation for child language acquisition. Traditionally, in either of these separate fields of inquiry, two widely different approaches have been pursued. One of them is largely statistical and heavily data-driven; another one is largely symbolic and theory-driven. Neither approach has proved exceptionally successful in either field. Our approach not only bridges the symbolic and statistical approaches but also tries to bring closer the two disparate fields of inquiry.
We claim that the final outcome of the learning process is a grammar that is not simply some predefined template with slots that have been filled in but rather crucially a product of the process itself. The result of setting a parameter to a certain value involves not just the fixing of that parameter but also a potential reorganization of the grammar to reflect the new parameter's values. The final result must not only be any parser consistent with the parameter values but one that is also self-modifiable and furthermore one that can modify itself along one of many directions depending on the subsequent input. Exactly for this reason, the relevance of our solution to a purely engineering solution to parser building remains - the parser builder cannot simply look up the parameter values in a table. In fact, parameter setting has to be a part, even just a small part, of the parser construction process. If this were not the case, we probably would have had little difficulty in building excellent parsers for individual languages. Equally, the notion of self-modification is of enormous interest to linguistic typologists and diachronic linguists. In particular, a careful study of self-modification would place substantive limits on linguistic variation and on the ways in which languages could, in principle, change over time.
The information-theoretic analysis of linguistic variation is still in its infancy, but it promises to provide an important theoretical tool for linguists. (See [Clark and Roberts, in preparation] for applications to linguistic typology and diachronic change.)
As far as child language acquisition is concerned, viewing the parameter-setting problem in an information-theoretic light seems to be the best perspective one can put together for this problem [Clark, 1994b; Kapur and Clark, in press]. Linguistic representations carry information, universal grammar encodes information about all natural languages, and the linguistic input from the target language must carry information about the target language in some form. The task of the learner can be viewed as that of efficiently and accurately decoding the information contained in the input in order to have enough information to build the grammar for the target language.
To date, the information-theoretic principles underlying the entire process have not received adequate attention. For example, the most commonly considered learning algorithm is one that simply moves from one parameter setting to another parameter setting based only on failure in parsing. That such an algorithm is entirely implausible empirically is one issue; in addition, it can be shown that one of the fastest ways for this algorithm to converge is to take a random walk in the parameter space, which is clearly grossly inefficient. Such an approach is also inconsistent with a maxim true about much of learning: "We learn at the edge of what we already know." Furthermore, in no sense would one be able to maintain that there is a monotonic increase in the information the child has about the target language in any real sense. We know from observation and experimentation that children's learning appears to be largely monotonic and fairly uniform across children. Finally and most important, these algorithms fail to account for how certain information is necessary for children's learning to proceed from stage n to stage n + 1. Just as some background information is necessary for children's learning to proceed from stage 0 (the initial state) to stage 1, there is good reason to believe that there must be some background plus acquired information that must be crucial to take the child from stage n to stage n + 1. In the algorithms we consider, we provide arguments that the child can proceed from one stage to the next only because at the earlier stage the child has been able to acquire enough information to be able to build enough structure. This, in turn, is necessary to in fact efficiently extract further information from the input to learn further.
The restrictive learning algorithms that we consider here allow the process of information extraction from a plausible input text to be investigated in both complete formal and computational detail.
complete formal and computational detail. We hope to show that our work is leading the way to establishing precisely the information-theoretic details of the entire learning process, from what the initial state needs to be to what can be learned and how. For example, another aspect in which previous attempts have been deficient is in their varying degrees of assumptions about what information the child has access to in the initial state. We feel that the most logical approach is to assume that the child has no access to any information unless it can be argued that without some information, learning would be impossible or at least infeasible.
Some psycholinguistic consequences of our proposal appear to be empirically valid. For example, it has been observed that in relatively free word order languages such as Russian, the child first stabilizes on some word order, although not the same word order across children. Another linguistic and psycholinguistic consequence of this proposal is that there is no need to stipulate markedness or initial preset values. Extensional relationships between languages and their purported consequences, such as the Subset Principle, are irrelevant. Furthermore, triggers need not be single utterances; statistical properties of the corpus may trigger parameter values.
2 Preliminaries
In this section, we first list some parameters that give some idea of the kinds of variation between languages that we hope our system is capable of handling. We then illustrate why parameter setting is difficult by standard methods. This provides some additional explanation for the failure so far to develop a truly universal parameterized parser.
2.1 Linguistic Parameters
Naturally, a necessary preliminary to our work is to specify a set of parameters that will serve as a testing ground for the learning algorithm. This set of parameters must be embedded in a parsing system so that the learning algorithm can be tested against data sets that approximate the kind of input that parsing devices are likely to encounter in real-world applications.
Our goal, then, will be to first develop a prototype. We do not require that the prototype accept any arbitrarily selected language or that the coverage of the prototype parser be complete in any given language. Instead, we will develop a prototype with coverage that extends to some basic structures that any language learning device must account for, plus some structures that have proved difficult for various learning theories. In particular, given an already
existing parser, we will extend its coverage by parameterizing it, as described below. Our initial set of parameters will include the following points of variation:
1. Relative order of specifiers and heads: This parameter covers the placement of determiners relative to nouns, the relative position of the subject, and the placement of certain VP-modifying adverbs.
2. Relative order of heads and complements: This parameter deals with the position of objects relative to the verb (VO or OV orders) and the placement of nominal and adjectival complements, as well as the choice between prepositions and postpositions.
3. Scrambling: Some languages allow (relatively) free word order. For example, German has rules for displacing definite NPs and clauses out of their canonical positions. Japanese allows relatively free ordering of NPs and postpositional phrases so long as the verbal complex remains clause final. Other languages allow even freer word orders. We will focus on German and Japanese scrambling, bearing in mind that the model should be extendible to other types of scrambling.
4. Relative placement of negative markers and verbs: Languages vary as to where they place negative markers, like English not. English places its negative marker after the first tensed auxiliary, thus forcing do-insertion when there is no other auxiliary, whereas Italian places negation after the tensed verb. French uses discontinuous elements like ne . . . pas or ne . . . plus, which are wrapped around the tensed verb or occur as continuous elements in infinitivals. The proper treatment of negation will require several parameters, given the range of variation.
5. Root word order changes: In general, languages allow for certain word order changes in root clauses but not in embedded clauses. An example of a root word order change is subject-auxiliary inversion in English, which occurs in root questions (Did John leave? vs. *I wonder did John leave?). Another example would be inversion of the subject clitic with the tensed verb in French (Quelle pomme a-t-il mangée ["Which apple did he eat?"]) and the process of subject postposition and PP preposing in English (A man walked into the room vs. Into the room walked a man).
6. Rightward dislocation: This includes extraposition structures in English (That John is late amazes me. vs. It amazes me that John is late.), presentational there structures (A man was in the park. vs. There was a man in the park.), and stylistic inversion in French (Quelle piste Marie a-t-elle choisie?
" [ What path has Marie chosen?" ]) . Each of these constructions presentsunique problems so that the entire data set is best handled by a system of interacting parameters. 7. Wh -movement vs. wh - in situ: Languages vary in the way they encode whquestions. English obligato rily places one and only one wh phrase (e.g ., who or which picture ) in first position . In French the wh -phrase may remain in place ( in situ ) although it may also form wh questions as in English . Polish allows wh phrasesto be stacked at the beginning of the question. 8. Exceptional casemarking , structural casemarking : These parameters have little obvious effect on word order , but involve the treatment of infmitival complements . Thus , exceptional case marking and structural case marking allow " " for the generation of the order V [+tense ] is a ], where V [ +tense ] NP VP [-tense " " tensed verb and VP [-tense ] is a VP ~eaded by a verb in the infinitive . Both parameters involve the semantic relations between the NP and the infmitival VP as well as the treatment of case marking . These relations are reflected in constituent structure rather than word order and thus pose an interesting problem for the learning algorithm . 9. Raising and control : In the case of raising verbs and control verbs, the learner must correctly categorize verbs that occur in the same syntactic frame into two distinct groups based on semantic relations as reflected in the distribution of elements (e.g ., idiom chunks) around the verbs. " 10. Long - and short- distance anaphora: Short-distance anaphors, like himself " in English , must be related to a coreferential NP within a constrained " " " " local domain. Long - distance anaphors (Japanese zibun , Korean caki ) must also be related to a coreferential NP , but this NP need not be contained within the same type of local domain as in the short-distance case. The above sampling of parameters has the virtue of being both small (and therefore possible to implement relatively quickly ) and posing interesting learnability problems which will appropriately test our learning algorithm . Although the above list can be described succinctly , the set of possible targets will be large and a simple enumerative search through the possible targets will not be efficient .
2.2 Complexities of Parameter Setting
Theories based on the principles and parameters (P&P) paradigm hypothesize that languages share a central core of universal properties and that language variation can be accounted for by appeal to a finite number of points of variation, the so-called parameters. The parameters themselves may take on only a
finite number of possible values, prespecified by Universal Grammar (UG). A fully specified P&P theory would account for language acquisition by hypothesizing that the learner sets parameters to the appropriate values by monitoring the input stream for "triggering data"; triggers are sentences which cause the learner to set a particular parameter to a particular value. For example, the imperative in (1) is a trigger for the order V(erb) O(bject):
(1) Kiss grandma.
under the hypothesis that the learner analyzes grandma as the patient of kissing and is predisposed to treat patients as structural objects.
Notice that trigger-based parameter setting presupposes that for each parameter p and each value v the learner can identify the appropriate trigger in the input stream. This is the problem of trigger detection. That is, given a particular input item, the learner must be able to recognize whether or not it is a trigger and, if so, what parameter and value it is a trigger for. Similarly, the learner must be able to recognize that a particular input datum is not a trigger for a certain parameter even though it may share many properties with a trigger. In order to make the discussion more concrete, consider the following example:
(2) a. John_i thinks that Mary likes him_i.
    b. *John thinks that Mary_j likes her_j.
English allows pronouns to be coreferent with a c-commanding nominal just in case that nominal is not contained within the same local syntactic domain as the pronoun; this is a universal property of pronouns and would seem to present little problem to the learner.
Note, however, that some languages, including Chinese, Icelandic, Japanese, and Korean, allow for long-distance anaphors. These are elements which are obligatorily coreferent with another nominal in the sentence, but which may be separated from that nominal by several clause boundaries. Thus, the following example from Icelandic is grammatical even though the anaphor sig is separated from its antecedent Jon by a clause boundary [Anderson, 1986]:
(3) Jón_i segir að María elski sig_i/hann_i
    John says that Mary loves self/him
    "John says that Mary loves him."
Thus, UG includes a parameter that allows some languages to have long-distance anaphors and that, perhaps, fixes certain other properties of this class of anaphora.
Note that the example in (3) has the same structure as the pronominal example in (2a). A learner whose target is English must not take examples like (2a) as a trigger for the long-distance anaphor parameter; what prevents the learner from being deceived? Why doesn't the learner conclude that English him is comparable to Icelandic sig? We would argue that the learner is sensitive to distributional evidence. For example, the learner is aware of examples like (4):
(4) John_i likes him_j
where the pronoun is not coreferential with anything else in the sentence. The existence of (4) implies that him cannot be a pure anaphor, long-distance or otherwise. Once the learner is aware of this distributional property of him, he or she can correctly rule out (2a) as a potential trigger for the long-distance anaphor parameter.
Distributional evidence, then, is crucial for parameter setting; no theory of parameter setting can avoid statistical properties of the input text. How far can we push the statistical component of parameter setting? In this chapter, we suggest that statistically based algorithms can be exploited to set parameters involving phenomena as diverse as word order, particularly verb-second constructions, and cliticization, the difference between free pronouns and proclitics. The work reported here can be viewed as providing the basis for a theory of trigger detection; it seeks to establish a theory of the connection between the raw input text and the process of parameter setting.
3 Parameter-Setting Proposal
Let us suppose that there are n binary parameters, each of which can take one of two values ('+' or '-') in a particular natural language.¹ The core of a natural language l is uniquely defined once all the n parameters have been assigned a value. Consider a random division of the parameters into some m groups. Let us call these groups P1, P2, ..., Pm. The Parameter-Setting Machine (PSM) first goes about setting all the parameters within the first group P1 concurrently, as
1. Parameters can be looked at as fixed points of variation among languages. From a computational point of view, two different values of a parameter may simply correspond to two different bits of code in the parser. We are not committed to any particular scheme for the translation from a tuple of parameter values to the corresponding language. However, the sorts of parameters we consider have been listed in the previous section.
sketched below. After these parameters have been fixed, the machine next tries to set the parameters in group P2 in similar fashion, and so on.
1. All parameters are unset initially; that is, there are no preset values. The parser is organized to obey only the universal principles. At this stage, utterances from any possible natural language are accommodated with equal ease, but no sophisticated structure can be built.
2. Both values of each of the parameters p_i in P1 are "competing" to establish themselves.
3. Corresponding to p_i, a pair of hypotheses are generated, say H_+^i and H_-^i.
4. Next, these hypotheses are tested on the basis of input evidence.
5. If H_-^i fails or H_+^i succeeds, set p_i's value to '+'. Otherwise, set p_i's value to '-'.
3.1 Formal Analysis of the PSM
We next consider a particular instantiation of the hypotheses and their testing. The approach we have in mind involves constructing suitable window sizes during which the algorithm is sensitive to occurrence as well as non-occurrence of specific phenomena. Regular failure of a particular phenomenon to occur in a suitable window is one natural, robust kind of indirect negative evidence. For example, the pair of hypotheses may be:
1. Hypothesis H_+^i: Expect not to observe phenomena from a fixed set O_-^i of phenomena which support the parameter value '-'.
2. Hypothesis H_-^i: Expect not to observe phenomena from a fixed set O_+^i of phenomena which support the parameter value '+'.
Let w_i and k_i be two small numbers. Testing the hypothesis H_+^i involves the following procedure:
1. A window of w_i sentences is constructed and a record is kept of whether or not a phenomenon from within the set O_-^i occurred among those w_i sentences.
2. This construction of the window is repeated k_i different times, and a tally c_i is kept of the number of times the phenomena occurred at least once within the window.
3. The hypothesis H_+^i succeeds if and only if the ratio of c_i to k_i is less than 0.5.
Note that the phenomena under scrutiny are assumed to be such that the parser is always capable of analyzing (to whatever extent necessary) the input. This is because, in our view, the parser consists of a fixed, core program whose
behavior can be modified by selecting from among a finite set of "flags" (the parameters). Therefore, even if not all of the flags have been set to the correct values, the parser is such that it can at least partially represent the input. Thus, the parser is always capable of analyzing the input. Also, there is no need to explicitly store any input evidence. Suitable window sizes can be constructed during which the algorithm is sensitive to occurrence as well as non-occurrence of specific phenomena. By using windows, just the relevant bit of information from the input is extracted and maintained. (For detailed argumentation that this is a reasonable theoretical argument, see [Kapur and Bilardi, 1991; Kapur, 1993].) Note also that we have only sketched and analyzed a particular, simple version of our algorithm. In general, a whole range of window sizes may be used, and this may be governed by the degree to which the different hypotheses have earned corroboration. (For some ideas along this direction in a more general setting, see [Kapur, 1991; Kapur and Bilardi, 1992].)

3.2 Order in Which Parameters Get Set
Note that in our approach certain parameters get set more quickly than others. These are the ones that are expressed very frequently. It is possible that these parameters also make the information extraction more efficient more quickly, for example, by enabling structure building so that other parameters can be set. If our proposal is right, then, for example, the word-order parameters, which are presumably the very first ones to be set, must be set based on a very primitive parser capable of handling any natural language. At this early stage, it may be that word and utterance boundaries cannot be reliably recognized and the lexicon is quite rudimentary. Furthermore, the only accessible property in the input stream may be the linear word order.
Another particular difficulty with setting word-order parameters is that the surface order of constituents in the input does not necessarily reflect the underlying word order. For example, even though Dutch and German are SOV languages, there is a preponderance of SVO forms in the input due to the V2 (verb-second) phenomenon. The finite verb in root clauses moves to the second position, and then the first position can be occupied by the subject, objects (direct or indirect), adverbials, or prepositional phrases. As we shall see, it is important to note that if the subject is not in the first position in a V2 language, it is most likely in the first position to the right of the verb. Finally, it has been shown by Gibson and Wexler [1992] that the parameter space created by the head-direction parameters along with the V2 parameter has local maxima, that is, incorrect parameter settings from which the learner can never escape.
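To make the window-based test of section 3.1 concrete, the following is a minimal sketch, under our own assumptions about the interface: the input text is simply an iterator over sentences, and observed_in(sentence, phenomena) stands in for whatever partial analysis the learner uses to detect a phenomenon; neither is specified by the chapter, and the successive windows just consume successive stretches of the input.

```python
import itertools

def test_hypothesis(text, phenomena, window_size, repetitions, observed_in):
    """Test a hypothesis of the form 'expect NOT to observe any phenomenon
    in `phenomena` within a window of sentences'.  Returns True (the
    hypothesis succeeds) iff the phenomena showed up in fewer than half
    of the windows."""
    hits = 0
    for _ in range(repetitions):                              # k_i repetitions
        window = list(itertools.islice(text, window_size))    # w_i sentences
        if any(observed_in(s, phenomena) for s in window):
            hits += 1                                         # tally c_i
    return hits / repetitions < 0.5                           # succeed iff c_i / k_i < 0.5

def set_parameter(text, o_plus, o_minus, window_size, repetitions, observed_in):
    """Set one binary parameter.  H+ expects not to see O- (phenomena
    supporting '-'); H- expects not to see O+.  The value is '+' if H+
    succeeds or H- fails, and '-' otherwise."""
    h_plus = test_hypothesis(text, o_minus, window_size, repetitions, observed_in)
    h_minus = test_hypothesis(text, o_plus, window_size, repetitions, observed_in)
    return '+' if h_plus or not h_minus else '-'
```

No input evidence is stored beyond the per-window tallies, which is the point of the windowing scheme described above.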
3.3 Computational Analysis of the PSM
3.3.1 V2 Parameter
In this section, we summarize results we have obtained which show that word-order parameters can plausibly be set in our model.² The key concept we use is that of entropy, an information-theoretic statistical measure of the randomness of a random variable. The entropy H(X) of a random variable X, measured in bits, is H(X) = -Σ_x p(x) log p(x). To give a concrete example, the outcome of a fair coin has an entropy of -(.5 * log(.5) + .5 * log(.5)) = 1 bit. If the coin is not fair and has a .9 chance of heads and a .1 chance of tails, then the entropy is around .5 bit. There is less uncertainty with the unfair coin: it is most likely going to turn up heads. Entropy can also be thought of as the number of bits required, on average, to describe a random variable. The entropy of one variable, say X, conditioned on another, say Y, denoted H(X | Y), is a measure of how much better the first variable can be predicted when the value of the other variable is known.
We considered the possibility that by investigating the behavior of the entropy of positions in the neighborhood of verbs in a language, word order characteristics of that language may be discovered.³ For a V2 language, we expect that there will be more entropy to the left of the verb than to its right; that is, the position to the left will be less predictable than the one to the right. We first show that using a simple distributional analysis technique based on the five verbs the algorithm is assumed to know, another 15 words, most of which turn out to be verbs, can readily be obtained.
Consider the input text as generating tuples of the form (v, d, w), where v is one of the top 20 words (most of which are verbs), d is either the position to the left of the verb or the position to the right, and w is the word at that position.⁴ V, D, and W are the corresponding random variables.
2. Preliminary results obtained with Eric Brill were presented at the 1993 Georgetown Roundtable on Language and Linguistics: Presession on Corpus-based Linguistics.
3. In the competition model for language acquisition [MacWhinney, 1987], the child considers cues to determine properties of the language, but while these cues are reinforced in a statistical sense the cues themselves are not information-theoretic in the way that ours are. In some recent discussion of triggering, Niyogi and Berwick [1993] formalize parameter setting as a Markov process. Crucially, there again the statistical assumption on the input is merely used to ensure that convergence is likely, and triggers are simple sentences.
4. We thank Steve Abney for suggesting this formulation to us.
The procedure for setting the V2 parameter is the following: if H(W | V, D = left) > H(W | V, D = right) then +V2, else -V2. On each of the nine languages on which it has been possible to test our algorithm, the correct result was obtained. (Only the last three languages in table 5.1 are V2 languages.) Furthermore, in almost all cases, paired t tests showed that the results were statistically significant. The amount (only 3000 unannotated utterances) and the quality of the input (unstructured caretaker speech from the CHILDES database subcorpus [MacWhinney, 1991]), and the computational resources needed for parameter setting to succeed, are psychologically plausible. Further tests were successfully conducted in order to establish both the robustness and the simplicity of this learning algorithm. It is also clear that once the value of the V2 parameter has been correctly set, the input is far more revealing with regard to other word-order parameters, and they too can be set using similar techniques.
In order to make clear how this procedure fits into our general parameter-setting proposal, we spell out what the hypotheses are. In the case of the V2 parameter, the two hypotheses are not separately necessary since one hypothesis is the exact complement of the other. So the hypothesis H+ may be as shown.
Hypothesis H+: Expect not to observe that the entropy to the left of the verbs is lower than that to the right.
The window size that may be used could be around 300 utterances, and the number of repetitions needs to be around 10. Our previous results provide empirical support that this should suffice.

Table 5.1
The conditional entropy results

Language    H(W | V, D = left)    H(W | V, D = right)
English           4.22                  4.26
French            3.91                  5.09
Italian           4.91                  5.33
Polish            4.09                  5.78
Tamil             4.01                  5.04
Turkish           3.69                  4.91
Dutch             4.84                  3.61
Danish            4.42                  4.24
German            5.55                  4.97
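For concreteness, a minimal sketch of the entropy comparison follows, assuming the corpus has already been reduced to (v, d, w) tuples as just described; the seeding with known verbs and the extraction of the tuples are not shown, and the function names are ours rather than the authors'.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(tuples, side):
    """Estimate H(W | V, D = side), in bits, from (v, d, w) tuples."""
    by_verb = defaultdict(Counter)
    for v, d, w in tuples:
        if d == side:
            by_verb[v][w] += 1
    total = sum(sum(c.values()) for c in by_verb.values())
    h = 0.0
    for counts in by_verb.values():
        n = sum(counts.values())
        h_v = -sum((c / n) * math.log2(c / n) for c in counts.values())
        h += (n / total) * h_v        # weight H(W | V=v, D=side) by P(v | D=side)
    return h

def set_v2(tuples):
    """+V2 iff the position to the left of the verb is less predictable
    (has higher conditional entropy) than the position to the right."""
    return conditional_entropy(tuples, "left") > conditional_entropy(tuples, "right")
```

Applied to tuples of the kind described above, the decision rule is just the left-versus-right comparison whose per-language values appear in table 5.1.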
By assuming that, besides knowing a few verbs as before, the algorithm also recognizes some of the first and second person pronouns of the language, we can not only determine aspects of the pronoun system (see section 3.3.2) but also get information about the V2 parameter. The first step of learning is the same as above; that is, the learner acquires additional verbs based on distributional analysis. We expect that in the V2 languages (Dutch and German), the pronouns will appear more often immediately to the right of the verb than to the left. For French, English, and Italian, exactly the reverse is predicted. Our results (a 2-1 or better ratio in the predicted direction) confirm these predictions.⁵

3.3.2 Clitic Pronouns
We now show that our techniques can lead to straightforward identification and classification of clitic pronouns.⁶ In order to correctly set the parameters governing the syntax of pronominals, the learner must distinguish clitic pronouns from free and weak pronouns as well as sort all pronoun systems according to their proper case system (e.g., nominative pronouns, accusative pronouns). Furthermore, the learner must have some reliable method for identifying the presence of clitic pronouns in the input stream. The algorithm we report, which is also based on the observation of entropies of positions in the neighborhood of pronouns, not only distinguishes accurately between clitic and freestanding pronouns but also successfully sorts clitic pronouns into linguistically natural classes.
It is assumed that the learner knows a set of first and second person pronouns. The learning algorithm computes the entropy profile for three positions to the left and right of the pronouns (H(W | P = p) for the six different positions, where the p's are the individual pronouns). These profiles are then compared, and those pronouns that have similar profiles are clustered together. Interestingly, it turns out that the clusters are syntactically appropriate categories.
In French, for example, based on the Pearson correlation coefficients we could deduce that the object clitics me and te, the subject clitics je and tu, the non-clitics moi and toi, and the ambiguous pronouns nous and vous are each most closely related only to the other element in their own class.
5. We also verified that the object clitics in French were not primarily responsible for the correct result.
6. Preliminary results were presented at the Berne workshop on L1- and L2-acquisition of clause-internal rules: scrambling and cliticization in January 1994.
        VOUS   TOI    MOI    ME     JE     TU     TE     NOUS
VOUS    1
TOI     0.62   1
MOI     0.57   0.98   1
ME      0.86   0.24   0.17   1
JE      0.28   0.89   0.88   -0.02  1
TU      0.41   0.94   0.94   0.09   0.97   1
TE      0.88   0.39   0.30   0.95   0.16   0.24   1
NOUS    0.91   0.73   0.68   0.82   0.53   0.64   0.87   1
In fact, the entropy signature for the ambiguous pronouns can be analyzed as a mathematical combination of the signatures for the conflated forms. To distinguish clitics from non-clitics we use the measure of stickiness (the proportion of times they stick to the verbs compared to the times they are two or three positions away). These results are quite good. The stickiness is as high as 54% to 55% for the subject clitics; non-clitics have stickiness no more than 17%.
The results can be seen most dramatically if we chart the conditional entropy of positions around the pronoun in question. Figure 5.1 shows the unambiguous freestanding pronouns moi and toi. Compare moi, the freestanding pronoun, with the other first person pronouns je (the nominative clitic) and me (the non-nominative clitic). The freestanding pronoun moi is systematically less informative about its surrounding environment, corresponding to a slightly flatter curve than either je or me. This distinction in the slopes of the curves is also apparent if we compare the curve associated with toi against the curves associated with tu (nominative) and te (non-nominative) in figure 5.2; toi has the gentlest curve. This suggests that the learner could distinguish clitic pronouns from freestanding pronouns by checking for sharp drops in conditional entropy around the pronoun; clitics should stand out as having relatively sharp curves.
Note that we have three distinct curves in figure 5.3. We have already discussed the difference between clitic and freestanding pronouns. Do nominative and non-nominative clitics sort out by our method? Figure 5.3 suggests they might, since je has a sharp dip in conditional entropy to its right, whereas me has a sharp dip to its left. Consider figure 5.4, where the conditional entropy of positions around je, tu, and on have been plotted. We have
[Figure 5.1 Entropy conditioned on position: conditional entropy (in bits) of positions around the free-standing unambiguous pronouns moi and toi.]
included on with the first and second person clitics since it is often used as a first person plural pronoun in colloquial French. All three are unambiguously nominative clitic pronouns. Note that their curves are basically identical, showing a sharp dip in conditional entropy one position to the right of the clitic.
Figure 5.5 shows the non-nominative clitic pronouns me and te. Once again, the curves are essentially identical, with a dip in entropy one position to the left of the clitic. The position to the left of the clitic will tend to be part of the subject (often a clitic pronoun in the sample we considered). Nevertheless, it is
[Figure 5.2 Entropy conditioned on position: conditional entropy of positions around the second person pronouns vous, toi, tu, and te.]
clear that the learner will have evidence to partition the clitic pronouns on the basis of where the dip in entropy occurs.
Let us turn, finally, to the interesting cases of nous and vous. These pronouns are unusual in that they are ambiguous between freestanding and clitic pronouns and, furthermore, may occur as either nominative or non-nominative clitics. We would expect them, therefore, to distinguish themselves from the other pronouns. If we consider the curve associated with vous in figure 5.2, it is immediately apparent that it has a fairly gentle slope, as one would expect of a freestanding pronoun. Nevertheless, the conditional entropy of
[Figure 5.3 Entropy conditioned on position: conditional entropy of positions around the first person pronouns moi, me, and je.]
vous is rather low both to its right and its left, a property we associate with clitics; in fact, its conditional entropy is systematically lower than that of the unambiguous clitics tu and te, although this fact may be due to our sample. Figure 5.6 compares the conditional entropy of positions surrounding vous and nous. Once again, we see that nous and vous are associated with very similar curves.
Summarizing, we have seen that conditional entropy can be used to distinguish freestanding and clitic pronouns. This solves at least part of the learner's problem in that the method can form the basis for a practical algorithm for
[Figure 5.4 Entropy conditioned on position: conditional entropy of positions around the nominative clitic pronouns je, tu, and on.]
detecting the presence of clitics in the input stream. Furthermore, we have seen that conditional entropy can be used to break pronouns into further classes like nominative and non-nominative. The learner can use these calculations as a robust, noise-resistant means of setting parameters. Thus, at least part of the problem of trigger detection has been answered. The input is such that the learner can detect certain systematic cues and exploit them in determining grammatical properties of the target. At the very least, the learner could use these cues to form a "rough sketch" of the target grammar, allowing the learner to bootstrap its way to a full-fledged grammatical system.
[Figure 5.5 Entropy conditioned on position: conditional entropy of positions around the non-nominative clitic pronouns me and te.]
The Dutch clitic system is far more complicated than the French pronoun system (see, e.g., [Zwart, 1993]). Even so, our entropy calculations made some headway toward classifying the pronouns. We are able to distinguish the weak and strong subject pronouns. Since even the strong subject pronouns in Dutch tend to stick to their verbs very closely, and two clitics can come next to each other, the raw stickiness measure seems to be inappropriate. Although the Dutch case is problematic owing to the effects of V2 and scrambling, we are in the process of treating these phenomena and anticipate that the pronoun calculations in Dutch will sort out properly once the influence of these other word-order processes is factored in appropriately.
[Figure 5.6 Entropy conditioned on position: conditional entropy of positions around the ambiguous pronouns vous and nous.]
4 Conclusions
It needs to be emphasized that in our statistical procedure there is a mechanism available to the learning mechanism by which it can determine when it has seen enough input to reliably determine the value of a certain parameter. (Such means are nonexistent in any trigger-based, error-driven learning theory.) In principle at least, the learning mechanism can determine the variance in the quantity of interest as a function of the text size and then know when enough text has been seen to be sure that a certain parameter has to be set in a particular way.
We are currently extending the results we have obtained to other parameters and other languages. We are convinced that the word-order parameters [e.g., (1) and (2)] should be fairly easy to set and amenable to an information-theoretic analysis along the lines sketched earlier. Scrambling also provides a case where calculations of entropy should provide an immediate solution to the parameter-setting problem. Note, however, that both scrambling and V2 interact in an interesting way with the basic word-order parameters; a learner may potentially be misled by both scrambling and V2 into missetting the basic word-order parameters, since both parameters can alter the relationship between heads, their complements, and their specifiers. Parameters involving adverb placement, extraposition, and wh-movement should be relatively more challenging for the learning algorithm, given the relatively low frequency with which adverbs are found in adult speech to children. These cases provide good examples which motivate the use of multiple trials by the learner. The interaction between adverb placement and head movement, then, will pose an interesting problem for the learner since the two parameters are interdependent; what the learner assumes about adverb placement is contingent on what it assumes about head placement and vice versa.

Acknowledgments
We firstly thank two anonymous referees for some very useful comments. We are also indebted to Isabella Barbier, Eric Brill, Bob Frank, Aravind Joshi, Barbara Lust, and Philip Resnik, along with the audience at the Balancing Act workshop at the Annual Meeting of the Association for Computational Linguistics, for comments on various parts of this chapter.

References
Anderson, S. 1986. The typology of anaphoric dependencies: Icelandic (and other) reflexives. In L. Hellan and K. Christensen, editors, Topics in Scandinavian Syntax, pp. 65-88. Dordrecht, The Netherlands, D. Reidel.
Robin Clark. 1990. Papers on learnability and natural selection. Technical Report 1, Technical Reports in Formal and Computational Linguistics. Geneva, Université de Genève, Département de Linguistique générale et de linguistique française, Faculté des Lettres, CH-1211 Genève 4.
Robin Clark. 1992. The selection of syntactic knowledge. Language Acquisition, 2(2): 83-149.
Robin Clark. 1994a. Hypothesis formation as adaptation to an environment: Learnability and natural selection. In Barbara Lust, Magui Suñer, and Gabriella Hermon, editors, Syntactic Theory and First Language Acquisition: Crosslinguistic Perspectives. Presented at the 1992 symposium on Syntactic Theory and First Language Acquisition: Cross Linguistic Perspectives at Cornell University, Ithaca. Hillsdale, N.J., Erlbaum.
Robin Clark. 1994b. Kolmogorov complexity and the information content of parameters. Technical Report. Philadelphia, Institute for Research in Cognitive Science, University of Pennsylvania.
Robin Clark and Ian Roberts. In preparation. Complexity is the Engine of Variation. Manuscript, University of Pennsylvania, Philadelphia, and University of Wales, Bangor.
Edward Gibson and Kenneth Wexler. 1992. Triggers. Presented at GLOW. Linguistic Inquiry, 25, pp. 407-454.
Shyam Kapur. 1991. Computational Learning of Languages. Ph.D. thesis, Cornell University. Computer Science Department Technical Report 91-1234.
Shyam Kapur. 1993. How much of what? Is this what underlies parameter setting? In Proceedings of the 25th Stanford University Child Language Research Forum.
Shyam Kapur. 1994. Some applications of formal learning theory results to natural language acquisition. In Barbara Lust, Magui Suñer, and Gabriella Hermon, editors, Syntactic Theory and First Language Acquisition: Crosslinguistic Perspectives. Presented at the 1992 symposium on Syntactic Theory and First Language Acquisition: Cross Linguistic Perspectives at Cornell University. Hillsdale, N.J., Erlbaum.
Shyam Kapur and Gianfranco Bilardi. 1992. Language learning from stochastic input. In Proceedings of the Fifth Conference on Computational Learning Theory. San Mateo, Calif., Morgan Kaufmann.
Shyam Kapur and Robin Clark. In press. The automatic identification and classification of clitic pronouns. Presented at the Berne workshop on L1 and L2 Acquisition of Clause-Internal Rules: Scrambling and Cliticization, January 1994.
R. Manzini and K. Wexler. 1987. Parameters, binding theory, and learnability. Linguistic Inquiry, 18: 413-444.
Brian MacWhinney. 1987. The competition model. In Brian MacWhinney, editor, Mechanisms of Language Acquisition. Hillsdale, N.J., Erlbaum.
Brian MacWhinney. 1991. The CHILDES Project: Tools for Analyzing Talk. Hillsdale, N.J., Erlbaum.
Partha Niyogi and Robert C. Berwick. 1993. Formalizing triggers: A learning model for finite spaces. Technical Report A.I. Memo No. 1449. Cambridge, Mass., Massachusetts Institute of Technology. Also Center for Biological Computational Learning, Whitaker College Paper No. 86.
C. Jan-Wouter Zwart. 1993. Notes on clitics in Dutch. In Lars Hellan, editor, Clitics in Germanic and Slavic, pp. 119-155. Eurotyp Working Papers, Theme Group 8, vol. 4, University of Tilburg, The Netherlands.
Chapter 6
Combining Linguistic with Statistical Methods in Automatic Speech Understanding
Patti Price
Speech understanding is an application that consists of two major components: the natural language processing component, which has traditionally been based on algebraic or symbolic approaches, and the speech recognition component, which has traditionally used statistical approaches. Price reviews the culture clash that has resulted as these two areas have been linked into the larger speech understanding task. Her position is that balancing the symbolic and the statistical will yield results that neither community could achieve alone.
Price points out that the best performing speech recognition systems have been based on statistical pattern matching techniques. At the same time, the most fully developed natural language analysis systems of the 1970s and 1980s were rule-based, using symbolic logic, and often requiring large sets of handcrafted rules. When these two were put together in the late 1980s, most notably in the United States in the context of projects funded by the Defense Advanced Research Projects Agency (DARPA), the result was to have an effect on both communities. Initially, that effect tended to be the fostering of skepticism, as shown in Price's table 6.1, but increasingly the result has been a tendency to combine symbolic with statistical and engineering approaches. Price concludes her thoughtful review by presenting some of the challenges in achieving the balance and some of the compromises required by both the speech and natural language processing communities in order to reach their shared goal. -Eds.
1 Introduction: The Cultural Gap
This chapter presents an overview of automatic speech understanding techniques that combine symbolic approaches with statistical pattern matching methods. The two major component technologies in speech understanding arise from different cultural heritages: natural language (NL) understanding technology has traditionally used algebraic or symbolic approaches, and speech recognition
technology has traditionally used statistical approaches. Integration of these technologies in speech understanding requires a "balancing act" that addresses cultural and technical differences among the component technologies and their representatives.
As argued in Price and Ostendorf [1995], representatives of symbolic approaches and of approaches based on statistical pattern matching may view each other with some suspicion. Psychologists and linguists, representing symbolic approaches, may view automatic algorithms as "uninteresting collections of ad hoc ungeneralizable methods for limited domains." The automatic speech recognition community, on the other hand, may argue that automatic speech recognition should not be modeled after human speech recognition; since the tasks and goals of machines are very different from those of humans, the methods should also be different. Thus, in this view, symbolic approaches are "uninteresting collections of ad hoc ungeneralizable methods for limited domains." The same words may be used, but mean different things, as indicated in table 6.1.
It is the thesis of this chapter that balancing the symbolic and the statistical approaches can yield results that neither community alone could achieve because:
. Statistical approaches alone may tend to ignore the important fact that spoken language is a social mechanism evolved for communication among entities whose biological properties constrain the possibilities. Mechanisms that are successful for machines are likely to share many properties with those successful for people, and much of our knowledge of human properties is expressed in symbolic form.
. Symbolic techniques alone may not be powerful enough to model complex human behavior; statistical approaches have many valuable traits to be leveraged.

Table 6.1
Cross-cultural mini-lexicon
                   Linguists                                Engineers
Uninteresting      Provides no explanation of               Provides no useful applications.
                   cognitive processes.
Ad hoc             Without theoretical motivation.          Must be provided by hand.
Ungeneralizable    "Techniques that help you climb a        Expense of knowledge engineering
                   tree may not help you get to the         prohibits assessing new or more
                   moon."                                   complex domains.
After a brief historical survey (section 2), this chapter surveys the fields of speech recognition (section 3) and NL understanding (section 4) and their integration (section 5), and concludes with a discussion of current challenges (section 6).
2 Historical Considerations
Activity and results in automatic speech understanding have increased in recent years. The DARPA (Defense Advanced Research Projects Agency) program merger of two previously independent programs (speech and NL) has had a profound impact. Previously, the speech recognition program focused on the automatic transcription of speech, whereas the NL understanding program focused on interpreting the meanings of typed input.
In the DARPA speech understanding program of the 1970s (see, e.g., [Klatt, 1977]), artificial intelligence (AI) was a relatively new field full of promise. Systems were developed by separating knowledge sources along traditional linguistic divisions: for example, acoustic phonetics, phonology, morphology, lexical access, syntax, semantics, discourse. The approach was largely symbolic and algebraic; rules were devised, measurements were made, thresholds were set, and decisions resulted. A key weakness of the approach proved to be the number of modules and the decision-making process. When each module is forced to make irrevocable decisions without interaction with other modules, errors can only propagate; a seven-stage serial process in which each module is 90% accurate has an overall accuracy of less than 50%. As statistical pattern matching techniques were developed and performed significantly better than the symbolic approaches with significantly less research investment, the funding focus and the research community's activities shifted.
The differences in performance between the two approaches during the 1970s could be viewed as a lesson for both symbolic and statistical approaches: making irrevocable decisions early (before considering more knowledge sources) can severely degrade performance. Statistical models provide a convenient mechanism for such delayed decision-making, and subsequent hardware and algorithmic developments enabled the consideration of increasingly larger sets of hypotheses. Although statistical models are certainly not the only tool for investigating speech and language, they do provide several important features:
. They can be trained automatically (provided there are data), which facilitates porting to new domains and uses.
. They can provide a systematic and convenient mechanism for combining multiple knowledge sources.
. They can express the more continuous properties of speech and language (e.g., prosody, vowel changes, and other sociolinguistic processes).
. They facilitate the use of large corpora, which is important since the more abstract linguistic units are relatively rare compared to the phones modeled in speech recognition; hence large corpora are needed to provide enough
instances to be modeled.
. They provide a means for assessing incomplete knowledge.
. They can provide a means for acquiring knowledge about speech and language.
The advantages summarized above are further elaborated in Price and Ostendorf [1995]. The biggest disadvantage of statistical models may be lack of familiarity to those more comfortable with symbolic approaches. The following sections outline how cultural and technical challenges are being met through a variety of approaches to speech recognition, NL understanding, and their integration.
3 Speech Recognition Overview
For several years, the best performing speech recognition systems have been based on statistical pattern matching techniques [Pallett et al., 1990; Pallett, 1991; Pallett et al., 1992, 1993, 1994, 1995]. The most commonly used method is probably hidden Markov models (HMMs) (see, e.g., [Bahl et al., 1983; Rabiner, 1989; Picone, 1990]), although there is significant work using other pattern matching techniques (see, e.g., [Ostendorf and Roukos, 1989; Zue et al., 1992]), including neural network-based approaches (see, e.g., [Hampshire and Waibel, 1990]) and hybrid HMM-neural network approaches (see, e.g., [Abrash et al., 1994]). One can think of the symbolic components as representing our knowledge, and of the statistical components as representing our ignorance. The words, phones, and states chosen for the model are manipulated symbolically. Statistical methods are used to estimate automatically those aspects we cannot or do not want to model explicitly. Typically, development of recognition systems involves several issues. Samples are outlined below.
Feature Selection and Extraction If the raw speech waveform is simply sampled in time and amplitude, there is far too much data; some feature extraction is needed. The most common features extracted are cepstral coefficients (derived from a spectral analysis) and derivatives of these coefficients. Although there has been some incorporation of knowledge of the human auditory system into feature extraction work, little has been done since the 1970s in
implementing linguistically motivated features (e.g., high, low, front, back) in a recognition system. (See, however, the work of Ken Stevens and colleagues for significant work in this area not yet incorporated in automatic speech recognition systems [Stevens et al., 1992].) A representation of phones in terms of a small set of features has several advantages in speech recognition: fewer parameters could be better estimated given a fixed corpus; phones that are rare or unseen in the corpus could be estimated on the basis of the more frequently occurring features that compose them; and since features tend to change more slowly than phones, it is possible that sampling in time could be less frequent.
Acoustic and Phonetic Modeling A Markov model represents the probabilities of sequences of units, for example, words or sounds. The "hidden" Markov model, in addition, models the uncertainty of the current "state." By analogy with speech production, and using phones as states, the mechanism can be thought of as modeling two probabilities associated with each phone: the probability of the acoustics given the phone (to model the variability in the realization of phones), and the probability of transition to another phone given the current phone. Though some HMMs are used this way, most systems use states that are smaller than a phone (e.g., the first, middle, and last part of a phone). Such models have more parameters, and hence can provide greater detail. Adding skips and loops to the states can model the temporal variability of the realization of phones. Given the model, parameters are estimated automatically from a corpus of data. Thus, models can be "tuned" to a particular (representative) sample, an important attribute for porting to new domains.
Model Inventory Although many systems model phones, or phones conditioned on the surrounding phonetic context, others claim improved performance through the selection of units or combinations of units determined automatically or semiautomatically (see, e.g., [Bahl et al., 1991]). The SRI system combines phone models based on a hierarchy of linguistic contexts differing in detail, combined as a function of the amount of training data for each (see [Butzberger et al., 1992]).
Distributions In the HMM formulation, the state output distributions have been a topic of research interest. Generally speaking, modeling more detail improves performance, but requires more parameters to estimate, which in turn requires more data for robust estimation. Methods have been developed to reduce the number of parameters to estimate without degrading accuracy, some
of which include constraints based on phonetics. See examples in [Kimball and Ostendorf, 1993] and [Digalakis and Murveit, 1994].
Pronunciation Modeling Individual HMMs for phones can be concatenated to model words. Linguistic knowledge, perhaps in the form of a dictionary or rules, typically determines the sequence of phones that make up a word. Linguistic knowledge in the form of phonological rules can be used to model possible variations in pronunciation, such as the flap or stop realization of /t/. For computational efficiency (at the expense of storage), additional pronunciations can be added to the dictionary. This solution is not ideal for the linguist, since different pronunciations of the same word are treated as totally independent even though they may share all but one or two phones. It is also not an ideal engineering solution, since recognition accuracy may be lost depending on the implementation, since words with more pronunciations may be disfavored relative to those with few pronunciations. The work of Cohen and others (e.g., [Cohen et al., 1987]; Cohen, 1989; see also, e.g., [Withgott and Chen, 1993]) addresses some of these issues, but this area could likely benefit greatly from a better integration of symbolic knowledge with statistical models.
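To make the pronunciation-modeling discussion concrete, here is a toy sketch in which word models are built by concatenating phone models listed in a dictionary, and a simple phonological rule (flapping of /t/ between vowels) adds pronunciation variants offline. The phone symbols, the rule, and the phone_hmm placeholder are illustrative assumptions, not any particular system's inventory or architecture.

```python
# Toy pronunciation lexicon: word -> list of pronunciations (phone tuples).
LEXICON = {
    "butter": [("b", "ah", "t", "er")],
    "water":  [("w", "ao", "t", "er")],
}

VOWELS = {"ah", "ao", "er", "iy", "uw"}

def add_flap_variants(lexicon):
    """Offline phonological rule: /t/ between vowels may surface as a flap
    ('dx'), so add that realization as an additional dictionary entry."""
    for prons in lexicon.values():
        for pron in list(prons):
            for i in range(1, len(pron) - 1):
                if pron[i] == "t" and pron[i - 1] in VOWELS and pron[i + 1] in VOWELS:
                    variant = pron[:i] + ("dx",) + pron[i + 1:]
                    if variant not in prons:
                        prons.append(variant)
    return lexicon

def word_models(lexicon, phone_hmm):
    """Build word models by concatenating phone models, one per pronunciation.
    `phone_hmm` maps a phone symbol to its (separately trained) HMM."""
    return {word: [[phone_hmm(p) for p in pron] for pron in prons]
            for word, prons in lexicon.items()}
```

As the text points out, listing variants this way treats pronunciations of the same word as independent even when they share most of their phones, which is part of what motivates a tighter integration of symbolic phonological knowledge with the statistical models.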
Language Modeling Any method that can be used to constrain the sequence of occurring words can be thought of as a language model. Modeling word sequences the way word pronunciations are typically modeled (i.e., a dictionary of all possible sequences) is not a solution a linguist or an engineer would propose (except for the most constrained applications). A simple alternative is to model all words in parallel and add a loop from the end to the beginning, where one of the "words" is the end-of-sentence "word" so that the sentences are not infinitely long. Of course, this simple model has the disadvantage of assuming that the ends of all words are equivalent (the same state). This model assumes that at each point in an utterance all words are equally likely, which is not true of any human language. Alternatively, Markov models can be used to estimate the likelihoods of words given the previous word (or N words, or word classes), based on a training corpus of sentence transcriptions. Except for the intuition that some sequences are more likely than others, little linguistic knowledge is used. That intuition is difficult to call "linguistic" since, although there may be some recognition of doubtful cases, grammaticality has traditionally been a binary decision for many linguists. This will likely change as linguists begin to look at spontaneous speech data. Statistical modeling of linguistically relevant relationships (e.g., number agreement of subject and verb; or co-occurrences of adjectives with nouns, which may be an arbitrary
number of words away from each other) is a growing area of interest. For examples, see the numerous papers on this topic in the (D)ARPA, Eurospeech, and International Conference on Spoken Language Processing (ICSLP) proceedings over the past several years.
Search Given the acoustic models, the language models, and the input speech, the role of the recognizer is to search through all possible hypotheses and find the best (most likely) string of words. As the acoustic and language models become more detailed they become larger, and this can be an enormous task, even with increasing computational power. Significant effort has been spent on managing this search. Recent innovations have involved schemes for making multiple passes, using coarser models at first to narrow the search and progressively more detailed models to further narrow the pruned search space (see, e.g., [Murveit et al., 1993; Nguyen et al., 1993]). Typically, more extensive linguistic knowledge is more expensive to compute and is saved for later stages. The "N-best" approaches used for integration of speech and natural language (see section 5) have also been used to improve speech recognition.
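A minimal sketch of the N-gram language model described above (a bigram here), estimated from sentence transcriptions; smoothing, word classes, and the combination with acoustic scores in the search are omitted, and the function names are illustrative rather than taken from any particular system.

```python
import math
from collections import Counter, defaultdict

def train_bigram(transcriptions):
    """Estimate P(w_i | w_{i-1}) from sentence transcriptions (token lists)."""
    counts = defaultdict(Counter)
    for sentence in transcriptions:
        words = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

def log_prob(model, sentence, floor=1e-10):
    """Log probability of a word string under the bigram model; unseen
    bigrams get a crude floor rather than proper smoothing."""
    words = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(model.get(prev, {}).get(word, floor))
               for prev, word in zip(words, words[1:]))
```

In a recognizer, such a score would be combined with the acoustic model scores during the search over word-string hypotheses.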
4 Natural Language Understanding
Traditional approaches to NL understanding have been based in symbolic logic, using rule-based approaches typically involving large sets of handcrafted rules. However, since the first joint meeting of the speech and NL communities in 1989, the number of papers and the range of topics addressed using statistical methods have steadily increased. At the last two meetings, the category of statistical language modeling and methods received the most abstracts and was one of the most popular sessions.
In the merger of speech with NL, the traditional computational linguistic approach of covering a set of linguistically interesting examples was put to a severe test in the attempt to cover, in a limited domain, a set of utterances produced by people engaged in problem-solving tasks. Several new sources of complexity were introduced: the move to an empirically based approach (covering a seemingly endless number of "simple" things became more important than covering the "interesting," but more rare, complex phenomena), the separation of test and training materials (adding rules to cover phenomena observed in the training corpus may or may not affect coverage on an independent test corpus), the nature of spontaneous speech (which has a different, and perhaps more creative, structure than written language, previously the focus of much NL work), and recovery from errors that can occur in recognition or from the fact
that talkers do not always produce perfectly fluent, well-formed utterances. Many of the advantages of statistical approaches (as outlined above) are appropriate for dealing with these issues. The growing tendency to combine symbolic with statistical and engineering approaches, based on recent papers, is represented in several research areas, described below.
Lexicon Although speech recognition components usually use a lexicon, lexical tools in NL are more complex than lists of words and pronunciations. Different formalisms store different types and formats of information, including, for example, morphological derivations, part-of-speech information, and syntactic and semantic constraints on combinations with other words. Recently, there has been work in using statistical information in lexical work. See, for example, the use of sense frequencies for word sense disambiguation in [Miller et al., 1994].
Grammar An NL grammar has traditionally been a set of rules devised by observation of or intuitions concerning patterns in a language or sublanguage. Typically, such grammars have either accepted a sentence or rejected it, although grammars that degrade more gracefully in the face of spontaneous speech and recognition errors are being developed (see, e.g., [Hindle, 1992]). Based on the grammar used, the goal of parsing is to retrieve or assign a structure to a string of words for use by a later stage of processing. Traditionally, parsers have worked deterministically on a single string of input. When parsers were faced with typed input, aside from the occasional typographical error, the intended words were not in doubt. The merger of NL with speech recognition has forced NL components to consider speech disfluencies, novel syntactic constructions, and recognition errors. The indeterminacy of the input and the need to analyze various types of ill-formed input have led to an increased use of statistical methods. The (D)ARPA, Eurospeech, and ICSLP proceedings of recent years contain several examples of combining linguistic and statistical components in grammars, parsers, and part-of-speech taggers.
Interpretation Interpretation is the stage at which a representation of meaning is constructed, and may occur at different stages in different systems. Of course, this representation is not of much use without a "back-end" that can use the representation to perform an appropriate response, for example, retrieve a set of data from a database, ask for more information, etc. This stage is typically purely symbolic, though likelihoods or scores of plausibility may be used. See also the work on sense disambiguation mentioned above. Some work has
Grammar  An NL grammar has traditionally been a set of rules devised by observation of or intuitions concerning patterns in a language or sublanguage. Typically, such grammars have either accepted a sentence or rejected it, although grammars that degrade more gracefully in the face of spontaneous speech and recognition errors are being developed (see, e.g., [Hindle, 1992]). Based on the grammar used, the goal of parsing is to retrieve or assign a structure to a string of words for use by a later stage of processing. Traditionally, parsers have worked deterministically on a single string of input. When parsers were faced with typed input, aside from the occasional typographical error, the intended words were not in doubt. The merger of NL with speech recognition has forced NL components to consider speech disfluencies, novel syntactic constructions, and recognition errors. The indeterminacy of the input and the need to analyze various types of ill-formed input have led to an increased use of statistical methods. The (D)ARPA, Eurospeech, and ICSLP proceedings of recent years contain several examples of combining linguistic and statistical components in grammars, parsers, and part-of-speech taggers.

Interpretation  Interpretation is the stage at which a representation of meaning is constructed, and may occur at different stages in different systems. Of course, this representation is not of much use without a "back-end" that can use the representation to perform an appropriate response, for example, retrieve a set of data from a database, ask for more information, etc. This stage is typically purely symbolic, though likelihoods or scores of plausibility may be used. See also the work on sense disambiguation mentioned above. Some work has been devoted to probabilistic semantic grammars (see [Seneff, 1992]) and to "hidden" understanding (see [Miller et al., 1995]).

5 Integration of Speech Recognition and Natural Language Understanding

The integration of speech with NL has several important advantages: (1) to NL understanding, speech recognition can bring prosodic information, information important for syntax and semantics but not well represented in text; (2) NL can bring to speech recognition several knowledge sources (e.g., syntax and semantics) not previously used (N-grams model only local constraints, and largely ignore systematic constraints such as number agreement); (3) for both, the integration affords the possibility of many more applications than could otherwise be envisioned, and the acquisition of new techniques and knowledge bases not previously represented.

Although there are many advantages, integration of speech and NL gives rise to some new challenges, including integration strategies, the effective use in NL of a new source of information from speech (prosody, in particular), and the handling of spontaneous speech effects. Prosody and disfluencies are especially important issues in the integration of speech and NL since the evidence for them is distributed throughout all linguistic levels, from phonetic to at least the syntactic and semantic levels. Integration strategies, prosody, and disfluencies are described briefly below (an elaboration appears in [Price, 1995]).

Integration  There is much evidence that human speech understanding involves the integration of a great variety of knowledge sources, and in speech recognition tighter integration of components has consistently led to improved performance. However, as grammatical coverage increases, standard NL techniques can become computationally difficult and provide less constraint for speech. On the other hand, a simple integration by concatenation is suboptimal because any speech recognition errors are propagated to the NL system and the speech system cannot take advantage of the NL knowledge sources. In the face of cultural and technical difficulties with tight integration and the limitations of a simple concatenation, "N-best" integration has become popular: the connection between speech and NL can be strictly serial, but fragility problems are mitigated by the fact that speech outputs not one but many hypotheses. The NL component can then use other knowledge sources to determine the best-scoring hypothesis. The (D)ARPA, Eurospeech, and ICSLP proceedings over the past several years contain several examples of the N-best approach.
In addition, the special issue of Speech Communication on spoken dialogue [Shirai and Furui, 1994] contains several contributions on this topic.
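A minimal sketch of N-best rescoring follows, assuming a recognizer that returns hypotheses with acoustic log scores and an NL component that supplies its own log score; the function names, weights, and the toy agreement penalty are illustrative, not taken from any particular system.

def nbest_rescore(hypotheses, nl_score, acoustic_weight=1.0, nl_weight=1.0):
    # hypotheses: list of (word_string, acoustic_log_score) pairs from the recognizer.
    # nl_score: function mapping a word string to a log score from the NL component.
    rescored = []
    for words, acoustic in hypotheses:
        combined = acoustic_weight * acoustic + nl_weight * nl_score(words)
        rescored.append((combined, words))
    rescored.sort(reverse=True)              # best-scoring hypothesis first
    return [words for _, words in rescored]

# Toy NL score that penalizes a number-agreement violation, the kind of
# constraint an N-gram model largely ignores.
def toy_nl_score(words):
    return -5.0 if "flights leaves" in words else 0.0

nbest = [("show me flights leaves boston", -10.2),
         ("show me flights leaving boston", -10.9)]
print(nbest_rescore(nbest, toy_nl_score))    # NL knowledge promotes the second hypothesis

The connection between the two components remains strictly serial, but because the recognizer passes along several hypotheses rather than one, the NL knowledge sources still get a chance to override an acoustically preferred but linguistically implausible string.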
Prosody  Prosody can be defined as the suprasegmental information in speech; that is, information that cannot be localized to a specific sound segment, or information that does not change the segmental identity of speech segments. Prosodic information is not generally available in text-based systems, except insofar as punctuation may indicate some prosodic information. Prosody can provide information about syntactic structure, discourse, and emotion and attitude. A survey of combining statistical with linguistic methods in prosody appears in [Price and Ostendorf, 1995].

Spontaneous Speech  The same acoustic attributes that indicate much of the prosodic structure (pitch and duration patterns) are also very common in aspects of spontaneous speech that seem to be more related to the speech planning process than to the structure of the utterance. Disfluencies are common in normal speech. However, modeling of speech disfluencies is only beginning (see [Shriberg et al., 1992; Lickley, 1994; Shriberg, 1994]). The distribution of disfluencies is not random, and may be a part of the communication itself. Although disfluencies tend to be less frequent in human-computer interactions than in human-human interactions, as people become increasingly comfortable with human-computer interactions and concentrate more on the task at hand than on monitoring their speech, disfluencies can be expected to increase.
6 Current Challenges

Although progress has been made in recent years in balancing symbolic with statistical methods in speech and language research, important challenges remain. A few of the challenges for speech recognition, for NL, and for their integration are outlined below.
6.1 Speech Recognition Challenges

Some of our knowledge, perhaps much of our knowledge, about speech has not been incorporated in automatic speech recognition systems. For example, the notion of a prototype and distance from a prototype (see, e.g., [Massaro, 1987; Kuhl, 1990]), which seems to explain much data from speech perception (and other areas of perception), is not well modeled in the current speech recognition frameworks. A person who has not been well understood tends to change his or her speech style so as to be better understood.
This may involve speaking more loudly or more clearly, changing the phrasing, or perhaps even leaving pauses between words. These changes may help in human-human communication, but in typical human-machine interactions, they result in forms that are more difficult for the machine to interpret. The concept of a prototype in machine recognition could lead to more robust recognition technology. That is, the maximum-likelihood approaches common in statistical methods for speech recognition miss a crucial aspect of language: the role of contrast. A given linguistic entity (e.g., a phone) is characterized not just by what it is but also by what it is not, that is, the system of contrast in which it is involved. Thus, hyperarticulation may aid communication over noisy telephone lines for humans, but may decrease the performance of recognizers trained on a corpus in which this style of speech is rare or missing. The results can be disastrous for applications, since when a recognizer misrecognizes, a common reaction is to hyperarticulate ([Shriberg et al., 1992]).

Although many factors affect how well a system will perform, examining recent benchmark evaluations can give an idea of the relative difficulty of various aspects of speech (see, e.g., [Pallett et al., 1995]). Such areas might be able to take advantage of increased linguistic knowledge. For example, the variance across the talkers used in the test set was greater than the variance across the systems tested. Further, the various systems tested had the highest error rates for the same three talkers, who were the fastest talkers in the set. These observations could be taken as evidence that variability in pronunciation, at least insofar as fast speech is concerned, is not yet well modeled.
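To make the prototype-and-distance idea discussed at the start of this subsection concrete, here is a minimal sketch under invented assumptions (the two-dimensional "formant" values and phone labels are hypothetical, and this is not a recognizer design proposed in the chapter): a token is classified by its distance to each stored prototype, which also exposes the system of contrasts rather than only a single best label.

import math

# Hypothetical two-dimensional acoustic prototypes (e.g., first two formants in Hz).
prototypes = {
    "iy": (270.0, 2290.0),
    "aa": (730.0, 1090.0),
    "uw": (300.0, 870.0),
}

def classify(token):
    # Return the distance to every prototype along with the best label, so a
    # caller can see how contrastive (or how marginal) the decision was.
    distances = {label: math.dist(token, proto) for label, proto in prototypes.items()}
    best = min(distances, key=distances.get)
    return best, distances

label, distances = classify((300.0, 2200.0))
print(label, distances)    # 'iy' is nearest; the margins show the contrasts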
6.2 Natural Language Challenges

Results in NL understanding have been more resistant to quantification than those in speech recognition; people agree more on what string of words was said than on what those words mean. Evaluation is important to scientific progress, but how do we evaluate an understanding system if we are unable to agree on what it means to understand? In the DARPA community, this question has been postponed somewhat by agreeing to evaluate on the answer returned from a database. Trained annotators examine the string of words (the NL input) and use a database extraction tool to extract the minimum and maximum accepted sets of tuples from the evaluation database. A "comparator" then automatically determines whether a given answer is within the minimum and maximum allowed. The community is not, however, content with the current expense and limitations of the evaluation method described above, and is investing significant resources in finding a better solution. Key to much of the debate is the cultural gap: engineers are uncomfortable with evaluation measures that cannot be automated (forgetting the role of the annotator in the current process); linguists are uncomfortable with evaluations that are not diagnostic; and, of course, neither side wants significant resources to go to evaluation that would otherwise go to research.
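As a minimal sketch of the comparator idea described above (the tuple format is invented for illustration and is not the actual DARPA evaluation format), an answer is accepted if it contains at least the minimum required tuples and nothing outside the maximum allowed set.

def answer_is_correct(answer_tuples, minimum_tuples, maximum_tuples):
    # Accept the answer if it covers everything that must be returned and
    # includes nothing beyond what may be returned.
    answer = set(answer_tuples)
    return minimum_tuples <= answer <= maximum_tuples

# Hypothetical flight-database tuples.
minimum = {("AA101", "BOS", "SFO")}
maximum = {("AA101", "BOS", "SFO"), ("AA101", "BOS", "SFO", "0800")}

print(answer_is_correct({("AA101", "BOS", "SFO")}, minimum, maximum))   # True
print(answer_is_correct({("UA2", "BOS", "DEN")}, minimum, maximum))     # False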
6.3 Integration Challenges

In fact, most of this chapter has addressed the challenge of integrating speech with NL, and much of the challenge has been argued to be related to cultural differences as much as to technical demands. As argued in [Price and Ostendorf, 1995], the increasingly popular classification and regression trees, or decision trees (see, e.g., [Breiman et al., 1984]), appear to be a particularly useful tool in bridging the cultural and technical gap in question. In this formalism, the speech researcher or linguist can spe