RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING
AMSTERDAM STUDIES IN THE THEORY AND HISTORY OF LINGUISTIC SCIENCE General Editor E.F. KONRAD KOERNER (University of Ottawa) Series IV - CURRENT ISSUES IN LINGUISTIC THEORY
Advisory Editorial Board Henning Andersen (Los Angeles); Raimo Anttila (Los Angeles) Thomas V. Gamkrelidze (Tbilisi); John E. Joseph (Edinburgh) Hans-Heinrich Lieb (Berlin); Ernst Pulgram (Ann Arbor, Mich.) E. Wyn Roberts (Vancouver, B.C.); Danny Steinberg (Tokyo)
Volume 136
Ruslan Mitkov and Nicolas Nicolov (eds) Recent Advances in Natural Language Processing
RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING SELECTED PAPERS FROM RANLP'95
Edited by
RUSLAN MITKOV University of Wolverhampton
NICOLAS NICOLOV University of Edinburgh
JOHN BENJAMINS PUBLISHING COMPANY AMSTERDAM/PHILADELPHIA
The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences — Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Library of Congress Cataloging-in-Publication Data Recent advances in natural language processing : selected papers from RANLP'95 / edited by Ruslan Mitkov and Nicolas Nicolov. p. cm. - (Amsterdam studies in the theory and history of linguistic science. Series IV, Current issues in linguistic theory, ISSN 0304-0763 ; v. 136) Includes bibliographical references and index. 1. Computational linguistics-Congresses. I. Mitkov, Ruslan. II. Nicolov, Nicolas. III. International Conference on Recent Advances in Natural Language Processing (1st : 1995 : Tsigov Chark, Bulgaria) IV. Series: Amsterdam studies in the theory and history of linguistic science. Series IV, Current issues in linguistic theory : v. 136. P98.R44 1997 410'.285-dc21 97-38873 ISBN 90 272 3640 2 (Eur.) / 1-55619-591-5 (US) (alk. paper) CIP © Copyright 1997 - John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher. John Benjamins Publishing Co. · P.O.Box 75577 · 1070 AN Amsterdam · The Netherlands John Benjamins North America · P.O.Box 27519 · Philadelphia PA 19118-0519 · USA
TABLE OF CONTENTS Editors' Foreword
ix
I. MORPHOLOGY AND SYNTAX Aravind K. Joshi Some linguistic, computational and statistical implications of lexicalised grammars
3
Allan Ramsay & Reinhard Schäler Case and word order in English and German
15
Khalil Sima'an An optimised algorithm for data oriented parsing
35
Marcel Cori, Michel de Fornel & Jean-Marie Marandin Parsing repairs
47
Matthew F. Hurst Parsing for targeted errors in controlled languages
59
Ismail Biskri & Jean-Pierre Desclès Applicative and combinatory categorial grammar (from syntax to functional semantics)
71
Udo Hahn & Michael Strube PARSETALK about textual ellipsis
85
Iñaki Alegria, Xabier Artola & Kepa Sarasola Improving a robust morphological analyser using lexical transducers
97
II. SEMANTICS AND DISAMBIGUATION Hideki Kozima & Akira Ito Context-sensitive word distance by adaptive scaling of a semantic space
111
M. Victoria Arranz, Ian Radford, Sofia Ananiadou & Jun-ichi Tsujii Towards a sublanguage-based semantic clustering algorithm
125
Roberto Basili, Michelangelo Della Rocca, Maria Teresa Pazienza & Paola Velardi Contexts and categories: tuning a general purpose verb classification to sublanguages
137
Akito Nagai, Yasushi Ishikawa & Kunio Nakajima Concept-driven search algorithm incorporating semantic interpretation and speech recognition
149
Eneko Agirre & German Rigau A proposal for word sense disambiguation using conceptual distance
161
Olivier Ferret & Brigitte Grau An episodic memory for understanding and learning
173
Christian Boitet & Mutsuko Tomokiyo Ambiguities and ambiguity labelling: towards ambiguity data bases
185
III. DISCOURSE Małgorzata E. Stys & Stefan S. Zemke Incorporating discourse aspects in English - Polish MT
213
Ruslan Mitkov Two engines are better than one: generating more power and confidence in the search for the antecedent
225
Tadashi Nomoto Effects of grammatical annotation on a topic identification task
235
Wiebke Ramm Discourse constraints on theme selection
247
Geert-Jan M. Kruijff & Jan Schaake Discerning relevant information in discourses using TFA
259
IV. GENERATION Nicolas Nicolov, Chris Mellish & Graeme Ritchie Approximate chart generation from non-hierarchical representations
273
Christer Samuelsson Example-based optimisation of surface-generation tables
295
Michael Zock Sentence generation by pattern matching: the problem of syntactic choice
317
Ching-Long Yeh & Chris Mellish An empirical study on the generation of descriptions for nominal anaphors in Chinese
353
Kalina Bontcheva Generation of multilingual explanations from conceptual graphs
365
V. CORPUS PROCESSING AND APPLICATIONS Jun'ichi Tsujii Machine Translation: productivity and conventionality of language
377
Ye-Yi Wang & Alex Waibel Connectionist F-structure transfer
393
Yuji Matsumoto & Mihoko Kitamura Acquisition of translation rules from parallel corpora
405
Harris V. Papageorgiou Clause recognition in the framework of alignment
417
Daniel B. Jones & Harold Somers Bilingual vocabulary estimation from noisy parallel corpora using variable bag estimation
427
Jung Ho Shin, Young S. Han & Key-Sun Choi A HMM part-of-speech tagger for Korean with wordphrasal relations
439
Ivan Bretan, Måns Engstedt & Björn Gambäck A multimodal environment for telecommunication specifications
451
List and Addresses of Contributors
463
Index of Subjects and Terms
469
Editors' Foreword
This volume brings together revised versions of a selection of papers presented at the First International Conference on "Recent Advances in Natural Language Processing" (RANLP'95) held in Tzigov Chark, Bulgaria, 14-16 September 1995. The aim of the conference was to give researchers the opportunity to present new results in Natural Language Processing (NLP) based on modern theories and methodologies. Alternative techniques to mainstream symbolic NLP, such as analogy-based, statistical and connectionist approaches, were also covered.
It would not be too much to say that this conference was the most significant NLP event to have taken place in Eastern Europe since COLING'82 was held in Prague and COLING'88 in Budapest, and one of the most important conferences in NLP for 1995. The conference received submissions from more than 30 countries. Whilst we were delighted to have so many contributions, restrictions on the number of papers which could be presented forced us to be more selective than we would have liked. From the 48 papers presented at RANLP'95 we have selected the best for this book, in the hope that they reflect the most significant and promising trends (and successful results) in NLP.
The book is organised thematically. In order to allow for easier access, we have grouped the contributions according to the traditional topics found in Natural Language Processing, namely, morphology, syntax, grammars, parsing, semantics, discourse, generation, machine translation, corpus processing, and multimedia. Clearly, some papers lie at the intersection of various areas. To help the reader find his/her way we have added an index which contains major terms used in NLP. We have also included a list and addresses of contributors.
We believe that this book will be of interest to researchers, lecturers and graduate students interested in Natural Language Processing and, more specifically, to those who work in Computational Linguistics, Corpus Linguistics, and Machine Translation.
Given the success of the 1995 Conference, it has been decided that "Recent Advances in Natural Language Processing" will be the first in a series of conferences to be held biennially, the next being scheduled for 1997 (11-13 September 1997).
We would like to thank all members of the Program Committee. Without them the conference, although well organised (special thanks to Victoria Arranz), would not have had an impact on the development of NLP. Together they have ensured that the best papers were included in the final proceedings and have provided invaluable comments for the authors, so that the papers are 'state of the art'. The following is a list of those who participated in the selection process and to whom a public acknowledgement is due: Branimir Boguraev Christian Boitet Eugene Charniak Key-Sun Choi Jean-Pierre Desclès Anne DeRoeck Rodolfo Delmonte Steve Finch Eva Hajičová Johann Haller Paul Jacobs Aravind Joshi Lauri Karttunen Martin Kay Richard Kittredge Karen Kukich Josef Mariani Carlos Martin-Vide Yuji Matsumoto Kathleen McKeown Ruslan Mitkov Nicolas Nicolov Sergei Nirenburg Manfred Pinkal Allan Ramsay Harold Somers Pieter Seuren Oliviero Stock Benjamin T'sou Jun-ichi Tsujii Dan Tufis David Yarowsky Michael Zock
(Apple Computer, Cupertino) (IMAG, Grenoble) (Brown University) (KAIST, Taejon) (Université de la Sorbonne-Paris) (University of Essex) (University of Venice) (University of Edinburgh) (Charles University, Prague) (IAI, Saarbrücken) (SRA, Arlington) (University of Pennsylvania) (Xerox Grenoble) (Xerox, Palo Alto) (University of Montreal) (Bellcore, Morristown) (LIMSI-CNRS, Orsay) (University Rovira і Virgili) (Nara Institute of Science and Technology) (Columbia University) (IAI/Institute of Mathematics) (University of Edinburgh) (New Mexico State University) (University of Saarland, Saarbrücken) (University College Dublin) (UMIST, Manchester) (University of Nijmegen) (IRST, Trento) (City Polytechnic of Hong Kong) (UMIST, Manchester) (Romanian Academy of Sciences) (University of Pennsylvania) (LISMI-CNRS, Orsay)
Special thanks must go to: Steve Finch, Günter Görz, Dan Tufis, David Yarowsky, and Michael Zock who reviewed more papers than anyone else and who provided substantial comments. The conference grew out of an idea proposed by Ruslan Mitkov which we discussed at the international summer school "Contemporary Topics in Computational Linguistics" in 1994 (the summer school has taken place annually in Bulgaria since 1989). Among those who supported the idea at the time and encouraged us to organise RANLP'95 were Harold Somers, Michael Zock, Manfred Kudlek, and Richard Kittredge. We would like to acknowledge the unstinting help received from our series editor, Konrad Koerner, and from Ms Anke de Looper of John Benjamins in Amsterdam. Without them this book would not have been a viable project. Thank you both for the numerous clarifications and your constant encouragement! Nicolas Nicolov produced the typesetting code for the book, utilising the TeX system with the LaTeX2ε package. The technical support from the Department of Artificial Intelligence at the University of Edinburgh is gratefully acknowledged. May 1997
Ruslan Mitkov Nicolas Nicolov
I. MORPHOLOGY AND SYNTAX
Some Linguistic, Computational and Statistical Implications of Lexicalised Grammars
ARAVIND K. JOSHI
University of Pennsylvania
Abstract
In this paper we discuss some linguistic, computational and statistical aspects of lexicalised grammars, in particular the Lexicalised Tree-Adjoining Grammar (LTAG). Some key properties of LTAG, in particular, the extended domain of locality and the factoring of recursion from the domain of dependencies, are described together with their statistical implications. The paper introduces a technique called supertag disambiguation based on LTAG trees. This technique and an explanation-based learning technique lead to 'almost' parsing, i.e., a parsed output where the correct lexical trees have been assigned, but the features have not been checked. Some recent work on relating LTAGs to categorial grammars based on partial proof trees is also discussed.
1 Lexicalisation
A grammar G is said to be lexicalised if it consists of: • a finite set of structures (strings, trees, dags, for example), each structure being associated with a lexical item, called its 'anchor', and • a finite set of operations for composing these structures. A grammar G is said to strongly lexicalise another grammar G' if G is a lexicalised grammar and if the structured descriptions (e.g., trees) of G and G' are exactly the same (cf. Schabes, Abeille & Joshi 1988). The following results are easily established according to Joshi & Schabes (1992): • CFGs cannot strongly lexicalise CFGs. Although for every CFG there is an equivalent CFG in the Greibach Normal Form (GNF), it only weakly lexicalises the given CFG as only a weak equivalence is guaranteed by GNF. • Tree Substitution Grammars (TSG), i.e., grammars with a finite set of lexically anchored trees together with the operation of substitution, cannot strongly lexicalise CFGs.
• TSGs with substitution and another operation called adjoining can strongly lexicalise CFGs. These grammars are exactly LTAGs. Thus LTAGs strongly lexicalise CFGs.
These results show how LTAGs arise naturally in the course of strong lexicalisation of CFGs. Strong lexicalisation is achieved by working with trees rather than strings, hence the property Extended Domain of Locality (EDL), and by introducing adjoining, which results in the property Factoring Recursion from the Domain of Dependencies (FRD). Thus both EDL and FRD are crucial for strong lexicalisation.
2 Lexicalised Tree-Adjoining Grammar
Lexicalised Tree-Adjoining Grammar (LTAG) consists of elementary trees, with each elementary tree anchored on a lexical item on its frontier. An elementary tree serves as a complex description of the anchor and provides a domain of locality over which the anchor can specify syntactic and semantic (predicate-argument) constraints. Elementary trees are of two kinds: (i) initial trees and (ii) auxiliary trees. Nodes on the frontier of initial trees are substitution sites. Exactly one node on the frontier of an auxiliary tree, whose label matches the label of the root of the tree, is marked as a foot node; the other nodes on the frontier of an auxiliary tree are marked as substitution sites. Elementary trees are combined by substitution and adjunction operations. Each node of an elementary tree is associated with the top and the bottom feature structures (FS). The bottom FS contains information relating to the subtree rooted at the node, and the top FS contains information relating to the supertree at that node. The features may get their values from three different sources, such as the morphology of the anchor, the structure of the tree itself, or by unification during the derivation process. FS are manipulated by substitution and adjunction. The result of combining the elementary trees shown is the derived tree. The process of combining the elementary trees to yield a parse of the sentence is represented by the derivation tree. The nodes of the derivation tree are the tree names that are anchored by the appropriate lexical items. The combining operation is indicated by the nature of the arcs (broken line for substitution and bold line for adjunction), while the address of the operation is indicated as part of the node label. The derivation tree can also be interpreted as a dependency tree with unlabeled arcs between words of the sentence.
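To make the two combining operations concrete, here is a minimal Python sketch of elementary trees with substitution and adjunction; the node labels, tree shapes and helper names are invented for illustration and are not part of any actual LTAG implementation.

```python
# Illustrative sketch of LTAG elementary trees and the two combining operations.
class Node:
    def __init__(self, label, children=None, subst=False, foot=False):
        self.label = label
        self.children = children or []
        self.subst = subst      # substitution site on the frontier
        self.foot = foot        # foot node of an auxiliary tree

def frontier(tree):
    if not tree.children:
        yield tree
    else:
        for child in tree.children:
            yield from frontier(child)

def substitute(site, initial_tree):
    """Replace a substitution site with the root of an initial tree."""
    assert site.subst and site.label == initial_tree.label
    site.children, site.subst = initial_tree.children, False

def adjoin(node, aux_tree):
    """Splice an auxiliary tree into an interior node with the same label:
    the node becomes the root of the auxiliary tree, and the excised
    subtree reappears under the foot node."""
    foot = next(n for n in frontier(aux_tree) if n.foot)
    assert node.label == aux_tree.label == foot.label
    foot.children, foot.foot = node.children, False
    node.children = aux_tree.children

# alpha: initial tree anchored on 'stole'; beta: auxiliary tree for 'quickly'.
np1 = Node("NP", subst=True)
np2 = Node("NP", subst=True)
vp = Node("VP", [Node("V", [Node("stole")]), np2])
alpha = Node("S", [np1, vp])
beta = Node("VP", [Node("ADV", [Node("quickly")]), Node("VP", foot=True)])

substitute(np1, Node("NP", [Node("he")]))
substitute(np2, Node("NP", [Node("a car")]))
adjoin(vp, beta)   # derived tree now covers 'he quickly stole a car'
```

Running the example builds the derived tree; recording which elementary tree was attached into which, and at which address, would give the corresponding derivation tree.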
Elementary trees of LTAG are the domains for specifying dependencies. Mathematical, computational, and linguistic properties of LTAGs, their extensions and other related systems have been extensively studied. All these properties follow from two key properties of LTAGs: • Extended Domain of Locality (EDL): The elementary trees of LTAG provide an extended domain (as compared to CFGs or CFG-based grammars) for the specification of syntactic and related semantic dependencies. • Factoring Recursion from the Domain of Dependencies (FRD): Recursion is factored away from the domains over which dependencies are specified. LTAGs are more powerful than Context-free Grammars (CFGs) both weakly and, more importantly, strongly, in the sense that even if a language is context-free, LTAGs can provide structural descriptions not available in a CFG. LTAGs can handle both nested and crossed dependencies. Variants of LTAGs have been developed for handling various word-order variation phenomena. LTAGs belong to the so-called class of 'mildly context-sensitive' grammars. LTAGs have proved useful also in establishing equivalences among various classes of grammars, Head Grammars, Linear Indexed Grammars, and Combinatory Categorial Grammars (CCGs) for example. All important properties of CFGs are carried over to LTAGs including polynomial parsability, however, with increased complexity O(n^6) (Joshi, Vijay-Shanker & Weir 1993). A wide-coverage grammar for English has been developed in the framework of LTAG. The XTAG system, which is based on this grammar, also serves as an LTAG grammar development system and consists of a predictive left-to-right parser, an X-window interface, a morphological analyser and a part-of-speech tagger. The wide-coverage English grammar of the XTAG system contains 317,000 inflected items in the morphology (213,000 of these are nouns and 46,500 are verbs) and 37,000 entries in the syntactic lexicon. The syntactic lexicon associates words with the trees that they anchor. There are 385 trees in all, in a grammar which is composed of 40 different subcategorisation frames. Each word in the syntactic lexicon, on average, depending on the standard parts-of-speech of the word, is an anchor for about 8 to 40 elementary trees.
3 Statistical implications
Probabilistic CFGs can be defined by associating a probability with each production (rule) of the grammar. Then the probability of a derivation can be easily computed because each rewriting in a CFG derivation is independent of context and hence the probabilities associated with the different rewriting rules can be multiplied. However, the rule expansions are, in general, not context free. A probabilistic CFG can distinguish two words or phrases w and w' only if the probabilities P(w/N) and P(w'/N) as given by the grammar differ for some nonterminal. That is, all the distinctions made by a probabilistic CFG must be mediated by the nonterminals of the grammar. Representing distributional distinctions in nonterminals leads to an explosion in the number of parameters required to model the language. These problems can be avoided by adopting probabilistic TAGs, which provide a framework for integrating the lexical sensitivity of stochastic approaches and the hierarchical structure of grammatical systems. Two features of LTAGs make it particularly suitable as the basis of a probabilistic framework for corpus analysis (Resnik 1992; Schabes 1992). First, since every tree is associated with a lexical anchor, words and their associated structures are tightly linked. Thus the probabilities associated with the operations of substitution and adjoining are sensitive to lexical context. This attention to lexical context is not acquired at the expense of the independence assumption of probabilities because substitutions and adjoinings at different nodes are independent of each other. Second, FRD allows one to capture the co-occurrences between the verb likes and the head nouns of the subject and the object of likes, as the verb and its subject and object all appear within a single elementary structure.
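Because substitutions and adjoinings at different nodes are independent, the probability of a whole derivation factors into a product of per-operation probabilities, each conditioned on the elementary tree (and hence on its lexical anchor) being attached into. The following sketch is illustrative only: the tree names, addresses and probability values are invented.

```python
import math

# Hypothetical probability tables, as would be estimated from a parsed corpus:
# P(attach child tree at a given address of a given parent tree).
P_subst = {("alpha_gave", "NP_0", "alpha_he"): 0.12,
           ("alpha_gave", "NP_1", "alpha_picture"): 0.05}
P_adjoin = {("alpha_gave", "VP", "beta_quickly"): 0.01}

def derivation_log_prob(operations):
    """operations: list of ('subst'|'adjoin', parent_tree, address, child_tree)."""
    logp = 0.0
    for op, parent, address, child in operations:
        table = P_subst if op == "subst" else P_adjoin
        logp += math.log(table[(parent, address, child)])
    return logp

ops = [("subst", "alpha_gave", "NP_0", "alpha_he"),
       ("subst", "alpha_gave", "NP_1", "alpha_picture"),
       ("adjoin", "alpha_gave", "VP", "beta_quickly")]
print(derivation_log_prob(ops))   # log-probability of this derivation
```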
4 Synchronous TAGs
Synchronous TAGs are a variant of TAGs, which characterise correspond ences between languages (Shieber & Schabes 1992). Using EDL and FRD synchronous TAGs allow the application of TAGs beyond syntax to the task of semantic interpretation or automatic translation. The task of interpreta tion consists of associating a syntactic analysis of a sentence with some other structure-a logical form representation or an analysis of a target language sentence. In a synchronous TAG both the original language and its associ ated structure are defined by grammars stated in the TAG formalism. The two TAGs are synchronised with respect to the operations of substitution
and adjoining, which are applied simultaneously to related nodes in pairs of trees, one tree for each language. The left member of a pair is an elementary tree from the TAG for one language, say English, and the right member of the pair is an elementary tree from the TAG for another language, say the logical form language. Synchronous TAGs have been applied to the tasks of semantic interpretation, language generation, and machine translation.
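Schematically, a synchronous TAG derivation manipulates pairs of trees whose nodes are linked, applying each substitution or adjunction to both members of a pair at linked addresses. The sketch below only shows the bookkeeping; the tree names, addresses and dictionary representation are hypothetical and greatly simplified.

```python
# A synchronous pair couples a source-language elementary tree with a target
# tree (here a logical-form tree), plus links between their node addresses.
def make_pair():
    return {
        "english": {"tree": "alpha_gave", "filled": {}},     # S(NP_0 gave NP_1 to NP_2)
        "logic":   {"tree": "alpha_give_lf", "filled": {}},  # give'(e, x, y, z)
        "links":   {"NP_0": "arg_x", "NP_1": "arg_y", "NP_2": "arg_z"},
    }

def synchronous_substitute(pair, src_address, english_arg, logic_arg):
    """Apply one substitution simultaneously at linked addresses in both trees."""
    tgt_address = pair["links"][src_address]
    pair["english"]["filled"][src_address] = english_arg
    pair["logic"]["filled"][tgt_address] = logic_arg

pair = make_pair()
synchronous_substitute(pair, "NP_0", "he", "he'")
synchronous_substitute(pair, "NP_1", "a picture", "picture'")
synchronous_substitute(pair, "NP_2", "his mother", "mother'")
print(pair["logic"]["filled"])   # the logical-form side is built in lock-step
```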
5 Viewing lexicalised trees as super parts-of-speech
Parts of speech disambiguation techniques (taggers) are often used to eliminate (or substantially reduce) the parts of speech ambiguity prior to parsing itself. The taggers are all local in the sense that they use only local information in deciding which tag(s) to choose for each word. As is well known, these taggers are quite successful. In a lexicalised grammar such as LTAG each lexical item is associated with one or more elementary structures. The elementary structures of LTAG localise dependencies including long distance dependencies. As a result of this localisation, a lexical item may be (and, in general, is almost always) associated with more than one elementary structure. We call these elementary structures associated with each lexical item supertags, in order to distinguish them from the usual parts of speech. Thus the LTAG parser needs to search a large space of supertags for a given sentence. Eventually, when the parse is complete there is only one supertag for each word (assuming there is no global ambiguity). Note that even when there is a unique standard part of speech for a word, say a verb (V), there will be in general more than one supertag associated with this word, because of the localisation of dependencies and the syntactic locality that LTAG requires. It is the LTAG parser that is expected to carry out the supertag disambiguation. In this sense, supertag disambiguation is parsing. Since LTAGs are lexicalised, we are presented with a novel opportunity to eliminate (or substantially reduce) the supertag assignment ambiguity by using local information such as local lexical dependencies, prior to parsing. As in the standard parts of speech disambiguation we can use local statistical information, such as bigram and trigram models based on the distribution of supertags in an LTAG parsed corpus. Since the supertags encode dependency information, we can also use information about the distribution of distances of the dependent supertags for a given supertag. We have developed techniques for disambiguating supertags and investigated their performance and their impact on LTAG parsing (Joshi &
Srinivas 1994). Note that in the standard parts of speech disambiguation, the disambiguation could have been carried out by a parser; however, carrying out the parts of speech disambiguation makes the job of the parser easier: there is less work for the parser to do. Supertag disambiguation, in a sense, reduces the work of the parser even further. After supertag disambiguation, we have in effect a parse in our hand except for depicting the substitutions and adjoining explicitly; hence, supertag disambiguation can be described as almost parsing. The data required for disambiguating supertags have been collected by parsing the Wall Street Journal, IBM-manual and ATIS corpora using the wide-coverage English grammar being developed as part of the XTAG system. The parses generated by the system for these sentences from the corpora are not subjected to any kind of filtering or selection. All the derivation structures are used in the collection of the statistics. The supertag statistics which have been used in the preliminary experiments described below have been collected from the XTAG parsed corpora. The derivation structures resulting from parsed corpora (Wall Street Journal, for the experiments described here) serve as training data for these experiments. We have investigated three models. One method of disambiguating the supertags assigned to each word is to order the supertags by the lexical preference that the word has for them. The frequency with which a certain supertag is associated with a word is a direct measure of its lexical preference for that supertag. Associating frequencies with the supertags and using them to associate a particular supertag with a word is clearly the simplest means of disambiguating supertags. In a unigram model a word is always associated with the supertag that is most preferred by the word, irrespective of the context in which the word appears. An alternate method that is sensitive to context is the n-gram model. The n-gram model takes into account the contextual dependency probabilities between supertags within a window of n words in associating supertags with words. In the n-gram model for disambiguating supertags, dependencies between supertags that appear beyond the n word window cannot be incorporated into the model. This limitation can be overcome if no a priori bound is set on the size of the window but instead a probability distribution of the distances of the dependent supertags for each supertag is maintained. A supertag is dependent on another supertag if the former substitutes or adjoins into the latter.
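The unigram and n-gram models just described are direct analogues of standard POS-tagging models, with supertags (elementary tree names) in place of parts of speech. A minimal sketch with invented counts and tree names follows; a real model would be estimated from the XTAG-parsed corpora and would use proper smoothing and Viterbi decoding rather than the greedy choice shown here.

```python
from collections import Counter, defaultdict

# Invented counts; in practice these come from an LTAG-parsed corpus.
unigram = {"likes": Counter({"t_nx0Vnx1": 40, "t_nx0Vs1": 5})}
trigram = defaultdict(Counter)   # (tag[i-2], tag[i-1]) -> Counter over tag[i]
trigram[("t_Dnx", "t_N")]["t_nx0Vnx1"] = 7

def unigram_supertag(word):
    """Assign the supertag the word most prefers, ignoring context."""
    return unigram[word].most_common(1)[0][0]

def greedy_trigram_supertags(words, candidates):
    """Context-sensitive sketch: pick each word's supertag given the two
    previously chosen supertags (greedy approximation, not full Viterbi)."""
    chosen = ["<s>", "<s>"]
    for w in words:
        context = tuple(chosen[-2:])
        total = sum(trigram[context].values()) or 1
        chosen.append(max(candidates[w],
                          key=lambda t: trigram[context][t] / total))
    return chosen[2:]
```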
6 LTAGs and explanation-based learning techniques
Some novel applications of the so-called Explanation-based Learning Technique (EBL) have been made to the problem of parsing LTAGs. The main idea of EBL is to keep track of problems solved in the past and to replay those solutions to solve new but somewhat similar problems in the future. Although put in these general terms the approach sounds attractive, it is by no means clear that EBL will actually improve the performance of the system using it, an aspect which is of great interest to us here. Rayner was the first to investigate this technique in the context of natural language parsing (Rayner 1988). Seen as an EBL problem, the parse of a single sentence represents an explanation of why the sentence is a part of the language defined by the grammar. Parsing new sentences amounts to finding analogous explanations from the training sentences. The idea is to reparse the training examples by letting the parse tree drive the rule expansion process and halting the expansion of a specialised rule if the current node meets a 'tree-cutting' criterion. Samuelsson used the information-theoretic measure of entropy to derive the appropriate sized tree chunks automatically (Samuelsson 1994). Although our work can be considered to be in this general direction, it is distinguished by the following novel aspects. We exploit some of the key properties of LTAG (i) to achieve an immediate generalisation of parses in the training set of sentences, (ii) to represent the set of generalised parses as a finite state transducer (FST), which is the first such use of FST in the context of EBL, to the best of our knowledge, and (iii) to achieve an additional level of generalisation of the parses in the training set, not possible in other approaches, thereby being able to deal with test sentences which are not necessarily of the same length as one of the training sentences more directly. In addition to these special aspects of our work, we will present experimental results evaluating the effectiveness of our approach on more than one kind of corpora, which are far more detailed and comprehensive than
results reported so far. We also introduce a device called 'stapler', a very significantly impoverished parser, whose only job is to do term unification and compute alternate attachments for modifiers. We achieve substantial speed-up by the use of 'stapler' together with the output of the FST.
6.1 Implications of LTAG representation for EBL
An LTAG parse of a sentence can be seen as a sequence of elementary trees associated with the lexical items of the sentence along with substitution and adjunction links among the elementary trees. Given an LTAG parse, the generalisation of the parse is truly immediate in that a generalised parse is obtained by (i) uninstantiating the particular lexical items that anchor the individual elementary trees in the parse and (ii) uninstantiating the feature values contributed by the morphology of the anchor and the derivation process. In other EBL approaches (Rayner 1988; Samuelsson 1994) it is necessary to walk up and down the parse tree to determine the appropriate subtrees to generalise on and to suppress the feature values. The generalised parse of a sentence is stored under a suitable index computed from the sentence, such as, part-of-speech (POS) sequence of the sentence. In the application phase, the POS sequence of the input sentence is used to retrieve a generalised parse(s) which is then instantiated to the features of the sentence. If the retrieval fails to yield any generalised parse then the input sentence is parsed using the full parser. However, if the retrieval succeeds then the generalised parses are input to the 'stapler'. This method of retrieving a generalised parse allows for parsing of sen tences of the same lengths and the same POS sequence as those in the training corpus. However, in our approach there is another generalisation that falls out of the LTAG representation which allows for flexible matching of the index to allow the system to parse sentences that are not necessarily of the same length as some sentence in the training corpus. Auxiliary trees in LTAG represent recursive structures. So if there is an auxiliary tree that is used in an LTAG parse, then that tree with the trees for its arguments can be repeated any number of times, or possibly omitted altogether, to get parses of sentences that differ from the sentences of the training corpus only in the number of modifiers. This type of generalisation can be called modifier-generalisation. This type of generalisation is not possible in other EBL approaches. This implies that the POS sequence covered by the auxiliary tree and its arguments can be repeated zero or more times. As a result the index of
a generalised parse of a sentence with modifiers is no longer a string but a regular expression pattern on the POS sequence, and retrieval of a generalised parse involves regular expression pattern matching on the indices. If, for example, the training example was:
Show/V me/N the/D flights/N from/P Boston/N to/P Philadelphia/N
then the index of this sentence is V N D N (P N)*, since the prepositions in the parse of this sentence would anchor auxiliary trees. A Finite State Transducer (FST) combines the generalised parse with the POS sequence (regular expression) that it is indexed by. The idea is to annotate each of the finite state arcs of the regular expression matcher with the elementary tree associated with that POS and also indicate which elementary tree it would be adjoined or substituted into. The FST representation is possible due to the lexicalised nature of the elementary trees. This representation makes a distinction between dependencies between modifiers and complements. The number in the tuple associated with each word is a signed number if a complement dependency is being expressed and is an unsigned number if a modifier dependency is being expressed. In addition to these special aspects of our approach, we have evaluated the effectiveness of our approach on more than one kind of corpus. A substantial speed-up (by a factor of about 60) by the use of 'stapler' in combination with the output of the FST has been achieved (Srinivas & Joshi 1995).
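The retrieval step in the application phase can be pictured as a lookup table of generalised parses keyed by regular expressions over POS sequences, with the modifier-generalisation supplying the starred groups. In the sketch below the stored parse is just a placeholder string and the pattern is taken from the flight example above; a real system would store the FST output instead.

```python
import re

# Index of generalised parses: POS-sequence pattern -> stored generalised parse.
# 'Show me the flights from Boston to Philadelphia' generalises to V N D N (P N)*.
parse_index = {
    r"^V N D N( P N)*$": "generalised-parse-001",   # placeholder for the FST output
}

def retrieve(pos_sequence):
    """Return the generalised parse whose pattern matches the POS sequence,
    or None, in which case the full LTAG parser must be used."""
    key = " ".join(pos_sequence)
    for pattern, parse in parse_index.items():
        if re.match(pattern, key):
            return parse
    return None

# A sentence with three P N modifiers and one with none both match the same
# entry, which is exactly the modifier-generalisation described above.
print(retrieve("V N D N P N P N P N".split()))   # -> 'generalised-parse-001'
print(retrieve("V N D N".split()))               # -> 'generalised-parse-001'
```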
7 LTAGs and Categorial Grammars
LTAG trees can be viewed as partial proof trees (PPTs) in Categorial Grammars (CGs). The main idea is to associate with each lexical item one or more PPTs as syntactic types. These PPTs are obtained by unfolding the arguments of the type that would be associated with that lexical item in a simple categorial grammar such as the Ajdukiewicz and Bar-Hillel grammar (AB). This (finite) set of basic PPTs (BPPT) is then used as the building blocks of the grammar. Complex proof trees are obtained by 'combining' these PPTs by a uniform set of inference rules that manipulate the PPTs. The main motivation is to incorporate into the categorial framework the key ideas of LTAG, namely EDL and FRD (Joshi 1992; Joshi & Kulick 1995). Roughly speaking, EDL allows one to deal with structural
adjacency rather than the strict string adjacency in a traditional categorial grammar. In LTAG, this approach provides more formal power (both weak and strong generative power) without increasing the computational complexity too much beyond CFGs, while still achieving polynomial parsability (i.e., the class of mildly context-sensitive grammar formalisms (Joshi, Vijay-Shanker & Weir 1993)). EDL also allows strong lexicalisation of CFGs, leading again to LTAGs. Therefore, just as strong lexicalisation, EDL, and FRD together lead to LTAGs from CFGs, we can investigate the consequences of incorporating these notions into an AB categorial grammar, leading to the system based on PPTs. This work is also related to the work on description trees by (Vijay-Shanker 1992) and HPSG compilation into LTAGs by (Kasper, Kiefer, Netter & Vijay-Shanker 1995). There are two aspects to the PPT system: the construction of the individual PPTs, and the inference rules that define how they are manipulated. The set BPPT is constructed by the following schemas:
1. Arguments of the type associated with a lexical item are unfolded by introducing assumptions.
2. There is no unfolding past an argument which is not an argument of the lexical item.
3. If a trace assumption is introduced while unfolding then it must be locally discharged, i.e., within the basic PPT which is being constructed.
4. During the unfolding a node can be interpolated from a conclusion node X to an assumption node Y.
All assumptions introduced in a PPT must be fulfilled by one of the following three operations:
1. application: the conclusion node of one PPT is linked to an assumption node of another.
2. stretching: an interior node of a PPT is 'opened up', to create a conclusion node and an assumption node, in order to allow interaction with another PPT.
3. interpolation: the two ends of an interpolation construction (previously created within a PPT) are linked to another PPT.
While traditional categorial grammar rules specify inferences between types, the inference rules for the three operations on PPTs instead specify inferences between proofs. This is a direct consequence of the extended domain of locality in PPTs. (However, the rules for building the set BPPT are similar to those of other categorial grammars.)
These three operations are specified by inference rules that take the form of λ-operations, where the body of the λ-term is itself the proof. This is done by adapting a version of typed label-selective λ-calculus. This extension of λ-calculus uses labeling of abstractions and applications to allow unordered currying. Arguments have both symbol and number labels, and the intuitive idea is that the symbolic labels express the possibility of taking input on multiple channels, and the number labels express the order of input on that channel. In conclusion, we have discussed the notion of lexicalisation and its implications for formal and computational properties of such systems. Our discussion is in the context of LTAGs; however, we have briefly discussed how this approach can be extended to Categorial Grammars and related systems.

REFERENCES

Joshi, Aravind K. 1992. "TAGs in Categorial Clothing". Proceedings of the 2nd Workshop on Tree-Adjoining Grammars, Institute for Research in Cognitive Science (IRCS), University of Pennsylvania.
Joshi, Aravind K. & Seth Kulick. 1995. "Partial Proof-Trees as Building Blocks for Categorial Grammars". Submitted for publication.
Joshi, Aravind K. & Yves Schabes. 1992. "Tree-Adjoining Grammars and Lexicalized Grammars". Tree Automata and Languages ed. by M. Nivat & A. Podelski, 409-431. New York: Elsevier.
Joshi, Aravind K. & Bangalore Srinivas. 1994. "Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing". Proceedings of the 17th International Conference on Computational Linguistics (COLING-94), 154-160. Kyoto, Japan.
Joshi, Aravind K., K. Vijay-Shanker & D. Weir. 1991. "The Convergence of Mildly Context-Sensitive Grammar Formalisms". Foundational Issues in Natural Language Processing, ed. by Peter Sells, Stuart Shieber & Thomas Wasow, 31-81. Cambridge, Mass.: MIT Press.
Kasper, Robert, B. Kiefer, K. Netter & K. Vijay-Shanker. 1995. "Compilation of HPSG to TAG". Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), 92-99.
Rayner, Manny. 1988. "Applying Explanation-Based Generalisation to Natural Language Processing". Proceedings of the International Conference on Fifth Generation Computer Systems, 99-105. Tokyo, Japan.
Resnik, Philip. 1992. "Probabilistic Tree-Adjoining Grammar as a Framework for Statistical Natural Language Processing". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 418-424. Nantes, France.
Samuelsson, Christer. 1994. "Grammar Specialisation through Entropy Thresholds". Proceedings of the 32nd Meeting of the Association for Computational Linguistics (ACL'94), 150-156. Las Cruces, New Mexico.
Schabes, Yves, Anne Abeille & Aravind K. Joshi. 1988. "Parsing Strategies with 'Lexicalized' Grammars: Application to Tree-Adjoining Grammars". Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), 578-583. Budapest, Hungary.
Schabes, Yves. 1992. "Stochastic Lexicalized Tree-Adjoining Grammars". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 426-432. Nantes, France.
Shieber, Stuart & Yves Schabes. 1990. "Synchronous Tree-Adjoining Grammars". Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING-90), 1-6. Helsinki, Finland.
Srinivas, Bangalore & Aravind K. Joshi. 1995. "Some Novel Applications of Explanation-Based Learning to Parsing Lexicalized Tree-Adjoining Grammars". Proceedings of the 33rd Meeting of the Association for Computational Linguistics (ACL'95), 268-275.
Steedman, Mark. 1987. "Combinatory Grammars and Parasitic Gaps". Natural Language and Linguistic Theory 5:403-439.
Vijay-Shanker, K. 1992. "Using Descriptions of Trees in a Tree Adjoining Grammar". Computational Linguistics 18.4:481-517.
The XTAG Research Group. 1995. "A Lexicalized Tree-Adjoining Grammar of English". Technical Report IRCS 95-03. Philadelphia: Institute for Research in Cognitive Science (IRCS), University of Pennsylvania.
Case and word order in English and German
ALLAN RAMSAY* & REINHARD SCHÄLER**
*Centre for Computational Linguistics, UMIST
**Dept. of Computer Science, University College Dublin
Abstract
It is often argued that English is a 'fixed word order' language, whereas word order in German is 'free'. The current paper shows how the different distributions of verb complements in the two languages can be described using very similar constraints, and discusses the use of these constraints for efficiently parsing a range of constructions in the two languages.
1 Background
The work reported here arises from an attempt to use a single syntactic/semantic framework, and a single parser, to cope with both German and English. The motivation behind this is partly practical: having a uniform treatment of a large part of the two languages should make it easier to develop MT and other systems which are supposed to manipulate texts in both languages; and partly theoretical, just because any shared structural properties of languages with differing surface characteristics are of interest in themselves. The general framework is as follows: • Lexical items contain detailed information about the arguments they require and the targets they can modify. This information includes a specification of where a particular argument or target will be found. For arguments, this is done by partially ordering the arguments in terms of which is to be found next and specifying the direction in which it is to be found via a feature that can take one of the values left or right. This is very similar to the treatment in categorial grammar, save that in the current approach the sequence in which arguments are to be found is given via a partial ordering, rather than the complete linear ordering of standard categorial grammar. Additionally, the feature specifying the direction in which to look for a particular argument may not be instantiated until immediately before that argument is required.
• There is a strictly compositional semantics, expressed using a dynamic version of Turner's (1987) property theory.1 • Syntactic, and hence semantic, analysis is performed by a chart parser driven by a 'head-corner' strategy, whereby phrases are built up by combining the head with its arguments looking either to the right or the left depending on the direction specified by the next argument. This system would analyse the sentence 'he stole a car' as:
A::{past(A)} simple(A, λB(ιC::{subset(C, λD(male(D))) ∧ |C| = 1}
    ιE::{subset(E, λF(book(F))) ∧ |E| = 1}
    event(B) ∧ type(B, steal) ∧ object(B, E) ∧ by(B, C)))
This example displays most of the characteristics of our semantic analyses: • We use an event-based semantics, with aspect interpreted as a relation between event types and temporal objects such as instants. An event type is represented as a λ-abstraction over sentences about events (though remember that we are using property theory rather than typed λ-calculus as the means to interpret such expressions). • We use anchors to capture dynamic characteristics of referring expressions, so that an expression like ιC::{subset(C, λD(male(D))) ∧ |C| = 1}W says that W is true of the contextually unique singleton set of male individuals if there is one, and is uninterpretable otherwise (in other words, W is true of he). • Thematic relations are named after the prepositions that give them their most obvious syntactic marking, so that by(A, B) means that B is the agent of the event A, since agency is marked by the use of the case-marking preposition by when it is marked at all. This kind of semantic analysis is reasonably orthodox: the use of Davidsonian events has been widely adopted (e.g., see van Eijck & Alshawi 1992), the treatment of referring expressions via anchors into the context is very similar to the use of anchors in situation semantics (Barwise & Perry 1983), the decision to use the names of case-marking prepositions for thematic relations can easily be justified by appeal to Dowty's (1988) analysis of the
Property theory allows you to combine the standard logical truth functional operat ors with the abstraction operator of the λ-calculus without either running into the paradoxes of self-reference or being restricted by an otherwise unnecessary hierarchy of types. See (Ramsay 1994) for phenomena whose analysis is greatly simplified by the absence of types from property theory.
CASE AND WORD ORDER
17
semantics of thematic relations. The most surprising element of the treat ment above is the analysis of aspect as a relation between a temporal object and an event type: dealing with aspect this way provides more flexibility than is available in the approach taken by Moëns and Steedman (1988), but as far as the present paper is concerned it makes little difference and if you find it unintuitive then the best thing to do is ignore it. Treatments of a variety of semantic phenomena in English have been published elsewhere (Ramsay 1992; Ramsay 1994). The purpose of the current paper is to describe the syntactic devices which are used to indicate thematic role in English and to show how these can be adapted with very minor changes to obtain the same information in German. 2
Case and order in English
English deploys two mechanisms for assigning thematic roles to arguments of a verb, (i) Thematic roles are partially ordered in terms of their affinity for the syntactic role of subject. In particular, if the list of required arguments includes an agent and this argument is not explicitly case marked then it must be the subject; and the only time the thematic object of any verb can be the syntactic subject is if there are no other candidates. The subject is always adjacent to the verb, either on the left (simple declarative sentences) or the right (aux-inverted questions). The subject has the surface case marker +nom.For passive verbs, the item which would have taken the role of subject for the active form is found and is marked as being optional and obligatorily case marked before the real subject is found, (ii) Any other arguments appear to the right of the verb and are otherwise freely ordered, with the proviso that the argument in what is usually termed direct object position should be required to be marked +acc if possible, while any other arguments should have a case marking which reflects their thematic role. This case marking typically comes in the form of a preposition. Thus in (1) He gave his mother a picture. (2) He gave a picture to his mother. he is the agent of the event, his mother is the recipient and the picture is the object. In both cases the subject has to be the agent, since agents always take precedence when allocating the role of subject. In (1) the second argument a picture has its thematic role assigned by the surface case marking. In this case, that surface case marking is +acc, which specifies that this argument is playing the role of object. This leaves the role of
18
ALLAN RAMSAY & REINHARD SCHALER
recipient to his mother. The explicit case marker for the role of recipient is overridden by the assignment of +acc to whatever appears in direct object position, but it doesn't matter because it is already clear that the other two arguments are the agent and the object, which leaves recipient as the only option. In (2) the second argument to his mother has the case marker to, and hence is clearly the recipient, leaving object as the only option for a picture. The behaviour of the verb open fits the same pattern: (3) (4) (5) (6)
He opened the door with the key. He opened the door. The key opened the door. The door opened.
(3) is just like (2): the role of subject is taken by the agent, the final argu ment has its thematic role explicitly marked by the case-marking preposition , and the remaining argument gets the role of object because that's all that is left. In the other cases, the role of subject gets allocated to the agent in (4), to the instrument in (5), and the object in (6), in descending order of affinity. The only real problem is that we would expect to get (3') He opened the key the door. as a sort of 'dative shift' variant of (3). We rule this out simply by banning the instrument of the verb open from appearing in this position. This mapping between thematic roles and surface appearance is determ ined by three sets of rules, (i) Local rules may specify properties of par ticular arguments, e.g., that the agent of any passive verb must be marked —nom, or that the instrument of the verb open must be marked —obj1. (ii) A set of 'subject affinity' rules specifies which thematic role will be realised by an NP playing the surface role of subject, (iii) A set of linear preced ence rules of the kind introduced in GPSG (Gazdar et al. 1985) specifies the permitted orders in which the the arguments of the verb may appear. Subject affinity rules The decision as to which item should take the role of subject is determined by a set of rules such as the following: (51) X[+agent, +nom] «subj Y (52) X[+nom] «subj Y[+object] The first of these says that the agent is a better candidate for the role of subject than anything else is, provided that it is in fact capable of playing this role at all. The side-condition that the agent must be capable of playing
CASE AND WORD ORDER
19
this role is specified by the requirement that it should satisfy the property of being +nom — in certain circumstances, notably in passives, the agent is required to be explicitly case-marked by the preposition by, and hence cannot be the subject. In any sensible implementation the explicit case marking of the agent should precede the application of the subject affinity rules, but it is not in fact a logical necessity. The second rule here says that the thematic object is the worst candid ate, among those that are eligible, for this role. These two rules cover most, if not all, cases in English: the only situ ations where they fail to determine the subject is if (i) there is no agent or the agent is not eligible, and (ii) there are two other arguments neither of which is the semantic object. Such situations are sufficiently rare to be ' ignored for the purposes of this paper. Linear precedence rules The notion of linear precedence (LP) rule used here is slightly different from the standard GPSG treatment. In particular, because the grammar here is highly lexical our LP-rules deal with the arguments of lexical items, rather than with daughters of ID-rules. We will want to use the LP-rules on the fly, to determine which argument to look for next, and where to look for it. The following are the key rules for the arguments of English verbs: (LP1) X «lp Y[-nom,mother = X] (LP2) X[+nom,mother = M] «lp Y[mother = M] (LP3) X[+nom,mother = Y] «lp Y[—inv] (LP1) says that any non-subject argument Y of X must follow X; (LP2) says that the subject of M must precede any non-subject argument; and (LP3) says that if is marked as being non-invertible then its subject must precede it. (Sl-2) and (LP1-3) can be utilised within a head-corner parser to de termine what argument to look for next and where to look for it, as follows: • Start by applying the local rules: it's best to do this before choosing the subject, since the local rules will generally only be compatible with one choice of subject, but it is not strictly necessary to do so. • Next allocate the role of subject to one of the arguments of the verb by (Sl-2). Require this item to be marked +. • If there is an argument X of the verb V such that (i) V «lp X and (ii) there is no argument Y such that V «lp Y «lp X then look to the right for X, and delete X from the set of arguments waiting to be found. This step cannot sensibly be performed until the subject
20
ALLAN RAMSAY & REINHARD SCHALER has been found, since the LP rules depend on whether some item is
+ / - nom. • If there is an argument X of the verb V such that (i) X «lp V and (ii) there is no argument Y such that X «lp Y «lp V then look to the left for X, and delete X from the set of arguments waiting to be found. With one non-trivial extension, these rules cover virtually all the relevant phenomena in English. The key extension concerns the presence or absence of an explicit case marker on the leftmost item after the subject and the verb. If we mark this item as +obj1, then we need a default rule of the kind introduced in (Reiter 1980) of the form: M{X[+acc]) : X[+objl] X[+acc] This says that if it is possible to require the item in the relevant position to be marked +acc then you should do so. Unlike the previous rules this has to be a default rule and hence cannot be applied until the others have all done their work. The point here is that for a verb like rely the item in direct object position must be case-marked by the preposition on, as in He relied on her integrity: the consistency check in the above rule allows this by noting that the effect of the rule is incompatible with the effect of the lexical properties of rely, and hence the rule does not apply. Note that the requirement that the first non-subject argument after the verb has to be +obj1 provides the mechanism for ruling out he opened the door the key as a 'dative' version of he opened the door with the key. We simply mark the instrument of open as —obj1 (though not - n o m , since the instrument can get promoted to subject position if there is no explicit agent). The above rather straightforward rules cover virtually all the relevant phenomena in English. In particular they provide appropriate analyses for (1)-(6), and for: (7) A picture was given to his mother. (8) His mother was given a picture. (9) I saw him stealing a car. (10) He was seen stealing a car. (11) The ancient Greeks knew that the earth went round the sun. (12) That the earth went round the sun was known to the ancient Greeks.2 2
The case marker for the subject of the active sentence (11) turns out to be the pre-
CASE AND WORD ORDER
21
Extraposition in English English word order is not, however, as rigidly fixed as these examples sug gest. In particular, it is not unusual for one of the arguments which would normally appear to the right of the verb to appear way over to the left in front of the subject. The usual reason for this is that it provides a way of making the semantics of the shifted item available for some discourse operation such as contrast, as with the book in: (13) The film was banal, but the book I enjoyed. A wide variety of more-or-less formal accounts of such discourse operations have been provided (e.g., Halliday 1985, Krifka 1993, Ramsay 1994, Hoffman 1995), and there is no need to discuss their various merits and demerits here. The crucial point for the current paper is that surface word order does frequently get reorganised in this way in English, so that any claim that word order in English is fixed has to be treated very carefully. It is worth noting at this point that VP modifiers such as PPs and ADVPs seem to be subject to very similar kinds of constraint on where they can appear. Cases such as He suddenly stopped the car, he ate it in the park, I saw him sleeping by himself, . . . seem to indicate that there is a general rule in English that says that a VP can be modified by an appropriate modifier, and that the modifier should appear to the left of the VP if it is head-final and to the right if it is head-initial 3 (see Williams 1981, for discussion of this rule). This simple rule, however, is violated by examples like: (14) In the park he ate a peach. (15) She believed with all her heart that he loved her. In (14) the head-initial PP in the park is to the left of the S, rather than to the right of the VP; and in (15) the PP with all her heart is between the verb believed and its sentential complement that he loved her. In order to account for (14) we have to argue either that in the park can be either a left-modifier of an S or a right-modifier of a VP, in which case it will have to have different semantic types to combine appropriately with the types of its two potential targets; or that it is in fact a right-modifier of the VP which has been shifted to the left, probably in order to reduce ambiguity
3
position to, indicating that The Greeks is not the agent of know. This reflects the fact that agents typically intend the events that they bring about, which is not the case for the ancient Greeks in (11). Modifiers consisting of a single word, such as quietly, are both head-initial and headfinal, so that you get both he ate it quietly and he quietly ate it
22
ALLAN RAMSAY & REINHARD SCHALER
(since (14) has only one reading, whereas he ate a peach in the park has two) rather than to make it the argument of a discourse operator. The easiest way to account for (14) seems to be to argue that the complement that he loved her has been right shifted, probably again in order to reduce ambiguity (she believed that he loved her with all her heart sounds extremely odd, largely because the obvious attachment of with all her heart is to the VP loved her). In the system being described here, these 'shifts' of some argument or adjunct are dealt with using the standard unification technique of having a category valued feature called slash which can be given a value in order to denote the fact that some item is 'missing' from its expected position. We extend the standard notion, however, by allowing slash to have a stack of items as its value, indicating that more than one thing has gone missing. This is a departure from standard practice — in GPSG, for instance, the foot feature principle specifies that s l a s h can be given a non-trivial value by at most one of the daughters of a rule. This extension is required in English to cope with cases like (16) I was just talking to him when suddenly he collapsed. where the most obvious analysis assumes that both when and suddenly have been left-shifted. Much the same also holds for (17) Quietly, without a word, he turned his face to the wall. where quietly and without a word have both been topicalised, and for (18) where I believed at the time that he had left it where where has been topicalised out of that he had left it, which has itself been right-extraposed in the same way as that he loved her in (15). We will assume from now on that there is no pre-determined limit on the number of items that may be shifted either right or left, though there may well be local constraints that prevent extraposition happening in particular cases. The decision to allow multiple extrapositions could easily lead to an explosion in the number of partial parses that might be constructed. We therefore use a mechanism similar to Johnson and Kay's (1994) notion of 'sponsorship' to insist that for each object which you believe has been left-shifted there must indeed be at least one candidate item somewhere to the left. Furthermore, if more than one item has been left-extraposed then the sponsors must appear in the right order. With this filter on the freedom to hypothesise left-extrapositions, our move to permitting multiple extrapositions does not
lead to an unacceptable increase in the number of potential analyses4.
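The sponsorship filter described above can be pictured as a simple check over the material already found to the left of the hypothesised gaps. The following Python fragment is only an illustrative sketch (the list-of-categories representation and the labels are invented here, not taken from the implementation discussed in Section 4): a set of left-extraposition hypotheses is kept only if each hypothesised gap has a candidate sponsor somewhere to its left, and the sponsors occur in the same order as the gaps.

    def has_sponsors(slashed_categories, left_context):
        """Check that every category hypothesised as left-extraposed has a
        candidate sponsor to the left, in the same order as the gaps."""
        position = 0
        for cat in slashed_categories:
            # look for the next sponsor of this category, strictly after the
            # sponsor used for the previous gap
            while position < len(left_context) and left_context[position] != cat:
                position += 1
            if position == len(left_context):
                return False        # no sponsor available: discard the hypothesis
            position += 1           # sponsors must respect the order of the gaps
        return True

    # two topicalised modifiers, as in (17) 'Quietly, without a word, he ...'
    print(has_sponsors(["ADVP", "PP"], ["ADVP", "PP", "NP"]))   # True
    print(has_sponsors(["PP", "ADVP"], ["ADVP", "PP", "NP"]))   # False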
3 Case and word order in German
We now turn to German, where surface case marking seems to be rather more important than word order in the allocation of thematic roles to arguments. Very roughly, it seems that in German the following conditions hold:
• General properties of a clause determine whether the verb appears as the first, second or final constituent.
• One argument is marked as being the subject, and undergoes the usual agreement constraints for subjects.
• The arguments of a verb are not subject to a strict set of LP-rules, though quite strong discourse effects can be obtained by putting something other than the subject as the leftmost argument.
To take a simple example,
(19) Er gab seiner Mutter ein Bild.
(20) Er gab ein Bild seiner Mutter.
(21) Seiner Mutter gab er ein Bild.
(22) Ein Bild gab er seiner Mutter.
are all reasonable translations of he gave a picture to his mother. In each case, the choice of er as the subject indicates that he was the agent, the dative marking of seiner Mutter indicates that the mother was the recipient, and the accusative marking of ein Bild shows that this is the thematic object. Choosing (21) or (22) would normally presuppose that the speaker wanted to make seiner Mutter or ein Bild available for some discourse operator, but all four options are certainly permissible. Similarly,
(23) Gab er seiner Mutter ein Bild?
(24) Gab er ein Bild seiner Mutter?
(25) Gab seiner Mutter er ein Bild?
(26) Gab ein Bild er seiner Mutter?
4 It does not seem possible to extend the notion of sponsorship to deal with right extrapositions, since you can't anticipate whether sponsors may turn up later on as you proceed. Fortunately the local constraints tend to restrict the number of items that could possibly be right-shifted.
are all available as questions about the donation of a book to someone's mother, with the choice of which argument is to come immediately after the verb indicating whether, as in (23), we don't know whether the person he gave the book to was his mother, or, as in (26), we don't know whether what he gave her was a book.
Uszkoreit (1987) argues that (19)-(22) and (23)-(26) can all be obtained, as for the English cases, from a set of rules which choose the subject, a set of LP-rules, and a mechanism for left-extraposition. The essence of Uszkoreit's analysis is that there is one basic LP-rule, which in the terms used here would look like
X «lp Y[mother = X]
and that the simple declarative forms (19)-(22) are obtained by topicalisation. This looks very straightforward, and the only change that we would argue for at this point is that Uszkoreit deals with cases like
(27) In dem Park aß er einen Apfel
by treating in dem Park as an argument of aß, whereas it seems more sensible to treat it as an ordinary post-modifier of the VP and to allow it to be left-shifted just as in (14). It is notable that, in German as in English, cases where a preposition modifier is left-shifted are much less marked than ones where some other non-subject item appears in the leftmost position. The reason is that left-shifting a modifier can be used as a means of reducing ambiguity, and hence is a useful thing to do regardless of any discourse effect you want to produce.
The first point at which this simple rule has to be altered arises when we consider verbs other than the main verbs of major clauses (i.e., non-finite verbs and main verbs of subordinate clauses. Following Uszkoreit we will mark these as —mc). In
(28) Ich sah ihn ein Auto stehlen.
(29) Ich habe ein Auto gestohlen.
the NP ein Auto is certainly an argument of (ge)ste(o)hlen, yet appears to its left. It also seems as though in (28) ihn may also be an argument of stehlen, as something like a +acc marked subject. To accommodate these examples, we might adapt our observations about word order by simply saying that the arguments of a minor verb must precede it, and leave it at that. The LP rules would then become:
X[+mc] «lp Y[mother = X]
Y[mother = X] «lp X[—mc]
These rules have much the same flavour as the ones for English, and could be used in just the same way by a parser which incrementally chose which argument to look for next and which direction to look for it in. Clearly the verb-second examples like (19)-(22) would require you to worry about left-extraposition, but this is not a major extra burden since you will always have to worry about that anyway.
Unfortunately, you cannot always tell from the appearance of a verb whether it should be marked +mc or —mc. Non-finite verbs are always —mc, but there are plenty of cases, e.g., stehlen, where the appearance of the verb does not determine its form; and even where the form is determined, you cannot know for a tensed verb whether it is +mc or —mc until you know the context in which it appears. This means that any bottom-up parser which depends on the two LP-rules above is frequently going to have to investigate two sets of hypotheses, one looking to the right for all the arguments of the verb and one looking to the left.
At this point it is worth recalling two points: (i) on Uszkoreit's account, the only difference between the polar interrogative form and the simple main clause declarative is that the latter has something (either an argument or a modifier) left-extraposed. (ii) For entirely independent reasons, it seemed sensible in English to allow multiple items to be extraposed. We therefore propose the following alternative treatment of —mc verbs in German.
• There is only one LP-rule for verbs, namely X «lp Y[mother = X].
• Polar interrogatives, simple declaratives and —mc verbs are distinguished entirely by the number of items which have been left-shifted.
With these rules we get all the obvious cases, e.g.,
(30) Stahl er ein Auto?
[[Stahl, right—er], right—[ein, right—Auto]]
(31) ... (weil) er ein Auto stahl.
[er, [[ein, right—Auto], [[stahl, right—trace], right—trace]]]
(32) Ein Auto stahl er.
[[Ein, right—Auto], [[stahl, right—er], right—trace]]
The markers left and right in these indicate where the item in question was found, and trace indicates that what was found was a trace of something which has been extraposed. Thus in (32) both arguments were found to the right of the verb, but one of them was a trace which was cancelled by the NP ein Auto which itself consisted of a determiner with a noun to its right. This is exactly as described by Uszkoreit for verb-initial and verb-second clauses. More interestingly, we can cope with embedded clauses without requiring —mc verbs to look to the left for their arguments: (33) Ich weiß, er stahl ein Auto. [Ich, [[weiß) right—trace], right—[er, [[stahl,right—trace], right—[ein, right—Auto]]]]] (34) Ich weiß, in dem Park stahl er ein Auto. [Ich, [[weiß, right—trace], right —[[in, right—[dem, right—Park]], [[[stahl,right—er],right—[ein,right—Auto]], right—trace]]]] (35) Ich weiß, daß er ein Auto stahl. [Ich, [[weiß, right—trace], right -[daß, right — [er, [[ein, right—Auto], [[stahl,right—trace], right—trace]]]]]] In (33) weiß requires a +mc clause as its argument, and hence er stahl ein Auto, with one left-shifted argument is fine. Similarly, the presence of the left-shifted PP in (34) means that the embedded clause is +mc. In (35), on the other hand, the complementiser daß requires a —mc clause as its argument, and hence both arguments er and ein Auto of stahl have to be left-shifted. The complementiser then returns a +mc clause, as required. Similarly, in (36) ein Auto, das er stahl,
[ein, right—[Auto, right—[das, [er, [[stahl, right—trace], right—trace]]]]]
the relative clause has to be —mc and hence has all its arguments left-shifted, with the WH-pronoun (!) das coming first because of the fact that you can't extrapose anything from a sentence which has already been WH-marked (so you don't get: *ein Auto, er das stahl). Verbs with non-finite sentential complements work exactly the same way:
(37) Ich sah ihn ein Auto stehlen.
[Ich, [[sah, right—trace], right—[ihn, [[ein, right—Auto], [[stehlen, right—trace], right—trace]]]]]
The embedded clause ihn ein Auto stehlen has a non-finite, hence —mc, main verb, and therefore both arguments have again been left-shifted.
Auxiliaries are slightly more awkward. In English, auxiliaries and modals take VPs as their arguments, i.e., verbs which have found all their arguments apart from the subject. To deal with that in the current context, we would have to allow slash elimination to occur with VP's as well as with S's, analysing the phrase ein Auto gestohlen in
(38) Ich habe ein Auto gestohlen.
by taking gestohlen as something like VP[subcat = {NP[+nom]}, slash = {NP}] and then cancelling the slashed NP against ein Auto to obtain a normal VP. This is a possibility, but the decision to allow slash elimination to occur with items other than S's is a major step. For the moment we prefer to assume that auxiliaries and modals require S's whose subjects have been extraposed, rather than ones whose subjects have not been found, and to retain the principle that slash elimination only occurs with S's. We therefore treat (38) as
[Ich, [[habe, right—[[ein, right—Auto], [[gestohlen, right—trace], right—trace]]], right—trace]]
Here gestohlen has had both arguments extraposed, and habe has had its subject extraposed. Only one of the arguments for gestohlen gets cancelled, namely ein Auto, and therefore there is an S with its subject missing immediately to the right of habe. This is therefore accepted as one argument, and the other is slashed. When it turns up, namely as Ich, the whole thing turns out to be a perfectly ordinary declarative main clause. This may turn out not to be the best solution for auxiliaries. For the moment we will just note that it does at least work, and that it does not require any radical extensions to the analysis developed above for the other cases. We will be looking again at this, but it does at least provide a treatment that works without incurring any substantial extra costs.
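Before turning to the implementation, it may help to see the stack-valued slash bookkeeping in miniature. The sketch below is ours and deliberately ignores everything except the stack discipline; the Node representation and the category names are invented for the example, not the feature system of the implementation.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """A category with a stack-valued slash feature: the list of items
        still 'missing' from their canonical position, most recent last."""
        cat: str
        slash: list = field(default_factory=list)

    def extrapose(node, missing_cat):
        # hypothesise that a constituent of category missing_cat has been shifted
        return Node(node.cat, node.slash + [missing_cat])

    def cancel(node, filler_cat):
        # combine with a filler, discharging the most recent slash element if it matches
        if node.slash and node.slash[-1] == filler_cat:
            return Node(node.cat, node.slash[:-1])
        return None            # the filler does not match: no analysis

    s = Node("S")
    s = extrapose(extrapose(s, "ADVP"), "PP")   # two left-shifted items, as in (17)
    s = cancel(s, "PP")
    s = cancel(s, "ADVP")
    print(s)                    # Node(cat='S', slash=[]) -- all gaps discharged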
4 Implementation
The rules outlined above for computing the properties of the next argument to be found when saturating a verb in English and German have been implemented in a version of the parser and syntax/semantics reported in (Ramsay 1992; Ramsay 1994). Within this framework as much information as seems sensible is packed into the descriptions of lexical items, with a very small number of rules being used for saturating and combining structures together. In particular, the description of a lexical item W contains the following pieces of information: • a description of the syntactic properties of the item W' that would result from saturating W. • a description of the set of arguments which W requires. This set may be empty, as in the case of pronouns or simple nouns. • a description of the items that W' might modify (e.g., an adjective like old would specify that it could modify an N, a preposition like in would specify that when saturated it could modify either an N or a VP). The grammar then has four rules: • An unsaturated item can combine with one of its arguments under appropriate circumstances. • A modifier can combine with an appropriate target. • A sentence which has had something extraposed to the left or right can combine with an appropriate item on the left or right. • If X' is a redescription of X then any of the first three rules can be applied to X'. This rule captures the notion that items can often be
viewed from different perspectives — that a generic N can be seen as an NP, that certain sorts of WH-clause can be seen as NP's (e.g., I don't know much about art but I know what I like), and so on. These rules are simple enough for it to be reasonable to build the parser around them. The key, of course, is that the first three all talk about 'appropriate' items and circumstances, and this notion of appropriate needs to be fleshed out. Part of what is meant here is that feature percolation principles have to be applied in order to complete the descriptions of the required items. These feature percolation principles are essentially dynamic, since they include pre-defaults which say things like "unless you already know that X is required to be something else then require it to be +acc" ; post-defaults, which say things like "unless you know that X is capable of functioning as an adjunct then assume it isn't"; and principles like the FFP which depend on properties of the siblings of the item in question. The issue of appropriateness also includes information about which argument from a set of arguments to look for next, and whether to look to the left or the right for it; and about whether a modifier should appear to the left or right of its target, or whether an extraposed item should be found to the left or right of the sentence it has been extracted from. The question of whether to look to the left or right for an item is also essentially dynamic. Consider, for instance, the following NP's: (39) a sleeping man (40) a quietly sleeping man (41) a man sleeping in the park In (39) and (40) the modifier has to appear to the left of the target, in (41) it has to appear to the right. The reason seems to be that sleeping and quietly sleeping are head-final, whereas sleeping in the park is not. This is a property of the phrase as a whole, rather than of its individual components, and hence cannot be determined until the whole phrase has been found. Similarly, the discussion of case-marking and argument order in Sections 2 and 3 above suggests that the direction in which the next argument should be found and the details of its syntactic properties depend on what was found last and what properties it had. Given this dynamic view of these otherwise rather skeletal rules, it seems reasonable to embody them directly into the parser. The term 'head-corner' reflects the fact that we work outwards from lexical items, trying to saturate them by looking either left or right, as determined dynamically by the LPrules. This strategy provides a very effective combination of top-down and
bottom-up processing. As examples of cases where this pays off, consider the following English sentences:
(42) That she should be so confident says a lot for her education.
(43) Eating raw eggs can give you salmonella poisoning.
(44) There is a dead rat in the kitchen.
In (42), the sentence that she should be so confident is the subject of the verb says; in (43) the subject of give is the VP eating raw eggs; and in (44) the subject of is is the dummy item there. The fact that verbs can require either non-NP's or extremely special NP's as their subjects means that you can't afford to have a simple rule like:
S → NP, VP[+tensed]
since it won't cover (42), it won't cover (43) unless you regard present participle VP's as being a species of NP, and it won't specify the detailed characteristics of the subject NP in (44). You would therefore need a rule more like:
S → X, VP[+tensed, subject = X]
But any parser which worked generally left to right would produce unacceptable numbers of hypotheses in the presence of a rule like this. By working outwards from the head verb in directions specified by the LP-rules, we can cope with (42)-(44) without drowning in a sea of unwarranted hypotheses. Similarly, by replacing the general rule:
X → X, conj, X
by lexical entries whose subcategorisation frames say that a conjunction can be saturated to an X if you find an X to the left and then one to the right, we can cope with the combinatorial explosion that a rule of this kind would otherwise introduce. In much the same way, the fact that we determine the direction in which a modifier is to seek its target dynamically means that we can be economical about making hypotheses about where to look for adjunct/target pairs.
The main reason for providing distinct mechanisms for combining heads with arguments and adjuncts with targets comes from our desire to treat examples like:
(45) In the park there is a playground for preschool children.
as involving extraposition of the PP in the park. This treatment is motivated on semantic grounds, since otherwise we have to be prepared to treat in the park as both a function of type t → t when it modifies an S, as would happen in (45); and as a function of type ((e → t) → (e → t)) when it modifies a VP, as in:
(46) The youths drinking cider in the park looked extremely threatening.
The key difference is that in head/argument pairs, the argument can be extraposed, whereas in modifier/target pairs the modifier can. We therefore cannot afford to treat a preposition like in as being of type VP\VP/NP, as would be done in raw categorial grammar, since there is no obvious way of extraposing the partially saturated structure in the park from this.
This parser works fine for English. It works even better for German. Consider the verb gab. On the analysis outlined above, this generates six possible orders for the arguments, namely agent-object-recipient, agent-recipient-object, object-agent-recipient, object-recipient-agent, recipient-object-agent, recipient-agent-object. Some of these mark strong rhetorical devices, and others may only be possible with particular combinations of +/-heavy NP's, but they are all at least conceivable. Furthermore, if we take it that gab can appear in polar questions, +mc declarative sentences, and —mc clauses, then each of these can appear with the verb either at the start, after the first item, or at the end — a total of 18 possible sequences. And then we have to consider the possible presence of adjuncts, which could easily lead to +mc declarative forms in which the verb precedes all three core arguments. And finally, of course, in each case we have to consider the possibility that a given argument may have been extraposed, either for rhetorical reasons or simply to construct a relative clause.
Within the current framework, we initially generate just three hypotheses — that the agent is the leftmost argument, or that the object is, or that the recipient is. We then look to the right for this argument: if we find a concrete instance then the case marking will almost certainly rule out all except one case, and if we decide to hallucinate an extraposed instance then the search for sponsors will ensure that we only do so if there is indeed something of the required kind already lying around. We therefore explore only a very constrained part of the overall search space. The parser we developed initially for English actually works even better for German!
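The way case marking prunes the three initial hypotheses can be sketched as follows. This is a toy illustration only; the miniature lexicon, the case labels and the role names are invented here and are not the lexical entries of the system just described.

    # Start from the verb, hypothesise which thematic role comes next, and let
    # the surface case of the word actually found immediately prune the rest.
    LEXICON = {
        "er":     {"case": "nom"},
        "seiner": {"case": "dat"},
        "ein":    {"case": "acc"},   # simplification: 'ein Bild' taken as accusative
    }
    ROLE_CASE = {"agent": "nom", "recipient": "dat", "object": "acc"}

    def next_argument_hypotheses(remaining_roles, next_word):
        """One hypothesis per remaining role; keep only those whose case
        requirement is compatible with the word found next."""
        case = LEXICON.get(next_word, {}).get("case")
        return [role for role in remaining_roles if ROLE_CASE[role] == case]

    # 'Gab er seiner Mutter ein Bild?': three hypotheses are generated after
    # the verb, but the nominative 'er' leaves only the agent reading.
    print(next_argument_hypotheses({"agent", "object", "recipient"}, "er"))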
5 English is German
Uszkoreit, rightly, complains that a consequence of the historical concentra tion on English is that other languages get forced into a framework which really does not fit them at all well. This is particularly unfortunate in view of the fact that English is in fact a rather messy amalgam of other languages, with German being a notable contributor. It is therefore appropriate to fin ish the current paper by noting a couple of English constructions which do
not fit the analysis outlined in Section 2 above, but which do behave very much like the constructions described in Section 3. The first is a rather archaic form of polar question. It used to be possible to say things like (47) rather than (48): (47) Know ye not who I am? (48) Don't you know who I am? (47) is exactly parallel to the standard form of German polar question, and it is tempting to treat it in exactly the same way. It is also tempting, of course, to treat it using the standard English rules but allowing words other than auxiliaries to be marked +inu, and it would be a mistake to make too much of this example, but it is at the very least provocative. Perhaps more significant is the topicalisation of (49) to (50): (49) An old man was on the bus. (50) On the bus was an old man. The standard rules for topicalisation in English would have produced On the bus an old man was, parallel to On the bus an old man slept. The German rules, however, would have produced (50). Should we therefore deal with this one as though the English copula was in fact subject to the German LP-rules? Is at least part of English just German? REFERENCES Barwise, Jon & John Perry. 1983. Situations and Attitudes. Cambridge, Mass. Bradford Books. Dowty, David R. 1988. "Type raising, functional composition and non-constituent conjunction". Categorial Grammars and Natural Language Structures ed. by Richard T. Oehrle, Emmon Bach & Deirdre Wheeler, 153-198, Dordrecht: Reidel. Gazdar, Gerald, Ewan Klein, Geoffrey K. Pullum & Ivan Sag. 1985. Generalised Phrase Structure Grammar. Oxford: Basil Blackwell. Halliday, M. A. K. 1985. An Introduction to Functional Grammar. London: Edward Arnold. Hoffman, Beryl. 1995. "Integrating "Free", Word Order Syntax and Information Structure". Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL'95), 245-252. Dublin. Johnson, Mark & Martin Kay. 1994. "Parsing and Empty Nodes". Computa tional Linguistics 20:2.289-300. Krifka, Manfred. 1993. "Focus, Presupposition and Dynamic Interpretation". Journal of Semantics, 10.
Moëns, Marc & Mark Steedman. 1988. "Temporal Ontology and Temporal Reference". Computational Linguistics 14:2.15-28.
Ramsay, Allan M. 1992. "Bare Plural NPs and Habitual VPs". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 226-231. Nantes, France.
1994. "Focus on 'only', and 'not'". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 881-885. Kyoto, Japan.
Reiter, Ray. 1980. "A Logic for Default Reasoning". Artificial Intelligence 13:1.81-132.
Turner, Ray. 1987. "A Theory of Properties". Journal of Symbolic Logic 52:2.455-472.
Uszkoreit, Hans. 1987. Word Order and Constituent Structure in German. CSLI, Stanford, Calif.
van Eijck, Jan & Hiyan Alshawi. 1992. "Logical Forms". The Core Language Engine ed. by Hiyan Alshawi, 11-40. Cambridge, Mass.: MIT Press.
Williams, Edwin. 1981. "On the notions 'lexically related' and 'head of a word'". Linguistic Inquiry 12:2.254-274.
An Optimised Algorithm for Data Oriented Parsing
KHALIL SIMA'AN
Utrecht University
Abstract
This paper presents an optimisation of a syntactic disambiguation algorithm for Data Oriented Parsing (DOP) (Bod 1993) in particular, and for Stochastic Tree-Substitution Grammars (STSG) in general. The main advantage of this algorithm over existing alternatives (Bod 1993; Schabes & Waters 1993) is time-complexity linear, instead of square, in grammar-size. In practice, the algorithm exhibits substantial speed up. The paper also suggests a heuristic for DOP, supported by experiments measuring disambiguation-accuracy on the ATIS domain. Bracketing precision is 97%, 0-crossing sentences are 84% of those parsed and average CPU-time is 18 seconds.
1 Introduction
Data Oriented Parsing (DOP) (Scha 1990; Bod 1992) projects an STSG directly from a given tree-bank. DOP projects an STSG by decompos ing each tree in the tree-bank in all ways, at zero or more internal nodes each time, obtaining a set of constituent structures, which then serves as the elementary-trees set of an STSG. An STSG is basically a ContextFree Grammar (CFG) with rules which have internal structure i.e., are elementary-trees (henceforth elem-trees). Deriving a parse for a given sen tence in STSG is combining elem-trees using the same substitution opera tion as used by CFGs. In contrast to CFGs, however, STSGs allow various derivations to generate the same parse. Crucial for natural language disam biguation, the set of trees generated by an STSG is not always generatable by a CFG; thus, STSGs impose extra constraints on the generated struc tures. For selecting a distinguished structure from the space of generated structures for a given sentence, DOP assigns probabilities to the applica tion of elem-trees in derivations. The probability, which DOP infers for each elem-tree, is the ratio between the number of its appearances in the tree-bank (i.e., either as a tree or as a subtree) and the total number of ap pearances of all elem-trees which share with it the same root non-terminal (for an example see Figure 1). A derivation's probability is then defined as the multiplication of the probabilities of the elem-trees which participate in
it. And a parse's probability is the sum of the probabilities of all derivations which generate it. For disambiguation, one parse is selected from the many that a given sentence can be assigned with. In experiments reported in Bod (1993), on a manually corrected version of the ATIS tree-bank, both the most probable parse (MPP) and the parse derived by the most probable derivation (MPD) were observed.
As expected, the STSGs which DOP projects from a tree-bank have a large number of deep elem-trees. This makes parsing and disambiguation time-consuming. The experiments in Bod (1993) had to employ Monte Carlo techniques (basically repeated random sampling). Execution-time in these experiments was a few hours per sentence. In Sima'an et al. (1994), various algorithms are presented for disambiguation under DOP; among them there is a polynomial-time algorithm for computing the MPD and an exponential-time algorithm for computing the MPP1. Another algorithm for computing the MPD for Stochastic Lexicalised Context-Free Grammars (SLCFGs) is presented in Schabes & Waters (1993). Time-complexity of both algorithms for computing the MPD is square in grammar size. For DOP grammars, these algorithms become unattractive as soon as the grammar takes realistic sizes. In this paper the algorithm for computing the MPD (Sima'an et al. 1994) is refined to achieve time-complexity of order linear in grammar size. In addition, the present paper suggests a useful heuristic for reducing the sizes of DOP models projected from tree-banks.
The structure of the paper is as follows. Section 2 briefly presents the necessary terminology and properties pertaining to STSGs and parsing. Section 3 presents the algorithm formally. Subsequently, Section 4 provides empirical evidence to its claimed performance and discusses a heuristic for DOP. Finally, in Section 5, the conclusions are discussed.
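To make the projection step concrete, the following Python sketch decomposes toy trees in all ways at their internal nodes and estimates elementary-tree probabilities as the relative frequencies described above. It is an illustration under our own tuple representation of trees, not the data structures used in the experiments reported below.

    from collections import Counter
    from itertools import product

    # A toy tree is a nested tuple (label, children); children is a tuple of
    # sub-trees or terminal strings; an open substitution site is (label, ()).
    def fragments(tree):
        """All elementary trees rooted at this node, obtained by cutting the
        tree at any subset of its internal nodes."""
        label, children = tree
        options_per_child = []
        for child in children:
            if isinstance(child, str):
                options_per_child.append((child,))          # terminals stay
            else:
                cut = (child[0], ())                         # cut here: open node
                options_per_child.append((cut,) + tuple(fragments(child)))
        return [(label, combo) for combo in product(*options_per_child)]

    def all_fragments(tree):
        result = list(fragments(tree))
        for child in tree[1]:
            if not isinstance(child, str):
                result.extend(all_fragments(child))
        return result

    def dop_probabilities(treebank):
        # relative frequency among fragments sharing the same root label
        counts = Counter(f for t in treebank for f in all_fragments(t))
        totals = Counter()
        for frag, n in counts.items():
            totals[frag[0]] += n
        return {frag: n / totals[frag[0]] for frag, n in counts.items()}

    toy = ("S", (("NP", ("John",)), ("VP", (("V", ("sleeps",)),))))
    for frag, p in sorted(dop_probabilities([toy]).items(), key=str):
        print(round(p, 3), frag)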
2 STSGs: Definitions, terminology and properties
Notation: A, B, C, N, S denote non-terminal symbols, and w denotes a terminal symbol. α, β denote strings of zero or more symbols which are either terminals or non-terminals. A CFG left-most (l.m.) derivation of (exactly one)/(zero or more)/(at least one) steps is denoted resp. with →lm/→*lm/→+lm. Note, → is also used in declarations of functions. |X| denotes the size of a set X (i.e., its cardinality).
An STSG is a five-tuple (VN, VT, S, C, PT), where VN and VT denote
1 Recently we proved the problem of computing the MPP under STSGs is NP-hard.
Fig. 1: An example: STSG projection in DOP. Corpus tree t1 is cut at the internal S node. The resulting set of elem-trees is at the right side. Elementary trees et1 and et3 occur each only once in the corpus trees, while et2 occurs twice (once as a tree and once as a result of cutting t1). The total number of occurrences of these elem-trees is 4, leading to the probabilities shown in the figure.
respectively the set of non-terminal and the set of terminal symbols, S denotes the start non-terminal, C is a set of elem-trees (of arbitrary depth ≥ 1) and PT is a function which assigns a value 0 < PT(t) ≤ 1 (probability) to each elem-tree t such that Σ_{t∈C, root(t)=N} PT(t) = 1 (where root(t) denotes the root of tree t). An elem-tree in C has only non-terminals as internal nodes but may have both terminals and non-terminals on its frontier. A non-terminal on the frontier is called an open-tree (OT).
Substitution: If the left-most open-tree N of tree t is equal to the root of tree t1, then t ∘ t1 denotes the tree obtained by substituting t1 for N in t. The partial function ∘ is called left-most substitution. Notice that the value PT(t) for elem-tree t with root N is the probability of substituting t for any open-tree N in any elem-tree in C.
A Left-most derivation (l.m.d.) is a sequence of left-most substitutions lmd = (...(t1 ∘ t2) ∘ ...) ∘ tn, where t1, ..., tn ∈ C, root(t1) = S and the frontier of lmd consists of only terminals. The probability P(lmd) is defined as PT(t1) × ... × PT(tn). For convenience, derivation refers to l.m. derivation.
A Parse is a tree generated by a derivation. A parse can be generated by many derivations. The probability of a parse is the sum of the probabilities of the derivations which generate it.
A Finitely ambiguous grammar derives a finite string only in a finite number of ways. An STSG is in Extended CNF (ECNF) if in each elem-tree each non-terminal node has one non-terminal child, two non-terminal children or only one terminal child2.
2 Each STSG can be transformed into this form without disturbing it as a probabilistic model. Moreover a reverse transformation of any result obtained in the ECNF is easy and valid.
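The probability definitions just given amount to a few lines of code. In the sketch below the elementary trees are referred to by hypothetical identifiers et1-et3; if, as in the example of Figure 1, the three elementary trees all share the same root label (so that their probabilities are 1/4, 1/2 and 1/4), a parse generated both by et3 alone and by et1 ∘ et2 receives probability 0.375.

    from math import prod   # Python 3.8+

    def derivation_probability(elem_tree_prob, derivation):
        # probability of a left-most derivation: the product of the
        # probabilities of the elementary trees it uses
        return prod(elem_tree_prob[t] for t in derivation)

    def parse_probability(elem_tree_prob, derivations_of_parse):
        # a parse may have several derivations; its probability is their sum
        return sum(derivation_probability(elem_tree_prob, d)
                   for d in derivations_of_parse)

    probs = {"et1": 0.25, "et2": 0.5, "et3": 0.25}
    print(parse_probability(probs, [["et3"], ["et1", "et2"]]))   # 0.25 + 0.125 = 0.375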
Definition: A context-free rule (CF-rule) R = A → A1...An is said to appear in a tree t in C if one of the following is true: (1) A is the root of t and A1...An are its direct children (in this order), (2) R appears in the subtree under one of the children of the root of t.
Definition: (VN, VT, S, R) is the CFG underlying the TSG (VN, VT, S, C) iff R is the set {R | rule R appears in a tree in C} (see example in Figure 2).
An item of a CFG is any of its rules of which the right-hand side (rhs) contains a dot3. ITEMS denotes the set of all items of a CFG.
Global assumption: We assume STSGs that have a proper and finitely ambiguous underlying CFG in ECNF.
3 Items serve as notation for parsing.
Example: Given the elem-tree set of a TSG on the left side of this figure, the parse shown on the right side is generated by the derivations (t3 ∘ t1) and ((t3 ∘ t3) ∘ t2). The CFG underlying the TSG has the two rules S → Sb and S → a. The appearances of these rules are represented, resp., by {1,2} and {3,11}, where the naturals in the sets decorate uniquely an appearance of a rule.
Fig. 2: (Left) A tree-set and (Right) a derived parse
Relevant properties: The set of the strings (language) generated by any STSG is a context-free language (CFL). The set of the parses (tree-language) generated by an STSG cannot always be generated by a CFG which generates the same language. For example, consider a TSG with elem-trees {t1, t2} of Figure 2. There exists no CFG which generates both the same language and the same tree-set as this TSG. The set of the paths (path-set), from the root to the frontier, in the parses generated by STSG derivations forms a regular language.
3 Disambiguating an input sentence
To syntactically disambiguate an input sentence, a 'distinguished' structure is assigned to it. This is a step further than mere parsing which
has the goal of discovering the structures which the sentence encapsu lates. Bod 1993 tested two selection criteria for the distinguished structure, namely the most probable derivation (MPD) and the most probable parse (MPP). The present paper is concerned with the computation of the MPD. Algorithms for computing the MPD for stochastically enriched restricted versions of Tree-Adjoining Grammars (TAGs) exist (e.g., Schab es & Wa ters 1993). These algorithms can easily be adapted to STSGs. However, the applications we have in mind assume large STSGs which employ a small set of CFG-rules and a large number of deeper trees. For such STSGs the mentioned algorithms have high time-consumption to the degree that their usefulness maybe questioned. The solution proposed in this paper is tailor ing an algorithm for large STSGs, which achieves acceptable execution-time. Two observations underly the structure of the present algorithm. Firstly, the tree-set generated by an STSG for a given sentence is always a subset of the tree-set generated by the underlying CFG for that sentence. And secondly, each STSG derivation can be represented by a unique decoration of the nodes of the parse it generates. Moreover, since the path set of a given STSG derivation always forms a regular set, over the nodes of the elem-trees which participate in the derivation, then there is a certain constraint on the decorations which correspond to STSG derivations. This constraint is de scribed below and is embedded in the so called viability property. Given an arbitrary decoration of a parse for a given sentence, it is possible to check whether it corresponds to an STSG derivation of that sentence by checking whether it fulfills this viability property. This implies that a characterisa tion of the tree-set of an STSG for a given sentence can be achieved through decorating the trees generated by the underlying CFG for the same sentence in a way which fulfills the viability property.
Fig. 3: The two modes of the viability property
The viability property: Given an STSG (VN, VT, S, C, PT), assign to each non-frontier non-terminal in each elem-tree in C a unique code from a code-domain Π (say the integers), and consider the parse generated by a given derivation. The internal nodes of the parse are decorated by the
codes that originally decorated the elem-trees participating in the given derivation. This specific decoration of the parse corresponds only to the derivation at hand. Clearly, not any decoration of a parse corresponds to a derivation. A closer study of a decorated tree, which results from an STSG derivation, reveals the following property:
1. The code of its start non-terminal S corresponds to the root of an elem-tree. And
2. for any two non-terminals N and Nj, which are parent and its j-th child (j ∈ {1, 2}) in the tree, one of the following two properties holds. Parenthood: N's code, c, and Nj's code, cj, correspond, resp., to a parent and its j-th child in an elem-tree (see right-hand side of Figure 3). Or substitution: N's code, c, appears in an elem-tree with Nj as its open-tree child, and Nj's code, cj, is the code decorating the root of an elem-tree (see left-hand side of Figure 3).
Data structures: The following representation makes the viability property of an STSG explicit. Given an STSG (VN, VT, S, C, PT) in which the non-frontier nodes of its elem-trees are coded uniquely with values from Π (e.g., the naturals), infer three predicates:
1. Parent?(c, cj, j) denotes the proposition "c and cj are resp. the codes of a parent and its j-th child in a tree in C".
2. Root?(c) denotes the proposition "c is the code of the root of a tree in C".
3. OT?(c, j) denotes the proposition "child j (enumeration always from left to right) of the node with code c is an OT".
Now infer the seven-tuple (VN, VT, S, R, A, Viable?, P) where:
• (VN, VT, S, R) is the CFG underlying (VN, VT, S, C, PT),
• Viable?(c, cj, j) = Parent?(c, cj, j) or (OT?(c, j) and Root?(cj)),
• A = {A(R) | R ∈ R}, where A(R = N → α) = {c | c is the code of N for an appearance of R in C},
• P : Π → (Π × {1, 2}) → (0..1]. For c, c' ∈ Π and j ∈ {1, 2}:
P(c')(c, j) = PT(t)   if Viable?(c, c', j) and (t ∈ C, c' = root(t)),
            = 1       else if Viable?(c, c', j),
            = 0       otherwise.
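Read procedurally, the definition above is a pair of table lookups. The dictionary-based encoding below is our own (the text itself only fixes the predicates Parent?, Root? and OT? and the function P); it is meant purely to make the two modes of the viability property concrete.

    def viable(grammar, c, c_child, j):
        # Viable?(c, c_child, j): parenthood inside one elementary tree, or
        # substitution of an elementary tree rooted at c_child at an open node
        return (grammar["parent"].get((c, j)) == c_child or
                (grammar["open"].get((c, j), False) and c_child in grammar["roots"]))

    def P(grammar, c_child, c, j):
        if not viable(grammar, c, c_child, j):
            return 0.0
        if c_child in grammar["roots"]:
            return grammar["pt"][c_child]   # substitution: PT of the new elementary tree
        return 1.0                          # expansion inside the same elementary tree

    # a hypothetical two-code fragment: code 1 has an open second child,
    # code 2 is the root of an elementary tree with PT = 0.5
    g = {"parent": {}, "open": {(1, 2): True}, "roots": {2}, "pt": {2: 0.5}}
    print(P(g, 2, 1, 2))    # 0.5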
The set A(R), in this definition, denotes the set of all appearances (i.e., codes) of a rule R ∈ R in the tree-set. In any decorated parse tree in which c' decorates a node, the term P(c') denotes the probability of c' as a function of the code of its parent and its child-number j (from left to right). It expresses the fact that, in an STSG, the probability of a CF-rule of the
underlying CFG is a function of its particular appearance (code) in the tree-set.
The algorithm: The algorithm is an extension to the CYK (Younger 1967) algorithm for CFGs. Firstly, the parse-space (parse-forest) of the input sentence is constructed on the basis of the CFG underlying the given STSG. Subsequently, the computation of the MPD is conducted on this parse-forest employing a constrained decoration mechanism. For the specification of the algorithm define the set A(item, i, j) to be A(R), where item is R with a dot somewhere on its rhs. And let Max denote the operator max on sets of reals4.
Parse-forest: A compact representation of the parse-space of an input sentence is a data-structure called a parse-forest. A well-known algorithm for constructing parse-forests for CFGs is the CYK (Younger 1967; Aho & Ullman 1972; Jelinek et al. 1990). It constructs for a given sentence w_0^n = w_1 ... w_n a table with entries [i, j], for 0 ≤ i < j ≤ n. Informally speaking, entry [i, j] contains all items A → α·β such that α →*lm w_i^j.
Computing the MPD: Algorithm MPD in Figure 4 computes the MPD. P(w_0^n) denotes the probability of the MPD of the sentence w_0^n. The function Pp : Π × ITEMS × [0, n) × [0, n] → [0, 1] computes the probability of the most probable among the derivations which start with code c and generate a structure for w_i^j. Algorithm MPD can be adapted to computing the probability of the sentence by exchanging every Max with Σ. The polynomiality of its computation follows from that of the CYK and from the fact that the sets A(R) are all bounded in size by a constant. The time-complexity of this algorithm is |R|n³ + |A|²n³. For natural-language tree-grammars, the ratio |A|/|R| is usually quite large (an order of 100 is frequent). Therefore, the term |A|²n³ dominates execution-time. In comparison to the algorithm described in Schabes & Waters (1993), the present algorithm is more suitable for larger STSGs. Its use of a CFG-based mechanism enables, in practice, a faster reduction of the parse-space.
An optimisation: Consider Figure 4, and let itemP and itemCh denote respectively the item to the left of the semicolon and the item that appears in the overbraced term. The 'multiplication' of the two sets A(itemP, i, j) and A(itemCh, l, m) can be conducted in time linear instead of square in |A|.
4 For example, Max_{Pred(x)} A(x) is the maximum on the set {A(x) | Pred(x)}.
For this purpose define the following partitions of these two sets, for k = 1, 2:
HasOT(SET, k) = {c ∈ SET | OT?(c, k)}
HasCh(SET, k) = SET − HasOT(SET, k), where SET = A(itemP, i, j);
RootsOf(SET) = {c ∈ SET | Root?(c)}
InternOf(SET) = SET − RootsOf(SET), where SET = A(itemCh, l, m).
These two partitions result each in two complementary subsets. Notice that a code in HasOT can be in the viability relation only with codes that correspond to roots of elem-trees, i.e., in RootsOf. Moreover, all codes of HasOT are in the viability relation with all codes of RootsOf. This is because all codes in HasOT allow the substitution of exactly the same roots of elem-trees, namely those in RootsOf. Thus, the multiplication of only one member of HasOT with all members of the set RootsOf should be sufficient. The result of this multiplication can then be copied to the rest of the codes in HasOT. This is done in time linear in |A(itemP, i, j)|. On the other hand, the set HasCh comprises codes that have children which are internal to elem-trees, i.e., in InternOf. But each code in HasCh has one and only one unique child in InternOf (and vice versa). The search for this child can be done using binary search in log2(|A(itemCh, l, m)|). However, if in the data-structures one also maintains for each code a reference to each of its children, then direct access is possible.
To exploit these partitions, the right-most Max expression (overbraced in Figure 4) in each of the last three cases of algorithm MPD is rewritten.
Table 1: Disambiguation time on various STSG sizes

num. of elem-trees   |R|   |A|      Avg. Sen.   Average CPU-secs.
                                    Length      linear   Bin. Search   Square
74450                870   436831   9.5         445      993           9143
26612                870   381018   9.5         281      197           2458
19094                870   240619   9.5         131      223           346
19854                870   74719    9.5         —        —             —
: Max
Pp{cι,m,q)(c,l),
cι Є RootsOf(A(itemCh,m,q)) Case
(cЄ
HasOh{A(item,i,j),l))
: If cι Є InternOf(A(itemCh,m,q)) and Parent?(c, cΙ 1) Then Pp{cι,m,q)(c,l) Else 0.
This optimisation does not affect space-complexity (i.e., O(|A|n 2 )). 4
Experimental results
num. of
\n\ |A|
elem.trees
11499 11241 11082 10841
N
CPU
Cover
uw
Secs. -age
415 404 412 410
89208 87767 85295 84560
8 18 8.3 14 8.65 17 8.44 19
82% 76% 75% 71%
16% 16% 18% 21%
Exact SA
Bracket.
Bracket.
Match
Prec. 98.2% 95.8% 97% 97.5%
Recall 78.4% 70.5% 66.2% 63%
38/82 27/76 37/75 35/71
69/82 64/76 65/75 58/71
Table 2: Disambiguation accuracy on ATIS sentences The experiments reported below used the ATIS domain Penn Tree-bank II without modifications. They were carried out, on a Sun Sparc sta tion 10 with 64 MB RAM, parsing ATIS word-sequences (previous DOP experiments concerned PoS-Tag sequences).
44
KHALIL SIMA'AN
Efficiency experiments: The three versions of the algorithm were compared for execution-time varying STSG size. The STSGs were projected by varying the allowed maximum depth of elem-trees and by projecting only from part of the tree-bank. The experiment was conducted, for all versions of the algorithm, on the same 76 sentences randomly selected. The results are listed in Table 1. Average cpu-time includes parse-forest generation. Note the difference in growth of execution-time between the three versions as grammar size grows. Accuracy experiments: In Table 2, various accuracy measures are reported. Coverage is the percentage of sentences that were parsed (sen tences containing unknown-words were not parsed - see bellow). Exact match is percentage of parses assigned by the disambiguator that exactly match test-set counterparts. Sentence accuracy (SA) is the percentage of parses, assigned by the disambiguator, that contain no crossing constituents (i.e., 0-crossing) when compared to their test-set counterparts. Bracketing precision is the average, on all parsed sentences, percentage of brackets as signed by the disambiguator, that do not cross any brackets in the test-parse. Bracketing recall is the average ratio of non-crossing brackets assigned by the disambiguator to the total number of brackets in all test-set parses. U W denotes the percentage of sentences containing unknown-words, and N denotes the average number of words per sentence. In each experiment, a random training-set was obtained from the treebank (485 trees), and the rest (100 sentences) formed the test-set. Training was not allowed on test-set material. Various experiments were carried out changing each time the maximal depth of the elem-trees projected from the training-set as suggested in Bod (1993). However, limiting the depth was not effective in limiting the number of elem-trees (that exceeded 570000 for maximum depth 4) and sacrificed many linguistic dependencies. This became also apparent in the accuracy results. To minimise the number of elem-trees without sacrificing any dependencies we constrained the frontier of the elem-trees instead of their depth 5 . The frontiers of elem-trees are constrained to allow a maximum number of substitution-sites and a max imum number of lexical items per elem-tree. Since each substitution can be viewed as a 'bet' with a certain probability of success, the number of substitution sites should be as small as possible. The number of lexical items is chosen in order to control lexicalisation. Table 2 lists accuracy figures for four experiments on four different tťain/test partitions. The ex5
This constraint does not apply to elem-trees of depth 1.
periments allowed 2 substitution sites and 7 lexical items per elem-tree. These figures are substantially better than those of DOP models that limit depth of elem-trees to 3 or 4. In the reported experiments we did not al low proper-nouns and determiners to lexicalise elem-trees of depth larger than one. We also removed punctuation and markings of empty category from training and test sets. And we did not employ PoS-Tagging since the words lexicalised the elem-trees. The sentences containing unknownwords formed 16-21%. These sentences were not parsed. As far as we know, 97.0% bracketing accuracy, 45% exact-match and 84% 0-crossing sentences are the best figures ever reported on ATIS word-strings. For example, Pereira & Schabes (1992) report around 90.4% bracketing preci sion (on ATIS I PoS-Tag sequences), using the Inside-Outside algorithm for PCFGs. Brill (1993), using Transformation-Based Error-Driven Parsing, reports precision of 91.1% and sentence-accuracy of 60% for experiments with an average test-sentence length 11.3 words. 5
Conclusions
The present optimised algorithm proved vital for experimenting with DOP. As can be seen from the experiments, space and time consumption are or ders of magnitude smaller than those employed in Bod (1993). Extensive experimentation supports constraining the frontier of elementary-trees, sim ilar to η-gram models, when projecting DOP grammars from Tree-banks. It reduces space- and time-complexity and, we suspect, also sparse-data effects. However, further study of the projection mechanism of DOP and other optimisations of the present algorithm is necessary. Acknowledgements. I thank Christer Samuelsson, Rens Bod, Steven Krauwer, and Remko Scha for discussions and comments on an earlier version of the paper. REFERENCES Aho, Alfred V. & Jeffrey, Ullman 1973. The Theory of Parsing, Translation and Compiling. (= Series in Automatic Computation). Englewood Cliffs, New Jersey: Prentice-Hall. Bod, Rens. 1992. "A computational model of language performance: Data Ori ented Parsing". Proceedings of the 14th International Conference on Com֊ putational Linguistics (COLING'92), 855-860. Nantes, Prance. 1993. "Monte Carlo Parsing". Proceedings of the 3rd International Work shop on Parsing Technologies, 1-11. Tilburg/Durbuy.
1995. Enriching Linguistics with Statistics: Performance Models of Natural Language. (= ILLC dissertation series, 14). Ph.D. dissertation, University of Amsterdam, The Netherlands.
Brill, Eric. 1993. "Transformation-Based Error-Driven Parsing". Proceedings of the 3rd International Workshop on Parsing Technologies, 13-25. Tilburg/Durbuy.
Jelinek, Fred, John D. Lafferty & Robert L. Mercer. 1990. Basic Methods of Probabilistic Context Free Grammars. Technical Report IBM RC 16374 (#72684). Yorktown Heights, U.S.A.: IBM.
Joshi, Aravind K. & Yves Schabes. 1992. "Tree-Adjoining Grammars and Lexicalised Grammars". Tree Automata and Languages ed. by M. Nivat & Andreas Podelski, 409-430. Amsterdam: Elsevier Science Publishers.
Magerman, David M. 1995. "Statistical Decision-Tree Models for Parsing". Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), 276-283. Cambridge, Mass.: MIT.
Pereira, Fernando & Yves Schabes. 1992. "Inside-outside reestimation from partially bracketed corpora". Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL'92), 128-135. Newark.
Scha, Remko. 1990. "Taaltheorie en Taaltechnologie; Competence and Performance". Computertoepassingen in de Neerlandistiek, LVVN-jaarboek ed. by Q.A.M. de Kort & G.L.J. Leerdam, 7-22. Almere: Landelijke Vereniging van Neerlandici. [In Dutch.]
Schabes, Yves & Richard Waters. 1993. "Stochastic Lexicalised Context-Free Grammar". Proceedings of the 3rd International Workshop on Parsing Technologies, 257-266. Tilburg/Durbuy.
Sima'an, Khalil, Rens Bod, Steven Krauwer & Remko Scha. 1994. "Efficient Disambiguation by Means of Stochastic Tree-Substitution Grammars". Proceedings of the International Conference on New Methods in Language Processing, 50-58. CCL, UMIST, Manchester.
Vijay-Shanker, K. & David Weir. 1993. "Parsing Some Constrained Grammar Formalisms". Computational Linguistics 19:4.591-636.
Younger, D.H. 1967. "Recognition and Parsing of Context-Free Languages in Time n³". Information and Control 10:2.189-208.
Parsing Repairs
MARCEL CORI*, MICHEL DE FORNEL** & JEAN-MARIE MARANDIN*
*Université Paris 7, **EHESS (CELITE) & *CNRS (URA 1028)
Abstract
The paper deals with the parsing of transcriptions of spoken utterances with self-repair. A syntactic analysis of self-repair is given. A single well-formedness principle accounts for the regularities observed in a large corpus of transcribed conversations: a constituent is a well-formed repair iff it can be substituted into the right edge of the tree which represents the syntactic structure of the interrupted utterance. The analysis is expressed in a PS-grammar. An augmentation of the Earley algorithm is presented which yields the correct inputs for conversational processing.
1 Introduction
If natural language understanding systems are ever to cope with transcrip tions of spoken utterances, they will have to handle the countless self-repairs (or self-corrections) that abound in them. This is a longstanding problem: "hesitations and false starts are a consistent feature of spoken language and any interpreter that cannot handle them will fail instantly" (Kroch & Hindie 1982:162). See also (Kay et al. 1993). The current assumption is that in terruptions and self-repairs should be handled by editing rules which allow the text to be normalised; these rules belong to a kind of adjustment mod ule within the performance device (Fromkin 1973; Kroch & Hindie 1982; Hindie 1983; Labov (pc); Schegloff 1979). We shall lay the foundations of another approach in this paper: interruptions and self-repairs can be directly handled by the syntactic module. Our proposal is based on the observation that "speakers repair in a linguistically principled way" (Levelt 1989:484). The regular character of self-repair has been emphasised in a number of detailed descriptive studies in different fields: linguistics, conversation analysis, psycholinguistics (Blanche-Benveniste 1987; Fornel 1992a, 1992b; Frederking 1988; Levelt 1983, 1989; Schegloff et al. 1977; Schegloff 1979). Among others, Levelt (1989:487) proposes that "self-repair is a syntactically regular process. In order to repair, the speakers tend to follow the normal rules of syntactic coordination". We have shown elsewhere that self-repair
cannot be reduced to a kind of coordination1. Nevertheless the forms of selfrepair are not only regular but they are submitted to a simple geometric principle of well-formedness. This principle is given a formal representa tion in a PS-grammar. It opens a fresh perspective on the parsing of non standard inputs with self-repairs: a simple and principled augmentation of a standard parsing algorithm can handle them. We make the point with the Earley algorithm. 2
2 Characterising self-repair
2.1 The overt characteristics of self-repair
The overt characteristics of self-repair are the following: an utterance is interrupted. The interruption is marked by a number of prosodic or phonetic signals such as cut-offs, pauses, hesitation markers or lengthenings. The interruption is followed by an arbitrary number of constituents which appear to be in a paratactic relation to the interrupted utterance. This is illustrated by the following sample taken from a corpus of transcribed conversations2: (1)
a. elle était:: an- mm irlandaise (.) enfin:: de l'Irlande
b. elle ne sort plus de son:: euh studio
c. mais il faudrait que vous passiez par euh:: (.) par le:: par le numéro du commissariat hein
d. je croyais qu'il était euh:: je croyais qu'il était encore là-bas jusqu'à ce soir3
We shall use the following shorthand convention: O stands for the interrupted utterance, # for any prosodic or phonetic signal and R for the repair.
2
3
The argumentation is summed up in (Cori et al. 1995); it is fully developed in (Fornel & Marandin Forthcoming). The research is based on an extended corpus of spontaneous self-repairs in French (approximatively 2000 occurrences). They are taken from a large body of transcribed audio and video tapes of naturally occurring conversations in various settings (tele phone, everyday conversation, institutional interaction, etc.). We refer the reader to (Schegloff et al. 1977; Schegloff 1979) for the transcription conventions of (1). (l.a) She was En- mm Irish (.) from Ireland; (l.b) she doesn't leave her er studio; (l.c) but you should go through (.) through the (.) through the number of the police station; (l.d) I thought that he was er I thought that he was still there till tonight. In order to limit the word to word glossing of French utterances, we shall use simple forged examples in the following.
PARSING REPAIRS 2.2
49
The structural characteristics of self-repair
The structural features of self-repair are the following: - ) is a segment analysable as a well-formed syntactic unit apart from the fact that one or more sub-constituent(s) may be missing. - B) R is a segment analysable as a single syntactic unit. This unit may be lexical, phrasal or sentential. It is usually a maximal projection (Xmax or S) but need not be. R can be interrupted as can be; this yields what we call a cascaded repair (§3.2 below). Note that any analysis which reduces self-repair to coordination presup poses (B). In this connection, note the difference between (2.a) and (2.b): (2)
a. ?? l'homme avec les lunettes a poussé le clown # avec les mous taches a poussé le clown b. l'homme a donné un coup de poing au # une gifle au clown4
The string avec les moustaches a poussé le clown does not make up a con stituent and thus is not a licit R, whereas une gifle au clown is a licit R since it can be treated as a single constituent, a ghost constituent (Dowty 1988), in a coordination and in a question-answer pair; une gifle au clown is not a maximal projection. - C) R depends on O 5 . The dependency between and R includes two sub-relations: - C1) R repairs a constituent of which immediately precedes R. Hence the ill-formedness of Vhomme avec les lunettes a poussé le clown # avec les moustaches. The PP avec les moustaches cannot repair avec les lunettes over the VP a poussé le clown. - C2) The choice of the category of R depends on O: R is a licit daughter in 0. This is illustrated in (3):
4
5
(2.a) The man with the spectacles pushed the clown # with the moustache pushed the clown; (2.b) the man gave a punch to the # a slap to the clown. (2.a) is judged an ill-formed repair by Levelt (1989:489). We have not encountered repairs such as (2.a) in our corpus. Levelt (1989:486) did observe the fact: "well-formedness of a repair is apparently not a property of its intrinsic syntactic structure. It is rather dependent on its relation to the interrupted original utterance"
50
M. CORI, M. DE FORNEL & J.-M. MARANDIN
(3)
a. Les enfants attendent le bateau # le ferry de Marseille b. Les enfants attendent le # que le bateau vienne c. Les enfants attendent le # bateau c'. Les enfants attendent le ferry # de Marseille6
Any contemporary theory of coordination puts two constraints on each con junct: (i) "each conjunct should be able to appear alone in place of the en tire coordinate structure" (Sag et al. 1985:165); (ii) each conjunct shares at least one feature with the other (categorial identity being the most frequent case). Self-repair does not have to meet the latter constraint (ii): this is why it cannot be reduced to a coordinate structure. On the other hand, it has to meet the former. - D) R completes O. The intuition which underlies the notion of repair is the following: when R is interpreted as a repair, R is interpreted as a constituent in O, it may, or may not, replace a constituent partially or completely realised in O. For example the sequences O#R in (3.a) and in (3.c') are interpreted as (4) would be; in (3.a) the NP le ferry de Marseille replaces le bateau whereas in (3.C') the PP de Marseille replaces nothing. (4)
3
Les enfants attendent le ferry de Marseille
Analysing self-repair
3.1
Syntactic well-formedness
Generalisation () which characterises the relation holding between and R can be unified in a single principle, the principle of the right edge (REP) 7 : (5)
A constituent R is a well-formed repair for iff it can be substituted into the right edge of the interrupted O.
The interrupted part of (3.b) Les enfants attendent le # may be repaired with an R of category N bateau, NP le ferry, VP espèrent que le bateau viendra or S Les enfants espèrent que le bateau viendra (the constituency 6
7
(3.a) The children wait for the boat # the ferry to Marseille; (3.b) that the boat arrives. Principle (5) is reminiscent of the Major Constituent Constraint on gapping (Hankamer 1973; Gardent 1991). On the status of the right edge for discourse processes, see (Gardent 1991; Prüst 1993).
PARSING REPAIRS
51
requirement involves categorial identity) or with S'[que] que le bateau vi enne (in accordance with the sub-categorisation requirement of the verb attendre). This is illustrated in Figure 1 8 .
Fig. 1: Illustration of the licensing pńnciple Principle (5) prevents ill-formed repairs such as (2.a) above. It accounts for all types of self-repair (reformulation, lemma substitution and restart) 9 . 3.2
Cascaded repair
A repair R itself can be interrupted and it can be repaired. Examples are given in (l.a) and (l.c) where the phonetic signal is followed by a "string of Rs" 10 . The sequence can be schematised as # R 1 # R 2 . . . R m . The REP needs not be augmented or modified to handle this case once we have made precise the structures acting as in the cascade. For example: (6) 8 9
10
Les enfants attendent le # le bateau de # qui va à Marseille
The category U stands for Utterance. On the contrary, the reduction of self-repair to coordination leads to distinguish three different processes (De Smedt & Kempen 1987). Blanche-Benveniste (1987) proposed that the Rs form a coordinate structure. See Fornel & Marandin (Forthcoming) for counter argumentation.
52
M. CORI, M. DE FORNEL & J.-M. MARANDIN
R1 {le bateau de) can be substituted into . R2 {qui va à Marseille) cannot be (?? les enfants attendent # qui va à Marseille). On the other hand, it can be substituted into the "new" configuration TV which is obtained by substituting R1 into {les enfants attendent le bateau de). Cascaded repairs result from the iteration of repair. Repair always in volves only one and one R at a time. The tree obtained by substituting R1 into gives TV which becomes the for repair R2 and so forth. 3.3
Interpreting
#R
The interpretation of #R is built on the tree TV obtained by substituting R into . Thus R is treated as a repair. For example, the interpretation of an utterance such as (3.b) Les enfants attendent le # que le bateau vienne discards the interrupted NP le # and is derived from the tree TV which is the repaired tree: Les enfants attendent que le bateau vienne. The main implication for the interpretation of #R is the following: the recovery of the interpretation is parallel to the licensing of the category of R. Once R is recognised as a constituent of 0 , no specific rule of interpretation has to be called for; the configuration TV is interpreted exactly in the same way that a canonical configuration would be 11 .
3.4
Parsing self-repairs
The analysis allows a simple solution to the problem of parsing an input # R . The relevant features are the following: (i) R is a licit daughter of and (ii) R is a daughter on the right edge of according to the REP. (The REP restricts the choice of categories for R). Thus the input #R can be parsed with a classical algorithm such as Earley and as easily as any other input. Moreover, the same kind of ambiguity encountered in the parsing of canonical inputs arises: attachment ambiguity. For example, Marie in (7.a) can be substituted to Paul or to la femme de l;likewise in (7.b) le professeur Tournesol (...) can be substituted under S' or U. Here lies the other drawback of the reduction of self-repair to coordination: in a coordinate structure, the well-formedness constraints are distinguished from the inter pretative rules which depend on the choice of the conjunctions. If self-repair were a kind of coordination, its semantics should be given a separate and specific formulation. This does not seem plausible.
PARSING REPAIRS (7)
53
Jean aime la femme de Paul # Marie Tournesol m'a dit que l'élève # le professeur Tournesol m'a dit que l'élève n'était pas au point 12 We propose in (Fornel & Marandin Forthcoming) a heuristic rule that mini mises the attachment ambiguity.
4
a. b.
R e p r e s e n t i n g self-repair
Self-repair receives a straightforward formal representation in a PS-grammar. We first define the notions of interrupted tree and right subtree. Let G — (V T , V N , R , U) be a CF-grammar where VT is a terminal vocab ulary, VN a non-terminal vocabulary, U Є VN the axiom, and where the rules are numbered from 1 to n. Each rule i is left(ί) → right(i); λ(i) is the length of right (i); rightj(i) is th j-th symbol in right(i). We assume that there are no rules with right(i) being the empty string. An elementary tree is associated with each rule. Complex non punctual trees are represented by leftmost derivations: A = (i1... ip). root(A) is the label of the root of A. Definition 1 An interrupted tree, written A = (i1 . . . ik-1ik[l]ik+i...ip), is such that the l-th leaf of i k is a terminal leaf of the tree A = ( i 1 . . ik . ·. ip) (i.e., a leaf labelled with a symbol taken in VT), all nodes preceding this leaf (according to the precedence order) dominate terminal leaves of Ճ, and all nodes following this leaf are leaves of A. Definition 2 An elementary right subtree (ERS) of an interrupted tree A = (i1... ik-1ik[l]ik+1 ...ip) is defined as follows: (i) i 1 is an ERS of
(ii) if ij is an ERS of A and if right(ij) = αY, Y being a non-terminal symbol, if all non-terminal leaves of ij are roots of elementary trees in A, then the last one of these elementary trees, ¿ J+s , is also an ERS of A. If j + s = , we must have l ≥ λ(i j + s ) — 1. Definition 3 If ir is an ERS of A, then (ir... ip) is a right subtree of A. Right edge principle. We consider an interrupted tree — [i1... i k [l)... ip) such that root(0) = U and a tree R = (j1.. .jq). (7.a) Jean loves the wife of Paul # Marie; (7.b) Tournesol told me that the student # Professor Tournesol told me that the student was not ready.
54
M. CORI, M. DE FORNEL & J.-M. MARANDIN
(8) R is a well formed repair for iff either root(R) = U or there is an ERS ir of and a rule ξ in the grammar such that left(ir) = left{ξ) and right(ir) — pX and right(ξ) = ρ root(R) with X Є VN U VT. Repaired tree. A repaired tree iV is obtained by substituting R for a right subtree of O: N = ( i 1 . . . ir-1ξj1... j q ). R is then a right subtree of N. Note that lexical repair is not a special case; it corresponds to the case of a punctual R tree. Cascaded repairs. We have two sequences: (i) N 0 , N 1 , . . . , Nm where N 0 , N1,..., N m _ 1 are interrupted trees and N m a complete tree such that root(N0) = U, and (ii) R 1 , . . . , Rm where R1 . . . , R m - 1 are interrupted trees (interrupted re pairs), and R=is a complete tree. Condition (8) is verified for each pair N i _ 1 , R i . Ni is a new tree obtained from Ni-1 and Ri. N m is the repaired tree of the cascade. 5
A n augmented Earley algorithm for repair
We show how to augment the Earley algorithm (Earley 1970) to parse in terrupted inputs with repairs. 5.1
String representations
Let LF be a set of lexical forms. A categorisation function associates a set of terminals with each lexical form u: cat(u) C VT. A representation of a string 1u2... un Є LF* is given by a tree A such that root(A) = U and such that if the ordered sequence of the leaves of A is z1z2 ... zN, then for each i, zi Є cat(ui). A string may be represented by an interrupted tree A= ( i 1 . . . ik[l]...ip) by taking z1z2... zq where zq is the lth leaf of ik. 5.2
Augmentations to the standard algorithm
A type is added in the definition of the states: • right (vs left) indicates whether an elementary tree is an ERS; • cut distinguishes the states involved in the building of interrupted trees. We add the following to the definition of the operations:
PARSING REPAIRS
55
• predict: [1.1.2] and [1.2] below are added to send into the set S m + 1 , which contains the initial states for R, all elementary trees which may dominate R. • scan: [2.1.2], [2.2] and [2.3] are added to handle the replacement of punctual subtrees of O. • complete: [3.2] is added in order to obtain a representation of the interrupted trees in addition to the straightforward output of the algorithm: the repaired trees N. 5.3
The augmented algorithm
The input data of the algorithm is a grammar G = (VT, VN,R, U) and a string u u1... um #um+2um_+3 ... um+p+1 where each u1 Є LF. We add to the grammar a rule numbered 0 such that right(0) = U and left(0) Є VN. The algorithm builds a sequence of sets, 50,51,..., Sm+p+1, made of states. A state is a 5-uple (q, j, k,t, a) where q is a rule, j is a position in right(q) (0 < j < λ(q)), is a set number (0 < < m + p + 1), t is a type (right, left or cut), α is a string (the current result). The initial state (0,0,0, right, ε) is entered into S0. Consider the state 0 and j = X(q) — 1 and t = right then add also ((q, 0, , right, ε) to sm+i. [1.2] If i = m then [1.2.1] if j > 0 then for each (q', j', k', right, β) ε Sk such that rightj+1(q') = left(q) and ƒ = λ(q') — 1, for each rule ξ such that left(q') = left(ξ) and right(q') = pX and right (ξ) = pY, for each rule r such that left(r) = Y, add T ) , M 1 [S/(N/N):(B (C* (the flag)) is)]-[N\N:white]-[CONJN:and]-N\N:red] (>B) [S:((B ( C * (the flag)) is)}-[N\N:white}-[CONJN:and}-N\N:red] (>) [S/(N/N):(B (C* (the flag)) is)]-[N\N:white]-[CONJN:and]-[N\N:red] (>dec) [S/(N/N):(B (C* (the flag)) is)]-[N\N:(and white red)] () [S:((B ( C * (the flag)) is)(and white red))] (>dec)
Genotype (9-11) 9 [S:((B ( C * (the flag)) is)(and white red))] 10 [S:((C * (the flag))(is (and white red)))] 11 [S:((is(and white red))(the
flag))]
(B) (C*)
Other examples and more details are provided in (Biskri 1995). Analysises are implemented. Here, we do not give the details of the algorithm.
APPLICATIVE AND COMBINATORY CATEGORIAL GRAMMAR 83 6
Conclusion
We have presented a model of analysis within the framework of Applicative Cognitive Grammar that realises the interface between syntax and semantic. For many French examples this model is able to realise the following aims: • to produce an analysis which verifies the syntactic correction of state ments. • to develop automatically the predicative structures that yield the func tional semantic interpretation of statements. Moreover, this model has the following characteristics: 1. We do not make any calculus parallel to syntactic calculus like Monta gue's one (1974). A first calculus verifies the syntactic correction, this calculus is carried on by a construction of functional semantic interpretation. This has been made possible by the introduction of combinators to some specific positions of syntagmatic order. 2. We introduce some components of functional semantic by some ap plicative syntactic tools (combinators). 3. We calculate the functional semantic interpretation by some applicat ive syntactic methods (combinators reduction). In order to sum up, we interpret by means of absolute syntactic techniques. The distinction syntax/semantic should be then thought again in another perspective. REFERENCES Ades, Anthony & Mark Steedman. 1982. "On the Order of Words". Linguistics and Philosophy 4.517-558. Barry, Guy & Martin Pickering. 1992. "Dependency and Constituency in Cat egorial Grammar". Word Order in Categorial Grammar / L'ordre des mots dans les grammaires catégorielles ed. by Alain Lecomte, 38-57. ClermontFerrand: Adosa. Biskri, Ismail. 1995. La Grammaire categorielle combinatoire applicative dans le cadre de la grammaire applicative et cognitive. Ph.D. dissertation, EHESS, Paris. Buszkowski, Wojciech, W. Marciszewsk: & Joan Van Benthem. 1988. Categorial Grammar. Amsterdam & Philadelphia: John Benjamins. Curry, Haskell . & Robert Feys. 1958. Combinatory Logic. vol.I, Amsterdam: North-Holland. Deselès, Jean-Pierre. 1990. Langages applicatifs, langues naturelles et cognition. Paris: Hermes.
84
ISMAIL BISKRI & JEAN-PIERRE DESCLÈS & Frederique Segond. 1992. "Topicalisation: Categorial Analysis and Ap plicative Grammar". Word Order in Categorial Grammar ed. by Alain Le comte, 13-37. Clermont-Ferrand: Adosa.
Haddock, Nicholas. 1987. "Incremental Interpretation and Combinatory Cat egorial Grammar". Working Papers in Cognitive Science, I: Categorial Gram mar, Unification Grammar and Parsing ed. by Nicholas Haddock et al., 7184. University of Edinburgh. Lecomte, Alain. 1994. Modeles logiques en théorie linguistique: Éléments pour une théorie informationnelle du langage. Work synthesis. Grenoble:: Uni versité de Grenoble. Moortgat, Michael. 1989. Categorial Investigation, Logical and Linguistic pects of the Lambek Calculus. Dordrecht: Foris.
As
Oehrle, Richard T., Emmon Bach & Deidre Wheeler. 1988. Categorial Grammars and Natural Languages Structures. Dordrecht: Reidel. Pareschi, Remo & Mark Steedman. 1987. "A Lazy Way to Chart Parse with Categorial Grammars". Proceeding of the 27th Annual Meeting of the Asso ciation for Computational Linguistics (ACL'87). Stanford. Shaumyan, Sebastian K. 1987. A Semiotic Theory of Natural Language. Bloom ington: Indiana University Press. Steedman, Mark. 1989. Work in Progress: Combinators and Grammars in Nat ural Language Understanding. Summer Institute of Linguistics, Tucsoni, Uni versity of Arizona. Szabolcsi, Anna. 1987. "On Combinatory Categorial Grammar". Proceeding of the Symposium on Logic and Language, 151-162. Budapest: Akademiai Kiadó.
PARSETALK
about Textual Ellipsis
U D O HAHN & MICHAEL STRUBE
Freiburg University Abstract We present a hybrid methodology for the resolution of textual ellipsis. It incorporates conceptual proximity criteria applied to ontologically well-engineered domain knowledge bases and an approach to cen tering based on functional topic/comment patterns. We state gram matical predicates for textual ellipsis and then turn to the procedural aspects of their evaluation within the framework of an actor-based implementation of a lexically distributed parser. 1
Introduction
Text phenomena, e.g., textual forms of anaphora or ellipsis, are a particu larly challenging issue for the design of natural language parsers, since lack ing recognition facilities either result in referentially incohesive or invalid text knowledge representations. At the conceptual level, textual ellipsis (also called functional anaphora) relates an elliptical expression to its ante cedent by conceptual attributes (or roles) associated with that antecedent (see, e.g., the relation between "Zugriffszeit" (access time) and "Laufwerk" (hard disk drive) in (3) and (2) below). Thus it complements the phe nomenon of nominal anaphora (cf. Strube & Hahn 1995), where an ana phoric expression is related to its antecedent in terms of conceptual gener alisation (as, e.g., "Rechner" (computer) refers to "LTE-Lite/25" (a partic ular notebook) in (2) and (1) below). The resolution of text-level anaphora contributes to the construction of referentially valid text knowledge repres entations, while the resolution of textual ellipsis yields referentially cohesive text knowledge bases. (1) Der LTE-Lite/25 wird mit der ST-3141 von Seagate ausgestattet. (The LTE-Lite/25 is - with the ST-3141 from Seagate - equipped.) (2) Der Rechner hat durch dieses neue Laufwerk ausreichend Platz für WindowsProgramme. (The computer provides - because of this new hard disk drive - sufficient storage for Windows programs.) (3) Darüber hinaus ist die Zugriffszeit mit 25 ms sehr kurz. (Also - is - the access time of 25 ms - quite short.)
86
UDO HAHN & MICHAEL STRUBE
Fig, 1: Fragment of the information technology domain knowledge base In the case of textual ellipsis, the conceptual entity that relates the topic of the current utterance to discourse elements mentioned in the preceding one is not explicitly mentioned in the surface expression. Hence, the missing conceptual link must be inferred in order to establish the local coherence of the whole discourse (for an early statement of that idea, cf. Clark (1975)). For instance, in (3) the proper conceptual relation between "Zugriffszeit" (access time) and "Laufwerk" (hard disk drive) must be determined. This relation can only be made explicit if conceptual knowledge about the domain is supplied. It is obvious (see Figure 11) that the concept A C C E S S - T I M E is bound in a direct associative or aggregational relation, viz. access-time, to the concept H A R D - D I S K - D R I V E , while its relation to the instance LTEL I T E - 2 5 is not so tight (assuming property inheritance). A relationship between A C C E S S - T I M E and S T O R A G E - S P A C E or SOFTWARE is excluded at the conceptual level, since they are not linked via any conceptual role. 1
The following notational conventions apply to the knowledge base for the information technology domain to which we refer throughout the paper (see Figure 1): Angular boxes from which double arrows emanate contain instances (e.g., LTE-LITE 2 5), while rounded boxes contain generic concept classes (e.g., NOTEBOOK). Directed unlabelled links relate concepts via the isa relation (e.g., NOTEBOOK and COMPUTER-SYSTEM), while links labelled with an encircled square represent conceptual roles (definitional roles are marked by "d"). Their names and value constraints are attached to each circle (e.g., COMPUTER-SYSTEM - has-central-unit - CENTRAL-UNIT, with small ital ics emphasising the role name). Note that any sub concept or instance inherits the conceptual attributes from its superconcept or concept class (this is not explicitly shown in Figure 1).
PARSETALK ABOUT TEXTUAL ELLIPSIS
87
Nevertheless, the association of concepts through conceptual roles is far too unconstrained to properly discriminate among several possible antecedents in the preceding discourse context. We therefore propose a basic heur istic for conceptual proximity, which takes the path length between concept pairs into account. It is based on the common distinction between concepts and roles in classification-based terminological reasoning systems (cf. MacGregor (1991) for a survey). Conceptual proximity takes only conceptual roles into consideration, while it does not consider the generalisation hier archy between concepts. The heuristic can be phrased as follows: If fully connected role chains between the concepts denoted by a possible ante cedent and an elliptical expression exist via one or more conceptual roles, that particular role composition is preferred for the resolution of textual ellipsis whose path contains the least number of roles. Whenever several connected role chains of equal length exist, functional constraints which are based on topic/comment patterns apply for the selection of the proper ante cedent. Hence, only under equal-length conditions grammatical information from the preceding sentence is brought into play (for a precise statement in terms of the underlying text grammar, cf. Table 5 in Section 4). To illustrate these principles, consider the sentences (1)-(3) and Fig ure 1. According to the convention above H A R D - D I S K - D R I V E is conceptu ally most proximate to the elliptical occurrence of A C C E S S - T I M E (due to the direct conceptual role linking H A R D - D I S K - D R I V E - access-time -A C C E S S T I M E with unit length 1), while the relationship between L T E - L I T E - 2 5 and A C C E S S - T I M E exhibits a greater conceptual distance (counting with unit length 2, due to the composition of roles between L T E - L I T E - 2 5
has-hd-drive ֊ H A R D - D I S K - D R I V E - access-time ֊ A C C E S S - T I M E ) . 2
Ontological engineering for ellipsis resolution
Metrical criteria incorporating path connectivity patterns in network-based knowledge bases have often been criticised for lacking generality and in troducing ad hoc criteria likely to be invalidated when applied to different domain knowledge bases (DKB). The crucial point about the presumed un reliability of path-length criteria addresses the problem how the topology of such a network can be 'normalised' such that formal distance measures uniformly relate to intuitively plausible conceptual proximity judgements. Though we have no formal solution for this correspondence problem, we try to eliminate structural idiosyncrasies by postulating two ontology engineer ing (OE) principles (cf. also Simmons (1992) and Mars (1994)):
88
UDO HAHN & MICHAEL STRUBE
1. Clustering into Basic Categories. The specification of the upper level of the ontology of some domain (e.g., information technology (IT)) should be based on a stable set of abstract, yet domain-oriented ontologicai categories inducing an almost complete partition on the en tities of the domain at a comparable level of generality (e.g., hardware, software, companies in the IT world). Each specification of such a ba sic category and its taxonomic descendents constitutes the common ground for what Hayes (1985) calls clusters and Guha & Lenat (1990) refer to as micro theories, i.e., self-contained descriptions of concep tually related proposition sets about a reasonable portion of the commonsense world within a single knowledge base partition (subtheory). 2. Balanced Deepening. Specifications at lower levels of that onto logy, which deal with concrete objects of the domain (e.g., notebooks, laser printers, hard disk drives in the IT world), must be carefully balanced, i.e., the extraction of attributes for any particular category should proceed at a uniform degree of detail at each decomposition level. The ultimate goal is that any subtheory have the same level of representational granularity, although these granularities might differ among various subtheories (associated with different basic categories). Given an ontologically well-engineered DKB, the ellipsis resolution problem, finally, has to be projected from the knowledge to the symbol layer of repres entations. By this, we mean the abstract implementation of knowledge rep resentation structures in terms of concept graphs and their emerging path connectivity patterns. At this level, we draw on early experiments from cognitive psychologists such as Rips et al. (1973) and more recent research on similarity metrics (Rada et al. 1989) and spreading-activation-based inferencing, e.g., by Charniak (1986). They indicate that the definition of proximity in semantic networks in terms of the traversal of typed edges (e.g., only via generalisation or via attribute links) and the corresponding counting of nodes that are passed on that traversal is methodologically valid for computing semantically plausible connections between concepts.2 The OE principles mentioned above are supplemented by the following linguistic regularities which hold for textual ellipsis: 1. Adherence to a Focused Context. Valid antecedents of elliptical expressions mostly occur within subworld boundaries (i.e., they remain within a single knowledge base cluster, micro theory, etc.). Given the 2
An alternative to simple node counting for the computation of semantic similarity, which is based on a probabilistic measure of information content, has recently been proposed by Resnik (1995).
PARSETALK ABOUT TEXTUAL ELLIPSIS
89
OE constraints (in particular, the one requiring each subworld to be characterised by the same degree of conceptual density), path length criteria make sense for estimating the conceptual proximity. 2. Limited Path Length Inference. Valid pairs of possible ante cedents and elliptical expressions denote concepts in the DKB whose conceptual relations (role chains) are constructed on the basis of rather restricted path length conditions (in our experiments, no valid chain ever exceeded unit length 5). This corresponds to the implicit require ment that these role chains must be efficiently computable. 3
Functional centering principles
Conceptual criteria are of tremendous importance, but they are not suffi cient for the proper resolution of textual ellipsis. Additional criteria have to be supplied in the case of equal role length for alternative antecedents. We therefore incorporate into our model various functional criteria in terms of topic/comment patterns which originate from (dependency) structure ana lyses of the underlying utterances. The framework for this type of informa tion is provided by the well-known centering model (Grosz et al. 1995). Ac cordingly, we distinguish each utterance's backward-looking center (Cb(Un)) and its forward-looking centers (Cf(Un)). The ranking imposed on the ele ments of the Cf reflects the assumption that the most highly ranked element of Cf(Un) is the most preferred antecedent of an anaphoric or elliptical ex pression in the utterance U n+1 , while the remaining elements are (partially) ordered according to decreasing preference for establishing referential links. The main difference between the original centering approach and our proposal concerns the criteria for ranking the forward-looking centers. While Grosz et al. assume (for the English language) that grammatical roles are the major determinant for the ranking on the C f , we claim that for German - a language with relatively free word order - it is the functional informa tion structure of the sentence in terms of topic/comment patterns. In this framework, the topic (theme) denotes the given information, while the com ment (rheme) denotes the new information (for surveys, cf. Danes (1974) and Dahl (1974)). This distinction can easily be rephrased in terms of the centering model. The theme then corresponds to the C b (U n ), the most highly ranked element of (Cf(Un_1) which occurs in Un. The theme/rheme hierarchy of Un is determined by the (C f (U n _ 1 ): elements of Un which are contained in Cf(Un-1) (context-bound discourse elements) are less rhematic than elements of Un which are not contained in ( C f ( U n - 1 ) (unbound ele-
90
UDO HAHN & MICHAEL STRUBE
ments). The distinction between context-bound and unbound elements is important for the ranking on the Cf, since bound elements are generally ranked higher than any other non-anaphoric elements. The rules for the ranking on the Cf are summarised in Table 1. They are organised at three layers. At the top level, >TCbase denotes the basic relation for the overall ranking of topic/comment (TC) patterns. The second relation in Table 1, > TCboundtype denotes preference relations exclusively dealing with multiple occurrences of bound elements in the preceding utterance. The bottom level of Table 1 is constituted by >prec, which covers the prefer ence order for multiple occurrences of the same type of any topic/comment pattern, e.g., the occurrence of two anaphora or two unbound elements (all heads in a sentence are ordered by linear precedence relative to their text position). The proposed ranking, though developed and tested for German, prima facie not only seems to account for other free word order languages as well but also extends to fixed word order languages like English, where grammatical roles and information structure, unless marked, coincide. Table 1: Functional ranking on Cf based on topic/comment patterns context-bound element(s) >TCbase unbound element(s) anaphora >TCboundtype elliptical antecedent >TCboundtype elliptical expression nominal head1 >prec nominal head2 >prec ... >prec nominal headn Given these basic relations, we may define the composite relation >TC (cf. Table 2). It summarises the criteria for ordering the items on the forwardlooking centers CF (X and y denote lexical heads). Table 2: Global topic/comment
relation
>TC := { (x, ) | if χ and y both represent the same type of TC patterns then the relation >prec applies to x and y else if x and y both represent different forms of bound elements then the relation >TCboundtype applies to x and y else the relation >TCbase applies to x and y } 4
Grammatical predicates for textual ellipsis
We here build on the ParseTalk model, a fully lexicalised grammar theory which employs default inheritance for lexical hierarchies (Hahn et al. 1994). The grammar formalism is based on dependency relations between lexical
PARSETALK ABOUT TEXTUAL ELLIPSIS
91
heads and modifiers at the sentence level. The dependency specifications3 allow a tight integration of linguistic knowledge (grammar) and conceptual knowledge (domain model), thus making powerful terminological reasoning facilities directly available for the parsing process. Accordingly, syntactic analysis and semantic interpretation are closely coupled. The resolution of textual ellipsis is based on two criteria, a structural and a conceptual one. The structural condition is embodied in the predicate is ΡotentialElliptic Antecedent (cf. Table 3). An elliptical relation between two lexical items is restricted to pairs of nouns. The elliptical phrase which occurs in the n-th utterance is restricted to a definite NP, the antecedent must be one of the forward-looking centers of the preceding utterance. Table 3: Grammar predicate for a potential elliptical antecedent
Į
isPotentialEllipticAntecedent (x, y, η) :⇔ x isac* Noun Λ isac* Noun Λ 3 ζ: (y head ζ Λ ζ isac* DetDefinite) Λ y Є Un Λ x Є Cf(Un-1)
The function Proximity Score (cf. Table 4) captures the basic conceptual condition in terms of the role-related distance between two concepts. More specifically, there must be a connected path linking the two concepts under consideration via a chain of conceptual roles. Finally, the predicate PreferredConceptualBridge (cf. Table 5) combines both criteria. A lexical item χ is determined as the proper antecedent of the elliptical expression y if it is a potential antecedent and if there exists no alternative antecedent ζ whose Proximity Score either is below that of χ or, if their ProximityScore is equal, whose strength of preference under the TC relation is higher than that of x. 3
We assume the following conventions to hold: = {Word, Nominal, Noun, DetDefin ite,...} denotes the set of word classes, and isac = {(Nominal, Word), (Noun, Nominal), (DetDefinite, Nominal),...} cCxC denotes the subclass relation which yields a hierarch ical ordering among these classes. The concept hierarchy consists of a set of concept names F = {COMPUTER-SYSTEM, NOTEBOOK, ACCESS-TIME, T I M E - M S - P A I R , . . . }
(cf. Figure 1) and a subclass relation isaF = {(NOTEBOOK, COMPUTER-SYSTEM), (ACCESS-TIME, TIME-MS-PAIR),...} F x F. The set of role names R = [has-part, has-hd-drive, has-property, access-time,...} contains the labels of admitted conceptual roles. These role names are also ordered in terms of a conceptual hierarchy, viz. isaR = {(has-hd-drive, has-part), (access-time, has-property),...} ΊΖ x ΊΖ. The relation permit F x R x F characterises the range of possible conceptual roles among con cepts, e.g., (HARD-DISK-DRIVE, access-time, ACCESS-TIME) Є permit. Furthermore, object. refers to the concept denoted by object, while head denotes a structural
92
UDO HAHN & MICHAEL STRUBE ProximityScore (from- concept, to-concept)
Table 4: Conceptual distance function
Ι
PreferredConceptualBridge (χ, y, η) :⇔ isPotentialEllipticAntecedent (χ, y, n) Λ - z : isPotentialEllipticAntecedent (ζ, y, n) Λ ( ProximityScore (z., .) < ProximityScore(x.c, y.x) V ( ProximityScore (z.c, y.x) = ProximityScore (x.x, .) Λ z >TC x ) ) Table 5: Preferred conceptual bridge for textual ellipsis
5
Text cohesion parsing: Ellipsis resolution
The actor computation model (Agha & Hewitt 1987) provides the back ground for the procedural interpretation of lexicalised grammar specifica tions in terms of so-called word actors (Hahn et al. 1994). Word actors communicate via asynchronous message passing; an actor can only send messages to other actors it knows about, its so-called acquaintances. The arrival of a message at an actor triggers the execution of a method that is composed of grammatical predicates, as those given in the previous section. The resolution of textual ellipsis depends on the results of the resolution of nominal anaphora and on the termination of the semantic interpretation of the current sentence. A SearchTextEllipsisAntecedent message will only be triggered at the occurrence of the definite noun phrase NP when NP is not a nominal anaphor and NP is not already connected via a Pof-type relation (e.g., property-of, physical-part-of)4. 4
relation within dependency trees, viz. χ being the head of y. Associated with the set R is the set of inverse roles R-1. This distinction becomes important for already established relations like has-property (subsuming access-time, etc.) or has-physical-part (subsuming has-hd-dnve, etc.) insofar as they do not block the initialisation of the ellipsis resolution procedure, whereas the existence of their inverses, we here refer to as Pof-type relations, viz. property-of (subsuming accesstime-of, etc.) and physical-part-of (subsuming hd-drive-of etc.), does. This is simply due to the fact that the semantic interpretation of a phrase like "the access time of the new hard disk drive", as opposed to that of its elliptical counterpart "the access time" in sentence (3), where the genitive object is elliptified (zeroed), already leads to the creation of the Pof-type relation the ellipsis resolution mechanism is supposed to determine. This blocking condition has been proposed and experimentally validated by Katja Markert.
PARSETALK ABOUT TEXTUAL ELLIPSIS
93
Der Rechner hat durch dieses neue Laufwerk ausreichend Platz für Windows-Programme. Darüber hinaus ist die Zugriffszeit mit 25 ms sehr kurz. The computer provides - because of this new HD-drive - sufficient storage for Windows programs. Also - is - the access time of 25 ms - quite short.
Fig. 2: Sample parse for text ellipsis resolution The message passing protocol for establishing cohesive links based on the recognition of textual ellipsis consists of two phases: 1. In phase i, the message is forwarded from its initiator to the sentence delimiter of the preceding sentence, where its state is set to phase 2. 2. In phase 2, the sentence delimiter's acquaintance Cf is tested for the predicate PreferredConceptualBridge. Note that only nouns and pronouns are capable of responding to the SearchTextEllipsis Antecedent message and of being tested as to whether they fulfil the required criteria for an elliptical relation. If the text ellipsis predic ate PreferredConceptualBridge succeeds, the determined antecedent sends a TextEllipsisAntecedentFound message to the initiator of the SearchTextEllipsisAntecedent message. Upon receipt of the AntecedentFound message, the discourse referent of the elliptical expression is conceptually related to the antecedent's referent via the most specific (common) Pof-type relation, thus preserving local coherence at the conceptual level of text propositions. In Figure 2 we illustrate the protocol for establishing elliptical rela tions by referring to the already introduced text fragment (2)-(3) which is repeated at the bottom line of Figure 2. Sentence (3) contains the def inite NP die Zugriffszeit (the access time). Since, at the conceptual level, A C C E S S - T I M E does not subsume any lexical item in the preceding text (cf. Figure 1), the anaphora test fails. The conceptual correlate of die Zugriffs zeit has also not been integrated in terms of a Pof-type relation into the conceptual representation of the sentence as a result of the semantic inter pretation. Consequently, a S'earchTextEllipsisAntecedent message is created by the word actor for Zugriffszeit. That message is sent directly to the sentence delimiter of the previous sentence (phase 1), where the predicate PreferredConceptualBridge is evaluated for the acquaintance Cf (phase 2).
94
UDO HAHN & MICHAEL STRUBE
The concepts are examined in the order given by the C f , first L T E - L I T E - 2 5 (unit length 2), then S E A G A T E - S T - 3 1 4 1 (unit length 1). Since no paths shorter than those with unit length 1 can exist, the test terminates. Even if another item in the centering list following S E A G A T E - S T - 3 1 4 1 would have this shortest possible length, it would not be considered due to the functional preference given to S E A G A T E - S T - 3 1 4 1 in the Cf. Since S E A G A T E - S T 3 1 4 1 has been tested successfully, a TextEllipsisAntecedentFound message is sent to the initiator of the SearchAntecedent message. An appropriate up date links the corresponding instances via the role access-time-of'and, thus, local coherence is established at the conceptual level of the text knowledge base. 6
C o m p a r i s o n with related approaches
As far as proposals for the analysis of textual ellipsis are concerned, none of the standard grammar theories (e.g., HPSG, LFG, GB, CG, TAG) covers this issue. This is not surprising at all, as their advocates pay almost no attention to the text level of linguistic description (with the exception of several forms of anaphora) and also do not take conceptual criteria as part of grammatical descriptions seriously into account. More specifically, they lack any systematic connection to well-developed reasoning systems accounting for conceptual knowledge of the underlying domain. This latter argument also holds for the framework of DRT, although Wada (1994) deals with restricted forms of textual ellipsis in the DRT context. Also only few systems exist which resolve textual ellipses. As an ex ample, consider the PUNDIT system (Palmer et al. 1986), which provides an informal solution for a particular domain. We consider our proposal superior, since it provides a more general, domain-independent treatment at the level of a formalised text grammar. The approach reported in this paper also extends our own previous work on textual ellipsis (Hahn 1989) by the incorporation of a more general proximity metric and an elaborated model of functional preferences on Cf elements which constrains the set of possible antecedents according to topic/comment patterns. 7
Conclusion
In this paper, we have outlined a model of textual ellipsis parsing. It con siders conceptual criteria to be of primary importance and provides a prox imity measure in order to assess various possible antecedents for consider ation of proper bridges (Clark 1975) to elliptical expressions. In addition,
PARSETALK ABOUT TEXTUAL ELLIPSIS
95
functional constraints based on topic/comment patterns contribute further restrictions on elliptical antecedents. The anaphora resolution module (Strube & Hahn 1995) and the tex tual ellipsis handler have both been implemented in Smalltalk as part of a comprehensive text parser for German. Besides the information techno logy domain, experiments with this parser have also been successfully run on medical domain texts, thus indicating that the grammar predicates we developed are not bound to a particular domain (knowledge base). The current lexicon contains a hierarchy of approximately 100 word class spe cifications with nearly 3.000 lexical entries and corresponding concept de scriptions from the LOOM knowledge representation system (MacGregor & Bates 1987) — 900 and 500 concept/role specifications for the information technology and medicine domain, respectively. Acknowledgements. We would like to thank our colleagues in the CLIF Lab who read earlier versions of this paper. In particular, improvements were due to discussions we had with N. Bröker, K. Markert, S. Schacht, K. Schnattinger, and S. Staab. This work has been funded by LGFG aden-Württemberg (1.1.4-7631.0; M. Strube) and a grant from DFG (Ha 2907/1-3; U. Hahn). REFERENCES Agha, Gul & Carl Hewitt. 1987. "Actors: A Conceptual Foundation for Concur rent Object-oriented Programming". Research Directions in Object-Oriented Programming ed. by B. Shriver et al., 49-74. Cambridge, Mass.: MIT Press. Charniak, Eugene. 1986. "A Neat Theory of Marker Passing". Proceedings of the 5th National Conference on Artificial Intelligence (AAAI '86), vol.1, 584-588. Clark, Herbert H. 1975. "Bridging." Proceedings of the Conference on Theoretical Issues in Natural Language Processing (TINLAP-1), Cambridge, Mass. ed. by Roger Schank & . Nash-Webber, 169-174. Dahl, Sten, ed. 1974. Topic and Comment, Contextual Boundness and Focus. Hamburg: Buske. Danes, František, ed. 1974. Papers on Functional Sentence Perspective. Prague: Academia. Grosz, Barbara J., Aravind K. Joshi & Scott Weinstein. 1995. "Centering: A Framework for Modeling the Local Coherence of Discourse". Computational Linguistics 21:2.203-225. Guha, R. V. & Douglas B. Lenat. 1990. "CYC: A Midterm Report". AI Maga zine 11:3.32-59.
96
UDO HAHN & MICHAEL STRUBE
Hahn, Udo. 1989. "Making Understanders out of Parsers: Semantically Driven Parsing as a Key Concept for Realistic Text Understanding Applications". International Journal of Intelligent Systems 4:3.345-393. Hahn, Udo, Susanne Schacht & Norbert Bröker. 1994. "Concurrent, Objectoriented Natural Language Parsing: The ParseTalk Model". International Journal of Human-Computer Studies 41:1/2.179-222. Hayes, Patrick J. 1985. "The Second Naive Physics Manifesto". Formal Theories of the Commonsense World ed. by J. Hobbs & R. Moore, 1-36. Norwood, N.J.: Ablex. MacGregor, Robert. 1991. "The Evolving Technology of Classification-based Knowledge Representation Systems." Principles of Semantic Networks ed. by J. Sowa, 385-400. San Mateo, Calif.: Morgan Kaufmann. MacGregor, Robert & Raymond Bates. 1987. The LOOM Knowledge Repres entation Language. Information Sciences Institute, University of Southern California (ISI/RS-87-188). Mars, Nicolaas J. I. 1994. "The Role of Ontologies in Structuring Large Know ledge Bases". Knowledge Building and Knowledge Sharing ed. by K. Fuchi & T. Yokoi, 240-248. Tokyo, Ohmsha and Amsterdam: IOS Press. Palmer, Martha S. et al. 1986. "Recovering Implicit Information". Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics (ACL86), 10-19. New York, N.Y. Rada, Roy, Hafedh Mili, Ellen Bicknell & Maria Blettner. 1989. "Development and Application of a Metric on Semantic Nets". IEEE Transactions on Sys tems, Man, and Cybernetics 19:1.17-30. Resnik, Philip. 1995. "Using Information Content to Evaluate Semantic Similar ity in a Taxonomy". Proceedings of the 14th International Joint Conference on Artificial Intelligence (IL95), vol.1, 448-453. Montreal, Canada. Rips, L. J., E. J. Shoben & E. E. Smith. 1973. "Semantic Distance and the Verification of Semantic Relations". Journal of Verbal Learning and Verbal Behavior 12:1.1-20. Simmons, Geoff. 1992. "Empirical Methods for 'Ontologicai Engineering'. Case Study: Objects". Ontologie und Axiomatik der Wissensbasis von LILOG ed. by G. Klose, E. Lang & Th. Piriein, 125-154. Berlin: Springer. Strube, Michael & Udo Hahn. 1995. "ParseTalk about Sentence- and Text-level Anaphora". Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL'95)i 237-244. Wada, Hajime. 1994. "A Treatment of Functional Definite Descriptions." Pro ceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol.II, 789-795. Kyoto, Japan.
Improving a Robust Morphological Analyser Using Lexical Transducers IÑAKi
ALEGRÍA, X A B I E R ARTOLA
&
K E P A SARASOLA
University of the Basque Country Abstract This paper describes the components of a robust and wide-coverage morphological analyser for Basque and their transformation into lex ical transducers. The analyser is based on the two-level formalism and has been designed in an incremental way with three main mod ules: the standard analyser, the analyser of linguistic variants, and the analyser without lexicon which can recognise word-forms without having their lemmas in the lexicon. This analyser is a basic tool for current and future work on automatic processing of Basque and its first three applications are a commercial spelling corrector and a general purpose lemmatiser/tagger. The lexical transducers are gen erated as a result of compiling the lexicon and a cascade of two-level rules (Karttunen et al. 1994). Their main advantages are speed and expressive power. Using lexical transducers for our analyser we have improved both the speed and the description of the different com ponents of the morphological system. Some slight limitations have been found too. 1
Introduction
The two-level model of morphology (Koskenniemi 1983) has become the most popular formalism for highly inflected and agglutinative languages. The two-level system is based on two main components: (i) a lexicon where the morphemes (lemmas and affixes) and the possible links among them (morphotactics) are defined; (ii) a set of rules which controls the mapping between the lexical level and the surface level due to the morphophonological transformations. The rules are compiled into transducers, so it is possible to apply the system for both analysis and generation. There is a free available software, PC-Kimmo (Antworth 1990) which is a useful tool to experiment with this formalism. Different flavours of two-level morphology have been developed, most of them changing the continuation-class based morphotactics by unification based mechanisms (Ritchie et al. 1992; Sproat 1992).
98
INAKI ALEGRIA,
XABIER
ARTOLA & ΚΕΡΑ SARASOLA
We did our own implementation of the two-level model with slights vari ations, and applied it to Basque (Agirre et al. 1992), a highly inflected and agglutinative language. In order to deal with a wide variety of linguistic data we built a Lexical Database (LDBB). This database is both source and support for the lexicons needed in several applications, and was designed with the objectives of being neutral in relation to linguistic formalisms, flexible, open and easy to use (Agirre et al. 1995). At present it contains over 60,000 entries, each with its associated linguistic features (category, sub-category, case, number, etc.). In order to increase the coverage and the robustness, the analyser has been designed in a incremental way. It is composed of three main modules (see Figure 1): the standard analyser, the analyser of linguistic variants pro duced due to dialectal uses and competence errors, and the analyser without lexicon which can recognise word-forms without having their lemmas in the lexicon. An important feature of the analyser is its homogeneity as the three different steps are based on two-level morphology, far from ad-hoc solutions.
Fig. 1: Modules of the analyser This analyser is a basic tool for current and future work on automatic pro cessing of Basque and its first two applications are a commercial spelling cor rector (Aduriz et al. 1994) and a general purpose lemmatiser/tagger (Aduriz et al. 1995). Following an overview of the lexical transducers and the description of the application of the two-level model and lexical transducers to the different steps of morphological analysis of Basque are given.
IMPROVING MORPHOLOGY USING TRANSDUCERS 2
99
Lexical transducers
A lexical transducer (Karttunen et al. 1992; Karttunen 1994) is a finitestate automaton that maps inflected surface forms into lexical forms, and can be seen as an evolution of two-level morphology where: • Morphological categories are represented as part of the lexical form. Thus it is possible to avoid the use of diacritics. • Inflected forms of the same word are mapped to the same canonical dictionary form. This increases the distance between the lexical and surface forms. For instance better is expressed through its canonical form good (good+COMP:better). • Intersection and composition of transducers is possible (see Kaplan & Kay 1994). In this way the integration of the lexicon (the lexicon will be another transducer) in the automaton can be resolved and the changes between lexical and surface level can be expressed as a cascade of two-level rule systems (Figure 2).
Fig. 2: Lexical transducers (from Karttunen et al. 1992) In addition, the morphological process using lexical transducers is very fast (thousands of words per second) and the transducer for a whole morpholo gical description can be compacted in less than 1 MB.
100
INAKI ALEGRIA,
XABIER
ARTOLA & ΚΕΡΑ SARASOLA
Different tools to build lexical transducers (Karttunen & Beesley 1992; Karttunen 1993) have been developed in Xerox and we are using them. Uses of lexical transducers are documented by Chanod (1994) and Kwon & Karttunen (1994). 3
T h e s t a n d a r d analyser
Basque is an agglutinative language; that is, for the formation of words the dictionary entry independently takes each of the elements necessary for the different functions (syntactic case included). More specifically, the affixes corresponding to the determinant, number and declension case are taken in this order and independently of each other (deep morphological structure). One of the principal characteristics of Basque is its declension system with numerous cases, which differentiates it from the languages spoken in the surrounding countries. We have applied the two-level model defining the following elements (Agirre et al. 1992; Alegría 1995): • Lexicon: over 60,000 entries have been defined corresponding to lem mas and affixes, grouped into 154 sublexicons. The representation of the entries is not canonical because 18 diacritics are used to control the application of morphophonological rules. • Continuation classes: they are groups of sublexicons to control the morphotactics. Each entry of the lexicon has its continuation class and all together define the morphotactics graph. The long distance de pendencies among morphemes can not be properly expressed by con tinuation classes, therefore in our implementation we extended their semantics defining the so-called extended continuation classes. • Morphophonological rules: 24 two-level rules have been defined to express the morphological, phonological and orthographic changes between the lexical and the surface levels that appear when the morph emes are combined. The morphological analyser attaches to each input word-form all possible in terpretations and its associated information that is given in pairs of morphosyntactic features. The conversion of our description to a lexical transducer was done in the following steps: 1. Canonical forms and morphological categories were integrated in the lexicon from the lexical data-base.
IMPROVING MORPHOLOGY USING TRANSDUCERS
101
2. Due to long distance dependencies among morphemes, which could not be resolved in the lexicon, two additional rules were written to ban some combinations of morphemes. These rules can be put in a different rule system near to the lexicon without mixing morphotactics and morphophonology (see Figure 3). 3. The standard rules could be left without changes (mapping in the lexicon canonical forms and arbitrary forms) but were changed in or der to change diacritics by morphological features, doing a clearer description of the morphology of the language.
Fig. 3: Lexical transducer for the standard analysis of Basque The resultant lexical transducer is about 500 times faster than the original system.
102 4
INAKI ALEGRIA, XABIER ARTOLA & KEPA SARASOLA T h e analysis and correction of linguistic variants
Because of the recent standardisation and the widespread dialectal use of Basque, the standard morphology is not enough to offer good results when analysing corpora. To increase the coverage of the morphological processor an additional two-level subsystem was added (Aduriz et al. 1993). This subsystem is also used in the spelling corrector to manage competence errors and has two main components: 1. New morphemes linked to the corresponding correct ones. They are added to the lexical system and they describe particular variations, mainly dialectal forms. Thus, the new entry tikan, dialectal form of the ablative singular morpheme, linked to its corresponding right entry tik will be able to analyse and correct word-forms such etxetikan, k a l e t i k a n , ... (variants of e t x e t i k from the house, k a l e t i k from the street, ...). Changing the continuation class of morphemes morphotactic errors can be analysed. 2. New two-level rules describing the most likely regular changes that are produced in the variants. These rules have the same structure and management than the standard ones. Twenty five new rules have been defined to cover the most common competence errors. For instance, the rule h:0 => V:V_V:V describes that between vowels the h of the lexical level may disappear in the surface level. In this way the wordform bear, misspelling of behar, to need, can be analysed. All these rules are optional and have to be compiled with the standard rules but some inconsistencies have to be solved because some new changes were forbidden in the original rules. To correct the word-form the result of the analysis has to be entered into the morphological generation using correct morphemes linked to variants and original rules. To correct beartzetikan, variant of b e h a r t z e t i k , two steps, analysis and generation, are followed as it is shown in Figure 4. When we decided to use lexical transducers for the treatment of linguistic variants, the following procedure was applied: 1. The additional morphemes linked to the standard ones are solved using the possibility of expressing two levels in the lexicon. In one level the non-standard morpheme will be specified and in the other (the correspondent to the result of the analysis) the standard morpheme. 2. The additional rules do not need to be integrated with the standard ones (Figure 5), and so, it is not necessary to solve the inconsistencies.
IMPROVING MORPHOLOGY USING TRANSDUCERS
103
Fig. 4: Steps {or correction As Figure 5 (B) shows, it is possible and clearer to put these rules in other plane near to the surface, because most of the additional rules are due to phonetic changes and do not require morphological information. Only the surface characters, the morpheme boundary and additional information about one change (the final a of lemmas) complete the intermediate level between the two rule systems. 3. In our original implementation it was possible to distinguish between standard and non-standard analysis (the additional rules are marked and this information can be obtained as result of the analysis), and so the non- standard information can be additional; but with lexical transducers, it is necessary to store two transducers one for standard analysis and other for standard and non-standard analysis. Although in the original system the speed of analysis using additional in formation was two or three times slower than the standard analysis, using lexical transducers the difference between both analysis is very slight. 5
The analysis of unknown words
Based on the idea used in speech synthesis (Black et al. 1991), a two-level mechanism for analysis without lexicon was added to increase the robustness of the analyser.
104
INAKI ALEGRIA, XABIER ARTOLA & KEPA SARASOLA
(A)
(B)
Fig. 5: Lexical transducer for the analysis of linguistic
variants
This mechanism has the following two main components in order to be capable of treating unknown words: 1. generic lemmas represented by "??" (one for each possible open cat egory or subcategory) which are organised with the affixes in a small two-level lexicon 2. two additional rules in order to express the relationship between the generic lemmas at lexical level and any acceptable lemma of Basque, which are combined with the standard ones Some standard rules have to be modified because surface and lexical level are specified, and in this kind of analysis the lexical level of the lemmas changes. The two-level mechanism is also used to analyse the unknown forms, and the obtention of at least one analysis is guaranteed. In order to eliminate the great number of ambiguities in the analysis, a local disambiguation process is carried out.
IMPROVING MORPHOLOGY USING TRANSDUCERS
105
By using lexical transducers the two additional rules can be placed inde pendently (see Figure 6), and so, the original rules can remain unchanged. In this case the additional subsystem is arranged close to the lexicon be cause it maps the transformation between generic and hypothetical lemmas at lexical level. The resultant lexical transducer is very compact and fast.
Fig. 6: Lexical transducer for the analysis of unknown words Our system has a user lexicon and an interface to the update process too. Some information about the new entries (mainly part of speech) is necessary to add them to the user lexicon. The user lexicon is combined with the general one increasing the coverage of the morphological analyser. This mechanism is very useful in the process of spelling correction but an on line updating of the user lexicon is necessary. This treatment is carried out in our original implementation but, when we use lexical transducers the updating operation is slow (it is necessary to compile everything together) and therefore, there are problems for on-line updating. Carter (1995) proposes compiling affixes and rules, but no lemmas, in order to have flexibility when dealing with open lexicons, but it presents problems managing compounds at run-time.
106 6
INAKI ALEGRIA, XABIER ARTOLA & KEPA SARASOLA Conclusions
A two-level formalism based morphological processor has been designed in a incremental way in three main modules: the standard analyser, the analyser of linguistic variants produced due to dialectal uses and competence errors, and the analyser without lexicon which can recognise word-forms without having their lemmas in the lexicon. This analyser is a basic tool for current and future work on automatic processing of Basque. A B 4.846 2.343 2.607 1.429 307 85 101 28 22 85 (84%) (79%) 21 4 Full wrong analysis Precision 99,2% 99,7%
Concept Number of words Different words Unknown words Linguistic variants Analysed
A+B 7.207 4.036 392 129 107 (83%) 25 99,4%
Table 1: Figures about the different kinds of analysis Figures about the precision of the analyser are given in Table 6. Two different corpora were used: (A) a text of a magazine where foreign names appear and (B) a text about philosophy. The percents of unknown words and precision are calculated on different words, so, the results with all the corpus would be better. Using lexical transducers for our analyser we have improved both the speed and the description of the different components of the tool. Some slight limitations have been found too. Acknowledgements. This work had partial support from the local Government of Gipuzkoa and from the Government of the Basque Country. We would like to thank to Xerox for letting us using their tools, and also to Ken Beesley and Lauri Karttunen for their help in using these tools and designing the lexical transducers. We also want to acknowledge to Eneko Agirre for his help with the English version of this manuscript.
IMPROVING MORPHOLOGY USING TRANSDUCERS
107
REFERENCES Aduriz, Itziar, E. Agirre, I. Alegria, X. Arregi, J.M. Arriola, X. Artola, A, Diaz de Illarraza, N. Ezeiza, M. Maritxalar, K. Sarasola & M. Urkia. 1993. "A Morphological Analysis Based Method for Spelling Correction". Proceedings of the 6th Conference of the European Association for Computational Lin guistics (EACL'93), 463-463. Utrecht, The Netherlands. , E. Agirre, I. Alegria, X. Arregi, J.M. Arriola, X. Artola, Da Costa A., A. Diaz de Illarraza, N. Ezeiza, M. Maritxalar, K. Sarasola & M. Urkia. 1994. "Xuxen-Mac: un corrector ortografico para textos en euskara". Proceedings of the 1st Conference Universidad y Macintosh, UNIMAC, vol.11, 305-310. Madrid, Spain. , I. Alegria, J.M. Arriola, X. Artola, Diaz de Ilarraza A., N. Ezeiza, K, Gojenola, M. Maritxalar. 1995. "Different issues in the design of a lemmatiser/tagger for Basque". From Text to Tag Workshop, SIGDAT (EACL''95), 18-23. Dublin, Ireland. Agirre, Eneko, I. Alegria, X. Arregi, X. Artola, A. Diaz de Illarraza, M. Maritx alar, K. Sarasola & M. Urkia. 1992. "XUXEN: A spelling checker/corrector for Basque based on Two-Level morphology". Proceedings of the 3rd Con ference Applied Natural Language Processing (ANLP'92), 119-125. Trento, Italy. , X. Arregi, J.M. Arriola, X. Artola, A. Diaz de Illarraza, J.M. Insausti & K. Sarasola. 1995. "Different issues in the design of a general-purpose Lexical Database for Basque". Proceedings of the 1st Workshop on Applications of Natural Language to Data Bases (NLDB'95), Versailles, France, 299-313. Alegria, Iñaki. 1995. Euskal morfologiaren tratamendu automatikorako tresnak. Ph.D. dissertation, University of the Basque Country. Donostia, Basque Country. Antworth, Evan L. 1990. PC-KIMMO: A two-level processor for morphological analysis. Dallas, Texas: Summer Institute of Linguistics. Black, Alan W., Joke van de Plassche & Briony Williams. 1991. "Analysis of Unknown Words through Morphological Descomposition". Proceedings of the 5th Conference of the European Association for Computational Linguistics (EACL'91), vol.1, 101-106. Carter, David. 1995. "Rapid development of morphological descriptions for full language processing system". Proceedings of the 5th Conference of the European Association for Computational Linguistics (EACL'95), 202-209. Dublin, Ireland. Chanod, Jean-Pierre. 1994. "Finite-state Composition of French Verb Morpho logy". Technical Report (Xerox MLTT-005). Meylan, France: Rank Xerox Research Center, Grenoble Laboratory.
108
INAKI ALEGRIA, XABIER ARTOLA & KEPA SARASOLA
Kaplan, Ronald M. & Martin Kay. 1994. "Regular models of phonological rule systems". Computational Linguistics 20:3.331-380. Karttunen, Lauri & Kenneth R. Beesley. 1992. "Two-Level Rule Compiler". Technical Report (Xerox ISTL-NLTT-1992-2). Palo Alto, Calif.: Xerox. Palo Alto Research Center. , Ronald M. Kaplan & Annie Zaenen. 1992. "Two-level morphology with composition". Proceedings of the 14th Conference on Computational Lin guistics (COLING'92), vol.1, 141-148. Nantes, Prance. 1993. "Finite-State Lexicon Compiler". Technical Report (Xerox ISTLNLTT-1993-04-02). Xerox. Palo Alto Research Center. 3333 Coyote Hill Road. Palo Alto, CA 94304 1994. "Constructing Lexical Transducers". Proceedings of the 15th Con ference on Computational Linguistics (COLING'94), vol.1, 406-411. Kyoto, Japan. Koskenniemi, Kimmo. 1983. Two-level Morphology: A general Computational Model for Word-Form Recognition and Production. Publications 11. Univer sity of Helsinki. Kwon, Hyuk-Chul & Lauri Karttunen. 1994. "Incremental construction of a lexical transducer for Korean". Proceedings of the 15th Conference on Com putational Linguistics (COLING,94)-l vol.11, 1262-1266. Kyoto, Japan. Ritchie, Graeme D., Alan W. Black, Graham J. Russell & Stephen G. Pulman. 1992. Computational Morphology. Cambridge, Mass.: MIT Press. Sproat, Richard. 1992. Morphology and Computation. Press.
Cambridge, Mass.: MIT
II SEMANTICS AND DISAMBIGUATION
Context-Sensitive Word Distance by Adaptive Scaling of a Semantic Space HIDEKI KOZIMA & AKIRA ITO
Communications Research Laboratory Abstract This paper proposes a computationally feasible method for measuring the context-sensitive semantic distance between words. The distance is computed by adaptive scaling of a semantic space. In the semantic space, each word in the vocabulary V is represented by a multi dimensional vector which is extracted from an English dictionary through principal component analysis. Given a word set C which specifies a context, each dimension of the semantic space is scaled up or down according to the distribution of C in the semantic space. In the space thus transformed, the distance between words in V becomes dependent on the context (7. An evaluation through a word prediction task shows that the proposed measurement successfully extracts the context of a text. 1
Introduction
Semantic distance (or similarity) between words is one of the basic meas urements used in many fields of natural language processing, information retrieval, etc. Word distance provides bottom-up information for text under standing and generation, since it indicates semantic relationships between words that form a coherent text structure (Grosz & Sidner 1986); word dis tance also provides a basis for text retrieval (Schank 1990), since it works as associative links between texts. A number of methods for measuring semantic word distance have been proposed in the studies of psycholinguistics, computational linguistics, etc. One of the pioneering works in psycholinguistics is the 'semantic differ ential' (Osgood 1952), which analyses the meaning of words by means of psychological experiments on human subjects. Recent studies in computa tional linguistics proposed computationally feasible methods for measuring semantic word distance. For example, Morris & Hirst (1991) used Roget's thesaurus as a knowledge base for determining whether or not two words are semantically related; Brown et al. (1992) classified a vocabulary into semantic classes according to the co-occurrency of words in large corpora;
112
HIDEKI KOZIMA & AKIRA ITO
Kozima & Furugori (1993) computed the similarity between words by means of spreading activation on a semantic network of an English dictionary. The measurements in these former studies are so-called context-free or static ones, since they measure word distance irrespective of contexts. How ever, word distance changes in different contexts. For example, from the word car, we can associate related words in the following two directions: • car → bus, t a x i , railway, • car → engine, t i r e , seat, • • • The former is in the context of 'vehicle', and the latter is in the context of 'components of a car'. Even in free-association tasks, we often imagine a certain context for retrieving related words. In this paper, we will incorporate context-sensitivity into semantic dis tance between words. A context can be specified by a set C of keywords of the context (for example, {car, bus} for the context 'vehicle'). Now we can exemplify the context-sensitive word association as follows: • C= {car, bus} → t a x i , railway, airplane, ••• • C— {car, engine} → t i r e , seat, headlight, ••• Generally, we observe a different distance for different context. So, in this paper we will deal with the following problem: Under the context specified by a given word set C, compute semantic distance d(w,w'\C) between any two words w,w' in our vocabulary V. Our strategy for this context-sensitivity is 'adaptive scaling of a semantic space'. Section 2 introduces the semantic space where each word in the vocabulary V is represented by a multi-dimensional semantic vector. Sec tion 3 describes the adaptive scaling. For a given word set C that specifies a context, each dimension of the semantic space is scaled up or down accord ing to the distribution of C in the semantic space. After this transformation, distance between Q-vectors becomes dependent on the given context. Sec tion 4 shows some examples of the context-sensitive word distance thus computed. Section 5 evaluates the proposed measurement through word prediction task. Section 6 discusses some theoretical aspects of the pro posed method, and Section 7 gives our conclusion and perspective. 2
Vector-representation of word meaning
Each word in the vocabulary V is represented by a multi-dimensional Qvector. In order to obtain Q-vectors, we first generate 2851-dimensional
CONTEXT-SENSITIVE WORD DISTANCE
113
Fig. 1: Mapping words onto Q-vectors P-vectors by spreading activation on a semantic network of an English dic tionary (Kozima & Furugori 1993). Next, through principal component analysis on P-vectors, we map each P-vector onto a Q-vector with a re duced number of dimensions (see Figure 1). 2.1
From an English dictionary to P-vectors
Every word w in the vocabulary V is mapped onto a P-vector P(w) by spreading activation on the semantic network. The network is systematic ally constructed from a subset of the English dictionary, LDOCE (Longman Dictionary of Contemporary English). The network has 2851 nodes corres ponding to the words in LDV (Longman Defining Vocabulary, 2851 words). The network also has 295914 links between these nodes — each node has a set of links corresponding to the words in its definition in LDOCE. Since every headword in LDOCE is defined by using LDV only, the network be comes a closed cross-reference network of English words. Each node of the network can hold activity, and this activity flows through the links. Hence, activating a node in the network for a certain period of time causes the activity to spread over the network and forms a pattern of activity distribution on it. Figure 2 shows the pattern gener ated by activating the node red; the graph plots the activity values of 10 dominant nodes at each step in time. The P-vector P(w) of a word w is the pattern of activity distribution generated by activating the node corresponding to w. P(w) is a 2851dimensional vector consisting of activity values of the nodes at T —10 as an approximation of the equilibrium. P(w) indicates how strongly each node of the network is semanticaliy related with w. In this paper, we define the vocabulary V as LDV (2851 words) in or der to make our argument and experiments simple. Although V is not a large vocabulary, it covers 83.07% of the 1006815 words in the LancasterOslo/Bergen (LOB) corpus. In addition, V can be extended to the set of
114
HIDEKIKOZIMA & AKIRA ITO
Fig. 2: Spreading activation
Fig. 3: Clustering of P-vectors
all headwords in LDOCE (more than 56000 words), since a P-vector of a non-LDV word can be produced by activating a set of the LDV-words in its dictionary definition. (Remember that every headword in LDOCE is defined using only LDV.) The P-vector P(w) represents the meaning of the word w in its rela tionship to other words in the vocabulary V. Geometric distance between two P-vectors P(w) and P(w') indicates semantic distance between the words w and w''. Figure 3 shows a part of the result of hierarchical clus tering on P-vectors, using Euclidean distance between centers of clusters. The dendrogram reflects intuitive semantic similarity between words: for instance, rat/mouse, t i g e r / l i o n / c a t , etc. However, the similarity thus observed is context-free and static. The purpose of this paper is to make it context-sensitive and dynamic. 2.2
From P-vectors to Q-vectors
Through principal component analysis, we map every P-vector onto a Qvector, of which we will define context-sensitive distance later. The principal component analysis of P-vectors provides a series of 2851 principal compon ents. The most significant m principal components work as new orthogonal axes that span m-dimensional vector space. By these m principal compon ents, every P-vector (with 2851 dimensions) can be mapped onto a Q-vector (with m dimensions). The value of m, which will be determined later, is much smaller than 2851. This brings about not only compression of the semantic information, but also elimination of the noise in P-vectors. First, we compute the principal components X 1 , X 2 , • • •, X 2851 — each
CONTEXT-SENSITIVE WORD DISTANCE
115
of which is a 2851-dimensional vector — under the following conditions: • For any x3 its norm |x2| is 1. • For any X3,X3(i ≠ j), their inner product (Xi,X3) is 0. • The variance vi of P-vectors projected onto Xi is not smaller than any vi (j> i). In other words, X1 is the first principal component with the largest variance of P-vectors, and X2 is the second principal component with the secondlargest variance of P-vectors, and so on. Consequently, the set of principal components X 1 , X2 ,..., X 2851 provides a new orthonormal coordinate sys tem for P-vectors. Next, we pick up the first m principal components X 1 , X2, ...,Xm. The principal components are in descending order of their significance, because the variance vi indicates the amount of information represented by Xi We found that even the first 200 axes (7.02% of the 2851 axes) can represent 45.11% of the total information of P-vectors. The amount of information represented by Q-vectors increases with m: 66.21% for the first 500 axes, 82.80% for the first 1000 axes. However, for large m, each Q-vector would be isolated because of overfitting — a large number of parameters could not be estimated by a small number of data. We estimate the optimal number of dimensions of Q-vectors to be m = 281, which can represent 52.66% of the total information. This optimisation is done by minimising the proportion of noise remaining in Q-vectors. The amount of the noise is estimated by ∑wЄF |Q(w)|, where F ( V) is a set of 210 function words — determiners, articles, prepositions, pronouns, and conjunctions. We estimated the proportion of noise for all m = 1, • • •, 2851 and obtained the minimum for m = 281. Therefore, from now we will use a 281-dimensional semantic space. Finally, we map each P-vector P(w) onto a 281-dimensional Q-vector Q(w). The i-th component of Q(w) is the projected value of P(w) on the principal component Xi; the origin of Xi is set to the average of the projected values on it. 3
Adaptive scaling of the semantic space
Adaptive scaling of the semantic space of Q-vectors provides context-sensitive and dynamic distance between Q-vectors. Simple Euclidean distance between Q-vectors is not so different from that between P-vectors; both are contextfree and static distances. The adaptive scaling process transforms the se mantic space to adapt it to a given context C. In the semantic space thus
116
HIDEKI KOZIMA & AKIRA ITO
Fig. 4: Adaptive scaling
Fig. 5: Clusters in a subspace
transformed, simple Euclidean distance between Q-vectors becomes depend ent on C. (See Figure 4.) 3.1
Semantic subspaces
A subspace of the semantic space of Q-vectors works as a simple device for semantic word clustering. In a semantic subspace with the dimensions appropriately selected, the Q-vectors of semantically related words are ex pected to form a cluster. The reasons for this are as follows: • Semantically related words have similar P-vectors, as illustrated in Figure 3. • The dimensions of Q-vectors are extracted from the correlations between P-vectors by means of principal component analysis. As an example of word clustering in the semantic subspaces, let us consider the following 15 words: 1. after, 2. ago, 3. before, 4. bicycle, 5. bus, 6. car, 7. enjoy, 8. former, 9. glad, 10. good, 11. l a t e , 12. pleasant, 13. railway, 14. s a t i s f a c t i o n , 15. vehicle. We plotted these words on the subspace I 2 x l 3 , namely the plane spanned by the second and third dimensions of Q-vectors. As shown in Figure 5, the words form three apparent clusters, namely 'goodness', 'vehicle', and 'past'. However, it is still difficult to select appropriate dimensions for mak ing a semantic cluster for given words. In the example above, we used only two dimensions; most semantic clusters need more dimensions to be well-separated. Moreover, each of the 2851 dimensions is simply selected
CONTEXT-SENSITIVE WORD DISTANCE
117
Fig. 6: Adaptive scaling of the semantic space or discarded; this ignores their possible contribution to the formation of clusters. 3.2
Adaptive scaling
Adaptive scaling of the semantic space provides a weight for each dimension in order to form a desired semantic cluster; these weights are given by scaling factors of the dimensions. This method makes the semantic space adapt to a given context C in the following way: Each dimension of the semantic space is scaled up or down so as to make the words in C form a cluster in the semantic space. In the semantic space thus transformed, the distance between Q-vectors changes with C. For example, as illustrated in Figure 6, when C has ovalshaped (generally, hyper-elliptic) distribution in the pre-scaling space, each dimension is scaled up or down so that C has a round-shaped (generally, hyper-spherical) distribution in the transformed space. This coordinate transformation changes the mutual distance among Q-vectors. In the raw semantic space (Figure 6, left), the Q-vector • is closer to C than the Qvector o; in the transformed space (Figure 6, right), it is the other way round — o is closer to C, while • is further apart. The distance d(w,w'\C) between two words w,w' under the context C = {w1, • • •, wn} is defined as follows:
where Q(w) and Q(w') are the m-dimensional Q-vectors of w and w'; re spectively: Q(w) = (q1 ..., qm), Q(w') = (q', • • •, q'm).
118
HIDEKI KOZIMA & AKIRA ITO
The scaling factor fi G [0,1] of the z'-th dimension is defined as follows:
where SD i (C) is the standard deviation of the z-th component values of w1, • • •, wn, and SD i (V) is that of the words in the whole vocabulary V. The operation of the adaptive scaling described above is summarised as follows. • If C forms a compact cluster in the i-th dimension (ri 0), the di mension is scaled up (fi 1) to be sensitive to small differences in the dimension. • If C does not form an apparent cluster in the z-th dimension (ri >>0), the dimension is scaled down (fi0) to ignore small differences in the dimension. Now we can tune the distance between Q-vectors to a given word set C which specifies the context for measuring the distance. In other words, we can tune the semantic space of Q-vectors to the context C. This tune-up procedure is not computationally expensive, because once we have computed the set of Q-vectors and SD 1 (V), • • •, SD m (V), then all we have to do is to compute the scaling factors f1,..., fm for a given word set C Computing distance between Q-vectors in the transformed space is no more expensive than computing simple Euclidean distance between Q-vectors. 4
Examples of measuring the word distance
Let us see a few examples of the context-sensitive distance between words computed by adaptive scaling of the semantic space with 281 dimensions. Here we deal with the following problem: Under the context specified by a given word set C, compute the distance d(w, C) between w and C, for every word w in our vocabulary V. The distance d(w,C) is defined as follows:
This means that the distance d(w, C) is equal to the distance between w and the center of C in the semantic space transformed. In other words, d(w ,C) indicates the distance of w from the context C.
CONTEXT-SENSITIVE WORD DISTANCE (7 = {bus, car, railway} +
wЄC (15) car_l r a i l way J. bus_l carriage-1 motor_l motor_2 track_2 track_l road-1 passenger_l vehicle_l engine.l garage-1 train_l belt.l
d(w, C) 0.1039 0.1131 0.1141 0.1439 0.1649 0.1949 0.1995 0.2024 0.2038 0.2185 0.2274 0.2469 0.2770 0.2792 0.2853
119
C = {bus, scenery, tour} wЄC+(15) bus_l scenery_l tour - 2 tour-l abroad-1 tourist-l passenger-l make-2 make-3 everywhere_l garage.l set.2 machinery_l something-l timetable.l
d(w, C) 0.1008 0.1122 0.1211 0.1288 0.1559 0.1593 0.1622 0.1691 0.1706 0.1713 0.1715 0.1723 0.1733 0.1743 0.1744
Table 1: Association from a given word set C Now we can extract a word set C+(k) which consists of the k closest words to the given context C. This extraction is done by the following procedure: 1. Sort all words in our vocabulary V in ascending order of d(w, C). 2. Let C+(k) be the word set which consists of the first k words in the sorted list. Note that C+(k) may not include all words in C, even if k > \C\. Here we will see some examples of extracting C+(k) from a given context C. When the word set C = {bus, car, railway} is given, our contextsensitive word distance produces the cluster C + (15) shown in Table 1 (left). We can see from the list1 that our word distance successfully associates related words like motor and passenger in the context of 'vehicle'. On the other hand, from C = {bus, scenery, t o u r } , the cluster C + (15) shown in Table 1 (right) is obtained. We can see the context 'bus tour' from the list. Note that the list is quite different from that of the former example, though both contexts contain the word bus. When the word set C = {read, paper, magazine}, the following cluster C + (12) is obtained. (The words are listed in ascending order of the dis tance.) {paper_l, read_l, magazine.l, newspaper_l, print_2, book_l, p r i n t _ l , wall_l, something_l, a r t i c l e _ l , s p e c i a l i s t - 1 , t h a t - l } . 1
Note that words with different suffix numbers correspond to different headwords (i.e., homographs with different word classes) of the English dictionary LDOCE. For in stance, motor_l / noun, motor_2 / adjective.
120
HIDEKI KOZIMA & AKIRA ITO n
e
1 2 3 4 5 6 7 8
0.3248 0.1838 0.1623 0.1602 0.1635 0.1696 0.1749 0.1801
Fig. 7: Word prediction task (left) and its result (right) It is obvious that the extracted context is 'education' or 'study'. On the other hand, when C = {read, machine, memory}, the following word set C+ (12) is obtained. {machine_l, memory_l, read_l, computer_i, remember_l, someone_l, have-2, t h a t - l , instrument-1, f eeling_2, that_2, what_2}. It seems that most of the words are related to 'computer' or 'mind'. These two clusters are quite different, in spite of the fact that both contexts contain the word read. 5
Evaluation through word prediction
We evaluate the context-sensitive word distance through predicting words in a text. When one is reading a text (for instance, a novel), he or she often predicts what is going to happen next by using what has happened already. Here we will deal with the following problem: For each sentence in a given text, predict the words in the sen tence by using the preceding n sentences. This task is not so difficult for human adults because a target sentence and the preceding sentences tend to share the same contexts. This means that predictability of the target sentence suggests how successfully we extract information about the context from preceding sentences. Consider a text as a sequence S 1 ,...., SN, where Si is the i-th sentence of the text (see Figure 7, left). For a given target sentence Si, let Ci be a set of the concatenation of the preceding n sentences: Ci = {Si-n . . . S i - 1 } . Then, the prediction error ei of Si is computed as follows: 1. Sort all the words in our vocabulary V in ascending order of d(w, Ci). 2. Compute the average rank ri of wij Є Si in the sorted list. 3. Let the prediction error ei be the relative average rank ri/ |V'/.
CONTEXT-SENSITIVE WORD DISTANCE
121
Note that here we use the vocabulary V which consists of 2641 words — we removed 210 function words from the vocabulary V. Obviously, the prediction is successful when ei0. We used 0 . Henry's short story 'Springtime a la Carte' (Thornley 1960: 56-62) for the evaluation. The text consists of 110 sentences (1620 words). We computed the average value e of the prediction error ei for each target sentence Si (i = n + l , . . . , 110). For different numbers of preceding sentences (n = 1 , . . . , 8) the average prediction error ē is computed and shown in Figure 7 (right). If prediction is random, the expected value of the average prediction error ē is 0.5 (i.e., chance). Our method predicted the succeeding words better than randomly; the best result was observed for n — 4. Without adaptive scaling of the semantic space, simple Euclidean distance resulted in ē = 0.2905 for n — 4; our method is better than this, except for n — 1. When the succeeding words are predicted by using prior probability of word occurrence, we obtained ē — 0.2291. The prior probability is estimated by the word frequency in West's five-million-word corpus (West 1953). Again our result is better than this, except for n = 1. 6 6.1
Discussion Semantic vectors
A monolingual dictionary describes the denotational meaning of words by using the words defined in it; a dictionary is a self-contained and selfsufficient system of words. Hence, a dictionary contains the knowledge for natural language processing (Wilks et al. 1989). We represented the meaning of words by semantic vectors generated by the semantic network of the English dictionary LDOCE. While the semantic network ignores the syntactic structures in dictionary definitions, each semantic vector contains at least a part of the meaning of the headword (Kozima & Furugori 1993). Co-occurrency statistics on corpora also provide semantic information for natural language processing. For example, mutual information (Church & Hanks 1990) and n-grams (Brown et al. 1992) can extract semantic re lationships between words. We can represent the meaning of words by the co-occurrency vectors extracted from corpora. In spite of the sparseness of corpora, each co-occurrency vector contains at least a part of the meaning of the word. Semantic vectors from dictionaries and co-occurrency vectors from corpora would have different semantic information (Niwa & Nitta 1994). The former
122
HIDEKI KOZIMA & AKIRA ITO
displays paradigmatic relationships between words, and the latter syntagmatic relationships between words. We should incorporate both of these complementary knowledge sources into the vector-representation of word meaning. 6.2
Word prediction and text structure
In the word prediction task described in Section 5, we observed the best average prediction error e for n = 4 , where n denotes the number of preceding sentences. It is likely that e will decrease with increasing n, since the more we read the preceding text, the better we can predict the succeeding text. However, we observed the best result for n = 4. Most studies on text structure assume that a text can be segmented into units that form a text structure (Grosz & Sidner 1986). Scenes in a text are contiguous and non-overlapping units, each of which describes certain objects (characters and properties) in a situation (time, place, and backgrounds). This means that different scenes have different contexts. The reason why n = 4 gives the best prediction lies in the alternation of the scenes in the text. When both a target sentence Si and the preceding sentences Ci are in one scene, prediction of Si from d would be successful. Otherwise, the prediction would fail. A psychological experiment (Kozima & Furugori 1994) supports this correlation with the text structure. 7
Conclusion
We proposed context-sensitive and dynamic measurement of word distance computed by adaptive scaling of the semantic space. In the semantic space, each word in the vocabulary is represented by an m-dimensional Q-vector. Q-vectors are obtained through a principal component analysis on P-vectors. P-vectors are generated by spreading activation on the semantic network which is constructed systematically from the English dictionary (LDOCE). The number of dimensions, m = 281, is determined by minimising the noise remaining in Q-vectors. Given a word set C which specifies a context, each dimension of the Q-vector space is scaled up or down according to the distribution of C in the space. In the semantic space thus transformed, word distance becomes dependent on the context specified by C. An evaluation through predicting words in a text shows that the proposed measurement captures the context of the text well.
CONTEXT-SENSITIVE WORD DISTANCE
123
T h e context-sensitive and dynamic word distance proposed here can be applied in many fields of natural language processing, information retrieval, etc. For example, the proposed measurement can be used for word sense disambiguation, in t h a t the extracted context provides bias for lexical am biguity. Also prediction of succeeding words will reduce the computational cost in speech recognition tasks. In future research, we regard the adaptive scaling method as a model of human memory and attention t h a t enables us to follow a current context, to put a restriction on memory search, and to predict what is going to happen next. REFERENCES Brown, Peter F., Vincent J. Delia Pietra, Peter V. deSouza, Jenifer C. Lai & Robert L. Mercer. 1992. "Class-Based n-gram Models of Natural Language". Computational Linguistics 18:4.467-479. Church, Kenneth W. & Patrick Hanks. 1990. "Word Association Norms, Mutual Information, and Lexicography". Computational Linguistics 16:1.22-29. Grosz, Barbara J. & Candance L. Sidner. 1986. "Attention, Intentions, and the Structure of Discourse". Computational Linguistics 12:3.175-204. Kozima, Hideki & Teiji Furugori. 1993. "Similarity between Words Computed by Spreading Activation on an English Dictionary". Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL'93), 232-239. Utrecht, The Netherlands. Kozima, Hideki & Teiji Furugori. 1994. "Segmenting Narrative Text into Coher ent Scenes". Literary and Linguistic Computing 9:1.13-19. Morris, Jane and Graeme Hirst. 1991. "Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text". Computational Linguist ics 17:1.21-48. Niwa, Yoshiki & Yoshihiko Nitta. 1994. "Co-occurrence Vectors from Corpora vs. Distance Vectors from Dictionaries". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 304-309. Kyoto, Japan. Osgood, Charles E. 1952. "The Nature and Measurement of Meaning". Psycho logical Bulletin 49:3.197-237. Schank, Roger C. 1990. Tell Me a Story: A New Look at Real and Artificial Memory. New York: Scribner. Thornley, G. C. 1960. British and American Short Stories. Harlow: Longman. West, Michael. 1953. A General Service List of English Words. Harlow: Long man.
124
HIDEKI KOZIMA & AKIRA ITO
Wilks, Yorick, Dan Fass, Cheng-Ming Guo, James McDonald, Tony Plate, & Brian Slator. 1989. "A Tractable Machine Dictionary as a Resource for Computational Semantics". Computational Lexicography for Natural Lan guage Processing ed. by Bran Boguraev & Ted Briscoe, 193-228. Harlow: Longman.
Towards a Sublanguage-Based Semantic Clustering Algorithm M. VICTORIA A R R A N Z , 1 IAN R A D F O R D , SOFIA ANANIADOU & JUN-ICHI T S U J I I
Centre for Computational Linguistics, UMIST Abstract This paper presents the implementation of a tool kit for the ex traction of ontological knowledge from relatively small sublanguagespecific corpora. The fundamental idea behind this system, that of knowledge acquisition (KA) as an evolutionary process, is discussed in detail. Special emphasis is given to the modular and interactive approach of the system, which is carried out iteratively. 1
Introduction
Not knowing which knowledge to encode happens to be one of the main reas ons for difficulties in current NLP applications. As mentioned by Grishman & Kittredge (1986), many of these language processing problems can for tunately be restricted to the specificities of the language usage in a certain knowledge domain. The diversity of language encountered here is consid erably smaller and more systematic in structure and meaning than that of the whole language. Approaching the extraction of knowledge on a sublan guage basis reduces the amount of knowledge to discover, as well as easing the discovery task. One such case of this sublanguage-based research is, for instance, the work carried out by Grishman & Sterling (1992) on selectional pattern acquisition from sample texts. However, we should also bear in mind the necessity for systematic meth odologies of knowledge acquisition, duly supported by software, as already emphasised by several authors (Grishman et al. 1986; Tsujii et al. 1992). Preparation of domain-specific knowledge for a NLP application still relies heavily on human introspection, due mainly to the non-trivial relationship between the ontological knowledge and the actual language usage. This makes the process complex and very time-consuming. In addition, while traditional statistical techniques have proven useful for knowledge acquisition from large corpora (Church & Hanks 1989; Brown 1
Sponsored by the Departamento de Education, Universidades e Investigation of the Basque Government, Spain. */****
126
ARRANZ, RADFORD, ANANIADOU & TSUJII
et al. 1991), they still present two main drawbacks: opacity of the process and insufficient data. The black box nature of purely statistical processes makes them com pletely opaque to the human specialist. This causes great difficulty when judging whether intuitionally uninterpretable results reflect actual language usage, or are simply errors due to the insufficient data. Results therefore have to be either revised to meet the expert's intuition or accepted without revision. To this problem one should also add the fact that statistical methods usually require very large corpora to obtain reasonable results, which is highly unpractical and often unfeasible. This is especially the case if work takes place at a sublanguage level as large corpora become even more inac cessible. Following the research initiated in Arranz (1992) and based on the Epsilon system described in Tsujii & Ananiadou (1993), our aim is to discover a systematic methodology for sublanguage-specific semantic KA, applicable to different subject domains and multilingual corpora. The tool kit [e] being developed at CCL supports the principles of KA as an evolutionary process and from relatively small corpora, making it very practical for current NLP applications. This work represents an iterative and modular approach to statistical language analysis, where the acquired knowledge is stored in a Central Knowledge Base (CKB), which is shared and easy to access and update by all subprocesses in the system. Bearing these considerations in mind, we selected a highly specific cor pus, such as the Unix manual, of about 100,000 words. 2
Epsilon [Є]: process
Knowledge
acquisition
as
an
evolutionary
Epsilon'ts idea of knowledge acquisition as an evolutionary process avoids the above-mentioned problems by achieving the following: Stepwise acquisition of semantic clusters. Our system acquires knowledge as a result of stepwise refinement, therefore avoiding the opacity derived from the single-shot techniques used by purely statistical methods. The specialist inspects after every cycle the hypotheses of new pieces of knowledge proposed by the utility programs in [e]. Design of robust discovery methods. Early stages of the KA process are particularly problematic for statistical programs, due to the fact that the corpus is still very complex. We aim to reduce this complexity by
SUBLANGUAGE-BASED SEMANTIC CLUSTERING
127
initially using more robust techniques (to cope, for e.g., with words with low frequency of occurrence) before applying statistical methods. Inherent links between acquired knowledge and language us age. Epsilon easily deals with the opacity caused by the non-trivial nature of the mapping between the domain ontology and the language usage. The cases of words which denote several different ontological entities, or con versely, one entity denoted by different words, are often encountered in actual corpora, [Є] keeps a record of the pseudo-texts produced during the KA process (cf. below), as well as of their relationships with the acquired knowledge, so that the specialist can check and understand why certain clusterings take place and when. Effective minimum human intervention. As emphasised by Arad (1991) in her quasi-statistical system, human intervention is inevitable. However, in [Є] this intervention remains systematised and is only applied locally, whenever required by the process. The general idea of Knowledge Acquisition as an evolutionary process is illustrated in Figure 1 (Tsujii & Ananiadou 1993). Application of utility programs to Text-i and human inspection of the results yield the next version of knowledge (the i-th version), which in turn is the input to the next cycle of KA. This general framework is simplified if the results of text description are text-like objects (pseudo-texts), where the i-th version presents a lesser degree of complexity than the previous pseudo-text. The pseudo-texts obtained are characterised by the following: they present the same type of data structure as ordinary texts, i.e., an ordered sequence of words. The words contained in these pseudo-texts include both pseudo-words as well as ordinary words. Such pseudo-words can denote semantic categories to which the actual words belong, words with POS information, single concept-names corresponding to multi-word terms and disambiguated lexical items (like in Zernik 1991). Also, these pseudo-texts are fully compatible with the existing utility programs, and neither the input data nor the tool itself require any alteration. Finally, the degree of complexity of the text is approximated in relation to the number of different words and word tokens resulting from the several passes of the programs. Working on lipoprotein literature, Sager (1986) also shows that it is pos sible to meassure quantitative features such as the complexity of information contained in a sublanguage.
128
ARRANZ, RADFORD, ANANIADOU & TSUJII
Fig. 1: General scheme of KA as an evolutionary process
3 3.1
Knowledge acquisition process POS information
Once the Classify subprocess (cf. Section 5) was put into practice, it was observed that since no part-of-speech information was provided, great con fusion was caused at the replacement stage. A series of illegitimate substi tutions were carried out, which resulted in serious incoherence within the generated pseudo-texts. The input text was then preprocessed with Eric Brill's (Brill 1993) rulebased POS tagger. The accuracy of the tagger for the corpus in current use oscillates between 87.89% and 88.64%, before any training takes place, and 94.05%, with a single pass of training. This is quite impressive, if we take into consideration the specificity and technicality of the text. After providing the sample text with POS information, the set of can didates for semantically related clusters was much more accurate, and the wrong replacements of mixed syntactic categories ceased to take place. In addition, this corpus annotation allowed us to establish a tag compatibility set, which contributed in recovering part of the incorrectly rejected hypo theses posed for replacement. Such tag compatibility set consisted of a group
SUBLANGUAGE-BASED SEMANTIC CLUSTERING
129
of lines, each of them containing interchangeable part-of-speech markers. An example of one of these lines looks as follows: JJ JJR JJS VBN. 3.2
Modular configuration
The current version of the system consists of: 1. Central Knowledge Base, which stores all the relationships among words and pseudo-words obtained during the KA process. 2. Record of the pseudo-texts created, as well as the relationships between them, in terms of replacements or clusterings taking place. 3. A number of separate subprocesses (detailed below) which are involved during the processing of each pass of the system. These subprocesses rely upon the iterative application of simple analysis tools, updating the CKB with the knowledge acquired at each stage. The resulting modular system is of a simple-to-maintain and enhance nature. At present [e] contains three major processes involved in the KA task: (i) Compound] which generates hypotheses of multi-word expressions; (ii) Classify, which generates semantically-related term clusters; (iii) Re placement, which deals with the reduction of the complexity of the text, by replacing the newly-found pieces of information within the corpus. 4 4.1
The Compound
subprocess
Framework
This tool performs the search for those multi-word structures within the text that can be ranked as single ontological entities. This module was built to interact with the other existing module Classify, and with the CKB, so as to achieve any required exchange or storage of semantic information. Step 1. The first stage relies on the analysis of the corpus using a simple grammar, which is based upon pairs of words where the second word is a noun and the first is one of the class Noun, Gerund, Adjective. Using this grammar we extract descriptions of the structures of potential compound terms. Any single pass can thus only determine two-word compounds, re quiring multiple passes if longer compounds are to be found. These poten tial compounds are then filtered by simply ensuring that they occur in the corpus more than once. Step 2. The remaining candidates from Step 1 are then prioritised by calculating the mutual information (Church & Hanks 1989) of each pair.
130
ARRANZ, RADFORD, ANANIADOU & TSUJII
Step 3. Once the set of compound term candidates has been verified by the human expert, the replacement of each selected compound with a single token takes place. At present, this token is a composite which retains all of the original information within the corpus entry. For instance, the compound generated from the nouns environment/NN and variable/NN looks as follows: compound (environment/NNV~variable/NN)/NN where the whole structure maintains the grammatical category NN. Step 4. Among those potential compounds discovered, only 40% turned out to be positive cases (cf. Section 4.2). This problem was particularly acute in Adjective Noun and Gerund Noun cases, mainly as a result of the difficulty entailed by the distinction between such general language and domain-specific syntactic pairs. Due to the low frequency of some of the compounds in the corpus, the resulting MI scores were noisy and led to rather irregular results. The measurement of the specificity of the com pounding candidates was then carried out by means of a large corpus of general language (the LOB corpus (Johansson & Holland 1989)). Using the formula shown in equation 1, we established a specificity coefficient, which indicated how specific a particular word was to the sublanguage. (1) Step 5. This is another replacement stage, where the verified compound terms are substituted by compound identifiers, such as Compound67/NN. These identifiers are directly related to the CKB, where a record of the information relating to this token is stored. 4.2
Performance
Regarding the module's performance, the use of the simple grammar in Step 1 succeeds in filtering the around 500 hypotheses of multi-word expressions originally produced, reducing them to around 70 candidates. Out of these 70, 45 present Noun Noun pairs, and the remaining 25 are Adjective Noun or Gerund Noun pairs. As already discussed in Section 4.1, only 40% of the hypotheses belonging to the latter type of compounds were actually correct. Meanwhile, the Noun Noun pairs presented 85% of positive cases. By means of the filtering carried out with the LOB corpus, and using a threshold of 0.9 on adjectives, performance improves from a disappointing 40% to a promising 64% for those troublesome cases, and adds to a global
SUBLANGUAGE-BASED SEMANTIC CLUSTERING
131
Iteration Number Fig. 2: Compounding results 77.5%, just after the first pass. A value of 1.0 in the specificity scale implies that the word is unique to the sublanguage, while negative values represent a word which is more common in general language than in our subject domain sample text. It should be pointed out though, that currently the statistics regarding the word frequencies in the LOB corpus do not take POS information into account, making this filtering a rather limited resource. The future application of an annotated general language text is already being considered, so as to attempt to detect remaining errors. The replacement in Step 5 facilitates the storage of the information in the CKB, and it makes it more accessible for the subprocesses. Once formed, compound identifiers will be treated as an ordinary word with a particular syntactic label. The results obtained by the compounding module are shown in Figure 2. 5 5.1
The Classify
subprocess
Inverse KWIC
This context matching module represents the initial stage in [Є]'s subprocess Classify. Based on the principle that linguistic contexts can provide us with enough information to characterise the properties of words, and to obtain accurate word classifications (Sekine et al. 1992; Tsujii et al. 1992), semantic clusters are extracted by means of the concordance program CIWK (or Inverse KWIC) (Arad 1991). The following is a sample output from CIWK
132
ARRANZ, RADFORD, ANANIADOU & TSUJII
for a [3 3] parameter (3 words preceding and three succeeding): input/NN ;output/NN ; #name/NN of/IN the/DT $ bar-file/NN using/VBG the/DT This indicates that both nouns input/NN and output/NN share the same context at least once in the corpus. Once the list of semantic clusters has been finalised, the corpus is updated with all occurrences of those words within each cluster being replaced by the first word of that cluster. For instance, in the example above, all occurrences of input/NN and output/NN would be replaced by input/NN. For our experiments, a relatively small contextual size parameter has been selected (a [2 2]), so as to obtain a larger set of hypotheses. A list of about 700 semantic classes has been produced with this parameter.
5.2
Evaluation
Among the 700 clusters generated, there is an interesting number of cases which present crucial ontological and contextual features for our KA process. Unfortunately, there is also a significant amount of ambiguous clusters which require filtering. Work is currently taking place on this filtering process and some preliminary results can already be seen in Section 7.2. In spite of the interesting results initially obtained from CIWK, the exact matching technique this tool is based on is rather inflexible for the semantic clustering task. The semantic classes formed and the actual instances of each class can be seen in Figure 3.
6
C e n t r a l knowledge base
Although not fully implemented, our Central Knowledge Base plays a very important role within the system's framework. Due to Epsilon's modular approach and the open nature of the links between the stored acquired know ledge and the different subprocesses within the system, there is no need to retain newly extracted information in the corpus. Everything is maintained in the CKB by means of referentials, such as Semantic-classl8/NN (to refer to a resulting cluster from Classify) or Compound67/NN (to present one of the acquired compound expressions). This provides an easy method of updating and improving the knowledge base, a well as an opportunity to add new modules to the whole configuration of the system.
SUBLANGUAGE-BASED SEMANTIC CLUSTERING
133
Iteration Number Fig. 3: Semantic clustering results
7
7.1
Dynamic context matching techniques for semantic clustering disambiguation Word sense disambiguation
As mentioned in Section 5.2, an important number of ambiguous clusters take place with the use of Classify, which are in need of filtering. However, the CIWK algorithm is very inflexible and will only accept those candidates sharing exact matching contexts. In practice we often encounter instances of semantically-related words, but whose contexts vary slightly for various reasons. In other occasions one might find that differing contexts within the same term, or between different words, represent the different ontologies of such word(s), and therefore need disambiguating. Work on such filtering module is currently being undertaken, by means of a technique called Dy namic Alignment (Somers 1994). 7.2
Dynamic context matching
This technique allows us to compare the degree of similarity between two words, and it represents a much more flexible approach than the exact matching technique used in CIWK. Its aim is to discover all potential matches between a given set of individual words, attaching a value to each match according to its level of importance. Then, the set of matches pro ducing the highest total match strength is calculated. The obtained highest
134
ARRANZ, RADFORD, ANANIADOU & TSUJII
score is attributed to the pair of contexts, establishing thus a value on their similarity relation. For each pair of contexts, the best match value is calcu lated, which results in a correlation matrix. Figure 4 presents an example of the way all possible word matches are discovered for a particular pair of contexts. Given the constraint that the individual matches are not allowed to cross, the maximal set is chosen and thus, its value calculated. The fol lowing is the correlation matrix formed by the pair of words discussed/VBN and listed/VBN: '/, dynamic discussed/VBN listed/VBN +5 -5 < corpus Post context length set to 5 Pre context length set to 5 CIWK data read. 9 records found. 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
5 10 8
7 7 27
6 8 10 8 3 5 7 11 9 6 8 6 14 9 14
7 10 5 4 5 4 4 4 6 4 1 3 4 5 5
Partial Match Full Match
Fig. 4: Example match between two contexts The clustering algorithm used to determine the strongest semantic cluster in the matrix operates in a simple manner. Initially, the pair of contexts with the highest correlation is selected as the core of the cluster. Then, each remaining context is considered in turn, adding to the cluster those
SUBLANGUAGE-BASED SEMANTIC CLUSTERING
135
contexts which present a correlation value above a certain threshold, with respect to more than half the contexts already in the cluster. This will be repeated until no more contexts can be added to the cluster. Although this process is still being tested and required thresholds and parameters are being set, it has proved to present important advantages over Classify: it is more flexible and it implicitly solves the ambiguity prob lem detailed above. The contexts provided contain the necessary ontological knowledge allowing us to extract the different senses of the cluster compon ents, e.g., the above matrix found two different contextual clusters, showing two different meanings. 8
Concluding remarks
This system attempts to avoid the pitfalls faced by purely statistical tech niques of knowledge acquisition. As for this, the idea of KA as an evolution ary process is described in detail, and applied to the task of sublanguagespecific KA from small corpora. The iterative nature of our system enables statistical measures to be performed, in spite of the relatively small size of our sample text. The interactive framework of our implementation provides a simple way to access and store the acquired ontological knowledge, and it also allows our subprocesses to exchange information so as to obtain desir able results. REFERENCES Arad, Iris. 1991. A Quasi-Statistical Approach to Automatic Generation of Lin guistic Knowledge. Ph.D. dissertation, CCL, UMIST, Manchester, U.K. Arranz, Victoria. 1992. Construction of a Knowledge Domain from a Corpus. M.Sc. dissertation, CCL, UMIST, Manchester, U.K. Brill, Eric. 1993. A Corpus-Based Approach to Language Learning. Ph.D. dis sertation, University of Pennsylvania, Philadelphia. Brown, Peter F., Stephen A. Delia Pietra, Vincent J. Delia Pietra & Robert L. Mercer. 1991. "Word-Sense Disambiguation Using Statistical Methods". Proceedings of the 29th Annual Conference of the Association for Compu tational Linguistics (ACL'91), Berkeley, Califs 264-270. San Mateo, Calif.: Morgan Kaufmann. Church, Kenneth W. & Patrick Hanks. 1989. "Word Association Norms, Mutual Information, and Lexicography". Proceedings of the 27th Annual Confer ence of the Association for Computational Linguistics (ACL'89), Vancouver, Canada, 76-82. San Mateo, Calif.: Morgan Kaufmann.
136
ARRANZ, RADFORD, ANANIADOU & TSUJII
Grishman, Ralph & Richard Kittredge. 1986. Analysing Language in Restricted Domains: Sublanguage Description and Processing. New Jersey: Lawrence Erlbaum Associates. & John Sterling. 1992. "Acquisition of Selectional Patterns". Proceedings of the 14th International Conference on Computational Linguistics (COLING'92), Nantes, France, 658-664. , Lynette Hirschman & Ngo Thanh Nhan. 1986. "Discovery Procedures for Sublanguage Selectional Patterns: Initial Experiments". Computational Linguistics 12:3.205-215. Johansson, Stig & Knut Hofland. 1989. Frequency Analysis of English Vocabulary and Grammar: Based on the LOB Corpus, vol.1: Tag Frequencies and Word Frequencies. Oxford: Clarendon Press. Sager, Naomi. 1986. "Sublanguage: Linguistic Phenomenon, Computational Tool". Analysing Language in Restricted Domains: Sublanguage Description and Processing ed. by Ralph Grishman & Richard Kittredge, 1-17. New Jersey: Lawrence Erlbaum Associates. Sekine, Satoshi, Jeremy J. Carroll, Sofia Ananiadou & Jun-ichi Tsujii. 1992. "Automatic Learning for Semantic Collocation". Proceedings of the 3rd Con ference on Applied Natural Language Processing (ANLP'92), Trento, Italy, 104-110. New Jersey: ACL. Somers, Harold, Ian McLean & Daniel Jones. 1994. "Experiments in Multi lingual Example-Based Generation". Proceedings of the 3rd Conference on the Cognitive Science of Natural Language Processing (CSNLP'94), Dublin, Ireland: Dublin City University. Tsujii, Jun-ichi & Sofia Ananiadou. 1993. "Epsilon [e] : Tool Kit for Knowledge Acquisition Based on a Hierarchy of Pseudo-Texts". Proceedings of Natural Language Processing Pacific Rim Symposium (NLPRS'93), 93-101. Fukuoka, Japan. Tsujii, Jun-ichi, Sofia Ananiadou, Iris Arad & Satoshi Sekine. 1992. "Linguistic Knowledge Acquisition from Corpora". Proceedings of the International Workshop on Fundamental Research for the future Generation of Natural Language Processing (FGNLP), 61-81. Manchester, U.K. Zernik, Uri. 1991. "Trainl vs. Train2: Tagging Word Senses in Corpus". Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon ed. by Uri Zernik, 91-112. New Jersey: Lawrence Erlbaum Associates.
Customising a Verb Classification to a Sublanguage ROBERTO BASILI*, MICHELANGELO DELLA ROCCA*, M A R I A T E R E S A PAZIENZA* & PAOLA VELARDI**
* Universita' di Tor Vergata, Roma ** Universita' di Ancona Abstract In this paper we study the relationships between a general purpose, human coded verb classification, proposed in the WordNet lexical reference system, and a corpus driven classification model based on context analysis. We describe a context-based classifier that tunes WordNet to specific sublanguages and reduces its over-ambiguity.1 1
Sense disambiguation and sense tuning
The purpose of this study is to define a context-based statistical method to constrain and customise the WordNet type hierarchy, according to a specific sublanguage. Our context-based method is expected to tune the initial WordNet categorisation to a given corpus, in order to: • Reduce the initial ambiguity • Order each sense according to its relevance in the corpus • Identify new senses typical for the domain. These results could be useful for any NLP systems lacking in human support for word categorisation. The problem that we consider in this paper is strongly related to the problem of word-sense disambiguation. Given a verb and a representative set of its occurrences in a corpus, we wish to determine a subset of its initial senses, that may be found in the sublanguage. In case, new senses may be found, that were not included in the initial classification. Word sense disambiguation is an old-standing problem. Recently, several statistically based algorithms have been proposed to automatically disam biguate word senses in sentences, but many of these methods are hopelessly unusable, because they require manual training for each ambiguous word. 1
This paper summarises the results presented in the International Conference on Recent Advances in Natural Language Processing. The interested reader may refer to the RANLP proceedings for additional details on the experiments.
Exceptions are the simulated annealing method proposed in (Cowie et al. 1992), and the context-based method proposed in (Yarowsky 1992). Simulated annealing attempts to select the optimal combination of senses for all the ambiguous words in a sentence S. The source data for disambiguation are the LDOCE dictionary definitions and subject codes associated with each ambiguous word in the sentence S. The basic idea is that word senses that co-occur in a sentence will have more words and subject codes in common in their definitions. However, in (Basili et al. 1996) we experimentally observed that sense definitions for verbs in dictionaries might not capture the domain-specific use of a verb. For example, for the verb to obtain in the RSD we found patterns of use like: the algorithm obtains good results for the calculation... data obtained from the radar... the procedure obtains useful information by fitting... etc., while the (Webster's) dictionary definitions for this verb are: (i) to gain possession of: to acquire, (ii) to be widely accepted, none of which seems to fit the detected patterns. We hence think that the corpus itself, rather than dictionary definitions, should be used to derive disambiguation hints. One such approach is undertaken in (Yarowsky 1992), which inspired our method (Della Rocca 1994). In this paper our objectives and methods are slightly different from those in (Yarowsky 1992). First, the aim of our verb classifier is to tune an existing verb hierarchy to an application domain, rather than selecting the best category for a word occurring in a context. Second, since in our approach the training is performed on an unbalanced corpus (and for verbs, which notoriously exhibit fuzzier contexts), we introduced local techniques to reduce spurious contexts and improve the reliability of learning. Third, since we also expect domain-specific senses for a verb, during the classification phase we do not make any initial hypothesis on the subset of categories of a verb. Finally, we consider globally all the contexts in which the verb is encountered in a corpus, and compute a (domain-specific) probability distribution over its expected senses. In the next section the method is described in detail.
2  A context-based classifier
In his experiment, Yarowsky uses 726 Roget's categories as the initial classification. In our study, we use a more recently conceived, widely available classification system, WordNet.
CATEGORY               #VERBS   #SYNSETS
body (BD)                  78         76
change (CH)               287        412
cognition (CO)            200        218
communication (CM)        240        299
competition (CP)           63         73
consumption (CS)           48         41
contact (CT)              209        279
creation (CR)             124        133
emotion (EM)               47         50
perception (PE)            76         80
possession (PS)           122        156
social (SO)               217        240
stative (ST)              162        183

Table 1: Distribution of RSD verbs and synsets among the WordNet categories (excerpt)
We decided to adopt as an initial classification the 15 semantically distinct categories in which verbs have been grouped in WordNet. Table 1 shows the distribution of a sample of 826 RSD verbs among these categories, according to the initial WordNet classification. The average ambiguity of verbs among these categories is 3.5 for our sample in the RSD. In what follows we describe an algorithm to re-assign verbs to these 15 categories, depending upon their surrounding contexts in corpora. Our aim is to tune the WordNet classification to the specific domain as well as to capture rather technical verb uses that suggest semantic categories different from those proposed by WordNet. The method works as follows:
1. Select the most typical verbs for each category;
2. Acquire the collective contexts of these verbs and use them as a (distributional) description of each category;
3. Use the distributional descriptions to evaluate the (corpus-dependent) membership of each verb in the different categories.
In step 1 of the algorithm we learn a probabilistic model of categories from the application corpus. When training is performed on an unbalanced corpus (or on verbs, which are highly ambiguous and have variable contexts), local techniques are needed to reduce the noise of spurious contexts. Hence, rather than training the classifier on all the verbs in the learning corpus, we select only a subset of prototypical verbs for each category. We call these verbs the salient verbs of a category C. We define the typicality Tv(C)
CATEGORY               KERNEL VERBS
body (BD)              produce, acquire, emit, generate, cover
change (CH)            calibrate, reduce, increase, measure, coordinate
cognition (CG)         estimate, study, select, compare, plot, identify
communication (CM)     record, count, indicate, investigate, determine
competition (CP)       base, point, level, protect, encounter, deploy
consumption (CS)       sample, provide, supply, base, host, utilise
contact (CT)           function, operate, filter, segment, line, describe
creation (CR)          design, plot, create, generate, program, simulate
emotion (EM)           like, desire, heat, burst, shock, control
motion (MO)            well, flow, track, pulse, assess, rotate
perception (PC)        sense, monitor, display, detect, observe, show
possession (PS)        provide, account, assess, obtain, contribute, derive
social (SO)            experiment, include, manage, implement, test
stative (ST)           consist, correlate, depend, include, involve, exist
weather (WE)           scintillate, radiate, flare
Table 2: Excerpt of kernel verbs in the RSD

of v in C as the following ratio:

Tv(C) = Nv,C / Nv    (1)

where Nv is the total number of synsets of the verb v, i.e., all the WordNet synonymy sets including v, and Nv,C is the number of synsets of v that belong to the semantic category C, i.e., synsets indexed with C in WordNet. The synonymy Sv of v in C, i.e., the degree of synonymy shown by verbs other than v in the synsets of the class C in which v appears, is modelled by the following ratio:

Sv(C) = Ov,C / Ov    (2)

where Ov is the number of verbs in the corpus that appear in at least one of the synsets of v, and Ov,C is the number of verbs in the corpus appearing in at least one of the synsets of v that belong to C. Given (1) and (2), the salient verbs v for a category C can be identified by maximising the following function, which we call Score:

Scorev(C) = OAv x Tv(C) x Sv(C)    (3)

where OAv is the absolute number of occurrences of v in the corpus. The value of Score depends both on the corpus and on WordNet. OAv depends obviously
on the corpus. Instead, the typicality depends only on WordNet. A typical verb for a category C is one that is either unambiguously assigned to C in WordNet, or that has most of its senses (synsets) in C. Finally, the synonymy depends both on WordNet and on the corpus. A verb with a high degree of synonymy in C is one with a high number of synonyms in the corpus, with reference to a specific sense (synset) belonging to C. Salient verbs for C are frequent, typical, and have a high synonymy in C. The kernel of a category, kernel(C), is the set of salient verbs v with a 'high' Scorev(C). To select a kernel, we can either establish a threshold for Scorev(C), or fix the cardinality of kernel(C). We adopted the second choice, because of the relatively small number of verbs found in the medium-sized corpora that we used. Table 2 lists some of the kernel verbs in the RSD.
In step 2 of the algorithm, the collective contexts for each category are acquired. The collective contexts of a category C are acquired around the salient words for each category (see (Yarowsky 1992)), though we collect salient words using a ±10 window around the kernel verbs. Figure 1 plots the ratio of new words per context against the number of contexts acquired for each category, in the RSD and the MD. It is seen that, on average and for both domains, very few new words are detected over the threshold of 1000 contexts. This phenomenon is called saturation and is rather typical of sublanguages. However, some of the categories (like weather and emotion in the RSD) have very few kernel verbs.
In step 3, we need to define a function to determine, given the set of contexts K of a verb v, the probability distribution of its senses in the corpus. For a given verb v, and for each category C, we evaluate a function that we call Sense(v,C) (formula 4), defined in terms of a context weight (formula 5), where Ki is the i-th context of v and w is a word within Ki. In formula 5, Pr(C) is the (non-uniform) probability of a class C, given by the ratio between the number of collective contexts for C and the total number of collective contexts. A verb v has a high Sense value in a category if:
• it co-occurs 'often' with salient words of a category C;
• it has few contexts related to C, but these are more meaningful than the others, i.e., they include highly salient words for C.

Fig. 1: New words per context vs. number of contexts in MD and RSD

The corpus-dependent distribution of the senses of v among the categories can be analysed through the function Sense. Notice that, during the classification phase (step 3), the initial WordNet classification of ambiguous verbs is no longer considered (unlike in (Yarowsky 1992)). WordNet is used only during the learning phase, in which the collective contexts are built. Hence, new senses may be detected for some verbs. We need to establish a threshold for Sense(v, C) according to which the sense C is considered not relevant in the corpus for the verb v, given all its observed occurrences. Since the values of the Sense function do not have a uniform distribution across categories, we introduce the standard variable:

Nsense(v,C) = (Sense(v,C) - μC) / σC    (6)

where μC and σC are the average value and the standard deviation of the Sense function for all the verbs of C, respectively.
A verb v is said to belong to the class C if

Nsense(v,C) ≥ Nsense0    (7)

Under the hypothesis of a normal distribution for the values of (6), we experimentally determined that a reasonable choice is

Nsense0 = 1    (8)

With this threshold, we assign to a category C only those verbs whose Sense value is equal to or higher than μC + σC. In a normal distribution, this threshold eliminates 84% of the classifications. In the next section we discuss and evaluate the experimental results obtained for the two corpora.
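The kernel selection and the final thresholding step can be summarised in a short sketch. The fragment below is illustrative only: it assumes precomputed WordNet statistics (synset and synonym counts) and an externally supplied Sense(v, C) table, since formulas 4 and 5 are not reproduced here; all function and variable names are our own, not the authors'.

```python
import statistics

def score(v, C, occ, n_syn, n_syn_in, o_syn, o_syn_in):
    # Score_v(C) = OA_v * T_v(C) * S_v(C)   (formulas 1-3)
    typicality = n_syn_in[(v, C)] / n_syn[v]     # T_v(C) = N_{v,C} / N_v
    synonymy = o_syn_in[(v, C)] / o_syn[v]       # S_v(C) = O_{v,C} / O_v
    return occ[v] * typicality * synonymy

def kernel(C, verbs, k, **stats):
    # fixed-cardinality kernel(C): the k verbs with the highest Score
    return sorted(verbs, key=lambda v: score(v, C, **stats), reverse=True)[:k]

def classify(v, verbs, categories, sense, nsense0=1.0):
    # keep category C when Nsense(v,C) = (Sense(v,C) - mu_C) / sigma_C >= Nsense_0
    kept = []
    for C in categories:
        values = [sense[(u, C)] for u in verbs]
        mu, sigma = statistics.mean(values), statistics.stdev(values)
        if (sense[(v, C)] - mu) / sigma >= nsense0:
            kept.append(C)
    return kept
```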
3  Discussion of the results
Table 3 shows the sense values that satisfy condition (7), for an excerpt of randomly selected RSD verbs. The sign "*" indicates the initial WordNet classification. The average ambiguity of our sample of 826 RSD verbs is 2.2, while the initial WordNet ambiguity was 3.5. For 1,235 verbs of the MD, the average ambiguity is 2.1 and the initial was 2.9. We hence obtained a 30-40% reduction of the initial ambiguity. As expected, classes are more appropriate for the domain. Less relevant senses are eliminated (all empty boxes with a "*" in Table 3). New proposed categories are indicated by scores without the "*".
The function Sense, defined in the previous section, produces a new, context-dependent distribution of categories. In this section we evaluate and discuss our data numerically. First, we wish to study the commonalities and divergences between WordNet and our classification method. We introduce the following definitions:

A = {(v,C) | Nsense(v,C) ≥ Nsense0}
W = {(v,C) | Scorev(C) > 0}
I = A ∩ W

where A is the set of verbs classified in C according to their context, W is the set of verbs classified in C according to WordNet, and I is the intersection between the two sets. Two performance measures, which assume WordNet as an oracle, are the recall, defined as |I| / |W|, and the precision, i.e., |I| / |A|.
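Once the two classifications are available as sets of (verb, category) pairs, the comparison is immediate. The following lines are only a sketch of the bookkeeping, with invented variable names, not the authors' evaluation code.

```python
def compare(context_pairs, wordnet_pairs):
    # A: pairs accepted by the context-based classifier; W: pairs licensed by WordNet
    A, W = set(context_pairs), set(wordnet_pairs)
    I = A & W
    recall = len(I) / len(W)        # share of WordNet senses confirmed by the corpus
    precision = len(I) / len(A)     # 100% - precision = share of newly proposed senses
    return recall, precision
```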
VERBS      BD CH CG CM CP CS CT CR MO PC PS SO ST
apply      3.9* * * * 1.3* *
calculate  1.1* * *
change     * *
cover      * * * * * * * * 1.1*
gain       * 1.38 * * 4.9*
occur      3.8* * *
operate    1.1* 3.0* * *
point      1.0* * * 1.7* * 2.37
record     * * 2.8*
scan       2.1* 1.1* * * * * *
survey     3.4*
test       *
Table 3: Sense values for an excerpt of RSD verbs

This definition of recall measures the number of initial WordNet senses in agreement with our classifier. Under the perspective of sense tuning, the recall may be seen as measuring the capability of our classifier to reduce the initial WordNet ambiguity, while the percentage of new senses is given by 100% - precision.

Domain    RSD (200 verbs)   MD (341 verbs)
Recall    41%               40%

Table 4: A comparison between the corpus-driven classification and WordNet

Table 4 summarises recall and precision values for the two domains and shows that the corpus-driven classifications fit the expectations of the WordNet authors, while more than half of the initial senses (59% in the RSD, 60% in the MD) are pruned out! Furthermore, there are 13% and 18% newly detected categories in the MD and in the RSD, respectively. Of course, it is impossible to evaluate, other than manually, the plausibility of these new classifications. We will return to this problem at the end of this section.
A second possible evaluation of the method is a comparison between the classifications of unambiguous verbs. We found that in the large majority of cases there is a concordance between WordNet and our classifier.

Verbs     BD      CH      CG      CM      CP      CS      CT      CR      MO      PC      PS      SO      ST
convoy   -2.53   -3.07   -1.94   -2.98   -3.08    2.08   -2.37    0.41   51.9*   -1.19   -1.68   -2.19   -4.59
flex     -2.50   -4.76   -2.23   -4.42   -3.86   -4.20   -3.94   -3.18    9.14*  -2.60   -1.97   -3.94   -5.51
wake     34.9*    0.21    0.21   -0.98   -1.34    1.70   -0.25   -0.17   -1.03   -0.58   -0.83   -0.08   -1.16

Table 5: Nsense values for three verbs unambiguous in WordNet
Table 5 shows the values of the standard variable (6) for some verbs that are unambiguous in WordNet.
DOMAIN    RSD (140 verbs)   MD (170 verbs)
Recall    91%               85%

Table 6: Recall of the classification of unambiguous verbs

Table 6 globally evaluates the performance of the classifier over unambiguous verbs, for the two domains.
We also attempted a global linguistic analysis of our data. We observed that for some verbs the collective contexts acquired may not express their intended meaning (i.e., category) in WordNet. Moreover, technical uses of some verbs are idiosyncratic with respect to their WordNet category. Consider for example the verb to record in the medical domain. This verb is automatically classified in the categories communication and contact. The contact classification is new, that is, it was not included among the WordNet categories for to record. Initially, we examined all the occurrences of this verb (45 sentences) with the purpose of manually evaluating the classification choices of our system. Each of the authors of this paper independently attempted to categorise each occurrence of the verb in the MD corpus as either belonging to the categories proposed by WordNet for to record (communication) or to the new class contact. However, since the WordNet authors provided only very schematic descriptions of each category, each of us used his personal intuition of the definition of each category. The result was a set of almost totally divergent classification choices!
During the analysis of the sentences, we observed that the verb to record occurs in the medical domain in rather repetitive contexts, though the similarity of these contexts can only be appreciated through a generalisation process. Specifically, we found two highly recurrent generalised patterns:

A   record(Z,X,Y): subject(Z), object(physiological_state(X)), locative(individual(Y) or body_part(Y)).
(e.g., myelitis spinal cord injury tumours were recorded at the three levels parietal spinal cervical ...)

B   record(Z,X,Y): subject(Z), object(abstraction(X)), locative(information(Y)) or time(time_period(Y)).

(e.g., mortality rates were recorded in the study during the first month of life)
( In, normal, patients, potentials, of, a, uniform, shape, were, #, during, flaccidity )
( At, cutoff, frequencies Cavernous, electrical, activity, was, #, in, patients, with, erectile, dysfunction )
( Abnormal, findings, of, cavernous, electrical, activity, were, #, in, _, of, the, consecutive, impotent, patients )
( Morbidity, and, mortality, rates, were, #, in, the, first, month, of, life, Juveniles, and, yearlings, rarely )
( seconds, of, EMG, interference, pattern, were, #, at, a, maximum, voluntary, contractions, from, the, biceps )
( interference, pattern, IP, in, studies, were, #, using, a, concentric, needle, electrode, MUAPs, were, recorded )
( During, Hz, stimulation, twitches, #, by, measurement, of, the, ankle, dorsiflexor, group, displayed, increasing )
( Macro-electromyographic, MUAPs, were, #, from, patients, in, studies, MUAP, analysis, revealed )
( myelitis, spinal, cord, injury, tumours, The, SEPs, were, #, at, three, levels, parietal, spinal, cervical )
Table 7: Examples of contexts for the verb to record in MD

Above, unary functors (e.g., individual, information, ...) are WordNet labels. We then attempted to re-classify all the occurrences of the verb as either fitting scheme A or scheme B, regardless of WordNet categories. Table 7 shows a subset of contexts for the verb to record. The symbol "#" indicates an occurrence of the verb. Out of 45 sentences, only 5 did not clearly fit one of the two schemes. There was almost no disagreement among the four human classifiers, and, surprisingly enough (but not so much), we found a very strong correspondence between our partition of the set of sentences and that proposed by our context-based classifier. If we name class A contact and class B communication, we found 37 correspondences over 40 sentences. In the three non-corresponding cases the context included physiological states and/or body parts, though not as direct objects or modifiers of the verb. The system hence classified the verb as contact, though we selected scheme B. Somehow, it seems that the context-based classifier categorises a verb as contact not so much because it implies the physical contact of entities, but because the arguments of the verb are physical and are the same as those of truly contact verbs. For the same verb, a similar analysis has been performed on its 170 RSD contexts and comparable results have been obtained.
This experiment suggests that, even if viable (especially but not exclusively for verb investigation), a mere statistical analysis of the surrounding context of a single ambiguous word does not bring sufficient linguistic insight, though it provides a good global domain representation. Verb semantics (although domain-specific) is useful to explain and validate most of the acquired evidence. As an improvement, in the future, we plan to integrate the method described in this paper with a more semantically oriented corpus-based classification method, described in (Basili et al. 1995).
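The two generalised patterns lend themselves to a simple matching procedure over parsed contexts. The sketch below is an illustration only, not the authors' implementation: it assumes that each occurrence has already been reduced to role-labelled arguments tagged with WordNet top labels, and all names are hypothetical.

```python
# Hypothetical matcher for the two generalised schemes found for "to record".
SCHEME_A = {"object": {"physiological_state"}, "locative": {"individual", "body_part"}}
SCHEME_B = {"object": {"abstraction"}, "locative": {"information"}, "time": {"time_period"}}

def matches(context, scheme):
    # context: mapping from case role to the WordNet label of its filler,
    # e.g. {"subject": "group", "object": "physiological_state", "locative": "body_part"}
    return (any(role in context for role in scheme)
            and all(context.get(role) in labels
                    for role, labels in scheme.items() if role in context))

def classify_occurrence(context):
    if matches(context, SCHEME_A):
        return "A (contact)"
    if matches(context, SCHEME_B):
        return "B (communication)"
    return "unclassified"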
4  Final remarks
It is broadly agreed that most successful implementations of NLP applications are based on lexica. However, ontological and relational structures in general-purpose on-line lexica are often inadequate (i.e., redundant and over-ambiguous) at representing the semantics of specific sublanguages.
In this paper we presented a context-based method to tune a general-purpose on-line lexical reference system, WordNet, to sublanguages. The method was applied to verbs, one of the major sources of sense ambiguity. In order to acquire more statistically stable contextual descriptors, we used as the initial classification the 15 highest-level semantic categories defined in WordNet for verbs. We then used local (corpus-dependent) and global (WordNet-dependent) evidence to learn the collective contexts of each category and to compute the probability distribution of verb senses among the categories.
This tuning method proved to be reliable for a lexical category, like verbs, for which other statistically based classifiers proposed in the literature obtained weak results. For two domains, we could eliminate about 60% of the initial WordNet ambiguity and identify 10-20% new senses. Furthermore, we observed that, for some categories, the collective contexts acquired may be spurious with respect to the intended meaning of the category. A manual analysis revealed that a more semantically oriented representation of a category context would be greatly helpful in improving the performance of the system and in gaining more linguistically oriented information on category descriptions.

REFERENCES

Basili, Roberto, Maria Teresa Pazienza & Paola Velardi. 1996. "A Context Driven Conceptual Clustering Method for Verb Classification". Corpus Processing for Lexical Acquisition ed. by Branimir Boguraev & James Pustejovsky. Cambridge, Mass.: MIT Press.

Basili, Roberto, Maria Teresa Pazienza & Paola Velardi. Forthcoming. "An Empirical Symbolic Approach to Natural Language Processing". To appear in Artificial Intelligence, vol. 85, August 1996.

Cowie, Jim, J. Guthrie & L. Guthrie. 1992. "Lexical Disambiguation Using Simulated Annealing". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 359-365. Nantes, France.
Della Rocca, Michelangelo. 1994. Classificazione automatica dei termini di una lingua basata sulla elaborazione dei contesti [Context-Driven Automatic Classification of Natural Language Terms]. Ph.D. dissertation, Dept. of Electrical Engineering, Tor Vergata University, Rome.

Fellbaum, Christiane, R. Beckwith, D. Gross & G. Miller. 1993. "WordNet: A Lexical Database Organised on Psycholinguistic Principles". Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon ed. by U. Zernik, 211-232. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Yarowsky, David. 1992. "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 359-365. Nantes, France.
Concept-Driven Search Algorithm Incorporating Semantic Interpretation and Speech Recognition

AKITO NAGAI, YASUSHI ISHIKAWA & KUNIO NAKAJIMA

MITSUBISHI Electric Corporation

Abstract
This paper discusses issues concerning incorporating speech recognition with semantic interpretation based on concepts. In our approach, a concept is a unit of semantic interpretation and an utterance is regarded as a sequence of concepts with an intention, to attain both linguistic robustness and constraints for speech recognition. First, we propose a basic search method for detecting concepts from a phrase lattice by island-driven search, evaluating the linguistic likelihood of concept hypotheses. Second, an improved method to search efficiently for N-best meaning hypotheses is proposed. Experimental results of speech understanding are also reported.
1  Introduction
A 'spoken language system' for a naive user must have linguistic robustness because utterances show a large variety of expressions, which are often ill-formed (Ward 1993:49-50; Zue 1994:707-710). How does a language model cover such a variety of sentences? There is a crucial issue closely related to linguistic robustness: how do we exploit linguistic constraints to improve 'speech recognition'?
Syntactic constraint contributes to improving speech recognition, but it is not robust because it limits sentential expressions. Several recent works have tried to solve these linguistic problems by relaxing grammatical constraints or applying the 'partial parsing' technique (Stallard 1992:305-310; Seneff 1992:299-304; Baggia 1993:123-126). This technique is based on the principle that a whole utterance can be analysed with a syntactic grammar even if the utterance is partly ill-formed. It is, however, likely that the partial parser cannot create even a partial tree for an utterance in free phrase order in 'spontaneous speech', and this linguistic feature is normal in Japanese.
Thus, one key issue in attaining linguistic robustness is exploiting semantic knowledge to represent relations between phrases by semantic-driven
processing. One of the methods for doing this is to use case frames based on predicative usage. In this approach, a hypothesis explosion, owing to both word-sense ambiguity and many recognised candidates, occurs if only semantic constraint is used without syntactic constraint. Therefore, a framework to evaluate growing meaning hypotheses, based on both syntactic and semantic viewpoints, is indispensable in the process of 'semantic interpretation' from a 'phrase lattice' to a meaning representation.
In our previous work (Nagai et al. 1994a, 1994b), we proposed a semantic interpretation method for obtaining both linguistic robustness and constraints for speech recognition. This paper aims to focus on issues concerning the integration of this semantic interpretation and speech recognition, and to evaluate the performance of 'speech understanding'.
2  Semantic interpretation based on concepts
Our approach is based on the idea that a semantic item represented by a partial expression can be a unit of semantic interpretation. We call this unit a concept. We consider that: (1) a concept is represented by phrases which are continuously uttered in a part of a sentence, (2) a sentence is regarded as a sequence of concepts, and (3) a user talks about concepts with an intention. A concept is defined to represent a target task: for example, concepts for the Hotel Reservation task are Date, Stay, Hotel Name, Room Type, Distance, Cost, Meal, etc. The representation is based on a semantic frame. An intention is defined as an attributive type of the meaning frame of a whole utterance. A meaning frame registers an intention that constrains a set of concept frames. The intention types are defined as reservation, change, cancel, WH-inquiry, Y/N-inquiry, and consultation.
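To make the frame organisation concrete, the following sketch shows one possible way to represent concept frames, meaning frames and intentions as plain data structures. The field names and the Hotel Reservation slots are taken from the text; everything else is an assumption added for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Intention types named in the text for the Hotel Reservation task.
INTENTIONS = ["reservation", "change", "cancel", "WH-inquiry", "Y/N-inquiry", "consultation"]

@dataclass
class ConceptFrame:
    """A semantic item uttered as a contiguous chunk of phrases (e.g., Date, Cost)."""
    name: str                                        # e.g. "Date", "Room Type"
    slots: Dict[str, Optional[str]] = field(default_factory=dict)
    start: int = 0                                   # position of the first phrase
    end: int = 0                                     # position of the last phrase
    score: float = 0.0                               # combined acoustic/linguistic likelihood

@dataclass
class MeaningFrame:
    """A whole-utterance hypothesis: an intention plus a sequence of concepts."""
    intention: str
    concepts: List[ConceptFrame] = field(default_factory=list)
    score: float = 0.0
```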
2.1  Basic process
Figure 1 illustrates the principle of the proposed method. The total process can be divided into concept detection and meaning hypothesis generation. In detecting concepts, slots are filled by phrase candidates which can be concatenated in the phrase lattice, based on examining the semantic value and a particle. A phrase candidate which has no particle is examined using only its semantic value. This phrase candidate has case-level ambiguity, and each case is hypothesised.
In generating meaning hypotheses, the main process consists of two subprocesses. First, an intention type is hypothesised using: (1) key predicates
which relate semantically to each intention, (2) a particle standing for an inquiry, and (3) interrogative adverbs. If a key predicate is not detected, the intention type is guessed using the semantic relation between concepts. Second, concept hypotheses are combined using meaning frames which are associated with each intention type. All meaning hypotheses for an entire sentence are generated as the meaning frames which have slots filled with concept hypotheses.
Fig. 1: Semantic interpretation based on concepts
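A minimal sketch of the two sub-processes of meaning hypothesis generation might look as follows. Only the control flow mirrors the description above; the cue lists, attribute names and helper calls are invented placeholders, not the authors' implementation.

```python
def hypothesise_intentions(phrases, key_predicates, inquiry_particles, interrogatives):
    """Step 1: propose intention types from key predicates, inquiry particles
    and interrogative adverbs."""
    found = {intent for p in phrases
             for intent, cues in key_predicates.items() if p.lemma in cues}
    if any(p.particle in inquiry_particles for p in phrases):
        found.add("WH-inquiry" if any(p.lemma in interrogatives for p in phrases)
                  else "Y/N-inquiry")
    # placeholder for the concept-relation guess used when no cue is detected
    return found or {"consultation"}

def generate_meaning_hypotheses(concept_lattice, intentions, meaning_frames):
    """Step 2: combine concept hypotheses using the meaning frames
    associated with each hypothesised intention type."""
    hypotheses = []
    for intent in intentions:
        for frame in meaning_frames[intent]:
            hypotheses.extend(frame.fill(concept_lattice))   # assumed slot-filling helper
    return hypotheses
```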
2.2  Reduction of ambiguity in concept hypotheses
Many senseless meaning hypotheses remain owing to ambiguity of word sense, cases of a phrase, and boundaries of concepts. Two methods are used to reduce the ambiguity.
First, two existence conditions of a concept are assumed. One is that a concept should have filled slots which are indispensable to the gist of the concept. The other condition is that a concept should occupy a continuous part of a sentence. This assumes that a user talks about a semantic item as a chunk of phrases.
Second, the linguistic likelihood of a concept hypothesis is evaluated by a scoring method which considers linguistic dependency between phrases. This method is based on penalising linguistic features instead of using syntactic rules, in order to obtain less rigid syntactic constraints. If a new
concept hypothesis is produced, it is examined on the basis of all penalty rules. The total score of all concept hypotheses is evaluated as the linguistic likelihood of a meaning hypothesis. Some principles for defining penalty rules are shown in Table 1.
Syntactic features:
• Deletion of key particle
• Inversion of attributive case and substantive case
• Adverbial case without predicative case
• Inadequate conjugation of verbs
• Inversion of predicative case and other cases
• Predicative case without other cases

Semantic features:
• Semantic mismatch between phrase candidates
• Abstract noun without modifiers
Table 1: Principles for defining penalty rules

The advantageous features of this semantic interpretation method are considered to be: (1) better coverage of sentential expressions than sentence-level syntactic rules, (2) suppression of a hypothesis explosion by treating a concept as the target of semantic constraints, and (3) portability of commonly defined concepts that can be shared across different tasks.
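As an illustration of penalty-based scoring (not the authors' actual rule set), a concept hypothesis can simply accumulate a penalty for each violated feature; the rule names follow Table 1, while the weights below are invented.

```python
# Illustrative penalty weights; only the rule names are taken from Table 1.
PENALTIES = {
    "deletion_of_key_particle": 2.0,
    "adverbial_case_without_predicative_case": 1.5,
    "inadequate_verb_conjugation": 1.0,
    "semantic_mismatch_between_phrases": 3.0,
    "abstract_noun_without_modifiers": 0.5,
}

def linguistic_score(violations):
    """Linguistic likelihood of a concept hypothesis: start from zero and
    subtract a penalty for every violated feature."""
    return -sum(PENALTIES.get(v, 1.0) for v in violations)

def meaning_hypothesis_score(concept_hypotheses):
    """The total score over all concept hypotheses is taken as the linguistic
    likelihood of the meaning hypothesis."""
    return sum(linguistic_score(c["violations"]) for c in concept_hypotheses)
```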
3  Integrating speech recognition
For integration with speech recognition, we use 'island-driven search' for detecting concept hypotheses (Figure 2).
3.1  Basic process
First, the speech recogniser based on 'phrase spotting' sends a phrase lattice and pause hypotheses to the semantic interpreter. A concept lattice is then generated from the phrase lattice by the island-driven search. In this process, reliable phrase candidates are selected as seeds for growing concept hypotheses. Each concept hypothesis is extended both forwards and backwards, considering the existence of gaps, overlaps, and pauses. To select phrase candidates for the extension, several criteria concerning the concatenation of phrase candidates are used, as follows: (1) Gaps and overlaps between phrases are permitted, if their length is within the permitted limit. (2) Pauses are permitted between phrases, considering gaps and overlaps, within the permitted limit. (3) Phrases which satisfy the two existence conditions of a concept are connected. (4) Both acoustic and linguistic likelihood are
given to a concept hypothesis whenever it is extended to integrate a phrase candidate. If the likelihoods are worse than their thresholds, the hypothesis is abandoned.
Finally, meaning hypotheses for a whole sentence are generated by concatenating concept hypotheses in the concept lattice. This search is performed in a best-first manner. In connecting concept hypotheses, the linguistic likelihood of growing meaning hypotheses is also evaluated, and the existence of gaps, overlaps, and pauses between concept hypotheses is considered within the permitted limit. The linguistic scoring method evaluates growing concept hypotheses and abandons hopeless hypotheses. The total score of acoustic and linguistic likelihood is given as ST = αSL + (1 - α)SA, where ST is the total score, SL is the linguistic score, SA is the acoustic score, and α is the weighting factor.
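The island-driven detection of concept hypotheses can be sketched as a best-first expansion around reliable seed phrases. This is only one reading of the procedure: the lattice interface, the seed attributes and the thresholding policy below are assumptions, not the authors' implementation.

```python
import heapq
from itertools import count

def total_score(linguistic, acoustic, alpha=0.5):
    # S_T = alpha * S_L + (1 - alpha) * S_A; alpha is a tunable weighting factor
    return alpha * linguistic + (1 - alpha) * acoustic

def detect_concepts(phrase_lattice, seeds, score_threshold):
    """Grow each reliable seed phrase forwards and backwards into an 'island',
    attaching neighbouring phrase candidates while the combined score stays good."""
    tie = count()
    concept_lattice = []
    for seed in seeds:
        frontier = [(-total_score(seed.linguistic, seed.acoustic), next(tie), [seed])]
        while frontier:
            neg_score, _, island = heapq.heappop(frontier)
            grew = False
            for cand in phrase_lattice.neighbours(island):   # respects gap/overlap/pause limits
                new_island = island + [cand]
                s = total_score(sum(p.linguistic for p in new_island),
                                sum(p.acoustic for p in new_island))
                if s >= score_threshold:                     # otherwise abandon the hypothesis
                    heapq.heappush(frontier, (-s, next(tie), new_island))
                    grew = True
            if not grew:
                concept_lattice.append((-neg_score, island)) # finished concept hypothesis
    return concept_lattice
```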
Fig. 2: Detecting concept hypotheses
3.2  Speech understanding experiments
Experiments were performed on 50 utterances of one male on the Hotel Reservation task. The uttered sentences were made by 10 subjects instructed to make conversational sentences with no limitation on sentential expressions. The average number of phrases was 5.8 per sentence. Intra-phrase grammar with a 356-word vocabulary is converted into phrase networks. For the spotting model, the phrase networks are joined to background models which allow all connections of words or phrases (Hanazawa 1995:2137-2140). Speaker-independent phonemic 'hidden Markov models' (HMMs)
are used. Phrase lattices provided by speech recognition included 'false alarms' from 10 to 30 times the average number of input phrases.
The standards for judging an answer correct are: (1) concepts and their boundaries are correctly detected, (2) cases are correctly assigned to phrase candidates, and (3) semantic values are correctly extracted. A best performance of 92% at the first rank was achieved, as shown in Table 2. This shows that the proposed semantic interpretation method is capable of robustly understanding various spoken sentences. Moreover, we see that using the total score improves the performance of speech understanding. This is because totalising both acoustic and linguistic likelihood improves the likelihood of a correct meaning hypothesis which is not always best in both acoustic and linguistic likelihood.

background model    rank 1    ≤ 2    ≤ 3    ≤ 4    ≤ 5
word A T 82 80 84 82 86 88 86 88
phrase A T 82 92 84 94 90 92 96
A: ordered with priority to acoustic score. T: total score.

Table 2: Understanding rate (%): 50 utterances of one male

These results, however, leave room for some discussion. First, performance was hardly improved in the case of the word background model, although the total score was used. The reason for this is that the constraints of the linguistic penalty rules were not powerful enough to exclude more false alarms than in the case of the phrase background model. The penalty rules have to be designed in more detail. Second, the errors were mainly caused in the following cases: (1) when the length of gaps exceeded the permitted limit owing to deletion errors of particles and pauses, causing failure of phrase connection, and (2) when seeds for concept hypotheses were not detected in the seed selection stage. To cope with these errors, (1) speech recognition has to be improved using, for example, context-dependent precise HMMs, and (2) a search strategy considering the seed deletion error is required.
4  Improving search efficiency
In this section, we propose an improved search method which overcomes computational problems arising from seed deletion errors (Nagai et al. 1994b:558-563). In searching a phrase lattice, it is very important to perform an efficient search, selecting reliable phrase candidates at as high a rank as possible. But if only reliable candidates are selected to limit the search space, correct phrase candidates with lower likelihoods will be missed, just like seed deletion errors. This compels us to lower the threshold to avoid the deletion error, and, as a result, the amount of computation increases sharply. To solve this problem, the improved method quickly generates initial meaning hypotheses which allow deletion of concepts. Then, these initial meaning hypotheses are repaired by re-searching for missing concepts using prediction knowledge associated with the initial meaning hypotheses.
Fig. 3: Principle of improved search method
4.1  Basic process
The total process is composed of concept lattice generation, initial meaning hypothesis generation, acceptance decision, and the repairing process (Figure 3). To start with, the concept lattice is generated using only a small number of reliable phrase candidates by the concept lattice generation module. In this process, the number of concept hypotheses is also reduced to
improve the quality of the concept lattice. Next, the initial meaning hypothesis generation module generates meaning hypotheses which are incomplete as regards coverage of an utterance, but are reliable. Deletion sections are penalised in proportion to their length, because the initial meaning hypotheses should cover an utterance as widely as possible.
Then, the acceptance decision module judges whether the initial meaning hypotheses are acceptable or not. Acceptable means that an initial meaning hypothesis satisfies two conditions: (1) it covers a whole utterance fully, and (2) it would not be possible to attain a better meaning hypothesis by re-searching the phrase lattice. This process is illustrated in Figure 4. The best likelihood possible after repairing hypotheses (set A) can be estimated, since the maximum likelihood in re-searching deletion sections will be less than the seed threshold value.
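The acceptance test can be phrased as a simple bound check. The sketch below is only one way to formulate it, with assumed attribute names: the score attainable by repairing an incomplete hypothesis is bounded by filling every deletion section at the seed-threshold likelihood.

```python
def upper_bound_after_repair(hypothesis, seed_threshold):
    # Every deletion section can at best be filled by concepts whose likelihood
    # stays below the seed threshold, so this bounds the repaired score.
    return hypothesis.score + hypothesis.deletion_length * seed_threshold

def acceptable(best_full, incomplete_hypotheses, seed_threshold):
    """best_full: the best hypothesis that already covers the whole utterance.
    It is accepted when no incomplete hypothesis could overtake it after repair."""
    return all(best_full.score >= upper_bound_after_repair(h, seed_threshold)
               for h in incomplete_hypotheses)
```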
Fig. 4: Acceptance decision

If the hypotheses are not acceptable, the repairing process module re-searches the phrase lattice for concepts in the limited search space of deletion sections. There is, however, a risk of failing to detect concepts, because both concept hypotheses neighbouring a deletion section are considered not to be reliable. Therefore, additional meaning hypotheses are also generated to be repaired, assuming that such errors occur in either concept. We use a simple method to make these hypotheses: either concept hypothesis of the unreliable two is deleted and replaced with a new concept hypothesis which is re-searched and can fill the deletion.
The search space of the re-searching process can be reduced by limiting concepts. Such concepts can be associated with both concept hypotheses and the intention of the initial meaning hypotheses which is already attained. In the case shown in Figure 5, for example, the concepts "Cancel"
or "Distance" can be abandoned considering a situation where an intention "HOW MUCH" and concepts "Hotel Name", "Room Type", and "Cost" are obtained. As concept prediction knowledge, three kinds of coexistence rela tions are defined which concern (1) an intention and a verb, (2) an intention and a concept, and (3) two concepts.
Fig. 5: Prediction of concepts
4.2  Speech understanding experiments
To evaluate search efficiency, an experimental comparison was performed on two search methods: the basic search method mentioned in section 3 and this improved search method. The former searches all phrase candidates after detecting seeds in the stage of generating the concept lattice, while the latter searches a limited set of reliable phrase candidates and re-searches predicted concepts if deletion sections exist. Experimental conditions were almost the same as in section 3, but the number of false alarms in the phrase lattice was increased for the purpose of clarifying differences in processing time. The spotting model was the phrase background model. Thirteen types of intention were used. Table 3 shows the results of the baseline method without the re-searching technique, and Table 4 shows the results for the improved search method. "Seeds" in Table 3 means seeds for concept hypotheses in generating concept lattices, and "seeds" in Table 4 means reliable phrase candidates for generating initial meaning hypotheses. CPU times were computed on the DEC ALPHA 3600 workstation.
# seeds    rate (%), 1st rank    ≤ 5th    CPU time (s.)
100        88                    98       15.6
30         88                    96       14.2
20         88                    96       16.9
15         90                    96       12.3
10         84                    90       11.2
5          66                    72        6.0

Table 3: Understanding rate and processing time: baseline search method, 50 utterances of one male. Number of false alarms: max. 227, ave. 75

# seeds    rate (%), 1st rank    ≤ 5th    CPU time (s.)    # utterances repaired
30 88 98 1.7 2
20 88 96 1.2 3
15 88 96 3.1 10
10 84 94 3.8 13
5 64 76 3.7 27
Table 4: Understanding rate and processing time: improved search method, 50 utterances of one male

These results show that the proposed search method using the repairing technique achieved a successful reduction in processing time. Moreover, the repairing process effectively kept the understanding rate almost equal to the rate of the baseline method in the case when deletion errors occurred owing to a small number of seeds. Processing time, however, tends to increase if the number of repetitions of the repairing process increases. One of the reasons for this is considered to be that the constraints of concept prediction were not so powerful in the Hotel Reservation task. In this task, the relations between concepts and intentions are only slightly exclusive, because most concepts can coexist as parameter values for retrieving the hotel database. If this method is applied to a task where the relations of concepts and intentions are more distinct, for example a task where interrogative adverbs appear frequently, the constraints of the concepts are considered to become stronger.
There is ample room for further improvement in the re-search method for repairing initial meaning hypotheses. The present method does not use information concerning the two concept hypotheses neighbouring a deletion section, but only replaces them with concept hypotheses which are re-searched. Using this information will help reduce the search space in the repairing process. One of the methods for this improvement will be to try to extend both concept hypotheses in order to judge whether a better likelihood can be obtained or not before replacing them.
5  Concluding remarks
We proposed a two-stage semantic interpretation method for robustly understanding spontaneous speech and described its integration with speech recognition. In this approach, the proposed concept has three roles: as a robust interpreter of various partial expressions, as a target of semantic constraints, and as a basic unit for understanding a whole meaning. This semantic interpretation was successfully integrated with speech recognition by island-driven lattice search for generating a concept lattice and exploiting linguistic scoring knowledge.
This baseline system achieved good performance with a 92% understanding rate at the first rank. Moreover, we developed an efficient search method which quickly generates initial meaning hypotheses allowing deletion errors of correct concepts, and repairs them by re-searching for missing concepts using prediction knowledge associated with the initial meaning hypotheses. This technique considerably reduced search processing time, to approximately one-tenth in an experimental comparison with the baseline method.
Future enhancements will include: (1) detailed design of general linguistic knowledge for scoring the linguistic likelihood of a concept, (2) evaluation of this semantic interpretation as applied to other tasks using spontaneous speech data from naive speakers, (3) development of an interpretation method for a 'complex sentence' (Nagai et al., forthcoming), and (4) dealing with 'unknown words'.

REFERENCES

Baggia, Paolo & Claudio Rullent. 1993. "Partial Parsing as Robust Parsing Strategy". Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'93), Minneapolis, Minn., vol.II, 123-126. New York: The Institute of Electrical and Electronics Engineers (IEEE).

Goodine, David, Eric Brill, James Glass, Christine Pao, Michael Phillips, Joseph Polifroni, Stephanie Seneff & Victor Zue. 1994. "GALAXY: A Human-Language Interface to On-Line Travel Information". Proceedings of the International Conference on Spoken Language Processing (ICSLP'94), Yokohama, Japan, vol.II, 707-710. Tokyo: The Acoustical Society of Japan.

Hanazawa, Toshiyuki, Yoshiharu Abe & Kunio Nakajima. 1995. "Phrase Spotting using Pitch Pattern Information". Proceedings of the 4th European Conference on Speech Communication and Technology (EUROSPEECH'95), Madrid, Spain, vol.III, 2137-2140. Madrid: Graficas Brens.
Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. 1994a. "A Semantic Interpretation Based on Detecting Concepts for Spontaneous Speech Understanding". Proceedings of the International Conference on Spoken Language Processing (ICSLP'94), Yokohama, Japan, vol.I, 95-98. Tokyo: The Acoustical Society of Japan.

Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. 1994b. "Concept-Driven Semantic Interpretation for Robust Spontaneous Speech Understanding". Proceedings of the Fifth Australian International Conference on Speech Science and Technology (SST'94), Perth, W.A., Australia, vol.I, 558-563. Perth: Univ. of Western Australia.

Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. Forthcoming. "Integration of Concept-Driven Semantic Interpretation with Speech Recognition". To appear in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'96), Atlanta, Ga.

Seneff, Stephanie. 1992. "A Relaxation Method for Understanding Spontaneous Speech Utterances". Proceedings of the Defence Advanced Research Projects Agency (DARPA) Speech and Natural Language Workshop, Harriman, N.Y., 299-304. San Mateo, Calif.: Morgan Kaufmann.

Stallard, David & Robert Bobrow. 1992. "Fragment Processing in the DELPHI System". Proceedings of the Defence Advanced Research Projects Agency (DARPA) Speech and Natural Language Workshop, Harriman, N.Y., 305-310. San Mateo, Calif.: Morgan Kaufmann.

Ward, Wayne & Sheryl R. Young. 1993. "Flexible Use of Semantic Constraints in Speech Recognition". Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'93), Minneapolis, Minn., vol.II, 49-50. New York: The Institute of Electrical and Electronics Engineers (IEEE).
A Proposal for Word Sense Disambiguation Using Conceptual Distance

ENEKO AGIRRE & GERMAN RIGAU

Euskal Herriko Unibertsitatea & Universitat Politecnica de Catalunya

Abstract
This paper presents a method for the resolution of lexical ambiguity and its automatic evaluation over the Brown Corpus. The method relies on the use of the wide-coverage noun taxonomy of WordNet and the notion of conceptual distance among concepts, captured by a Conceptual Density formula developed for this purpose. This fully automatic method requires no hand coding of lexical entries, hand tagging of text nor any kind of training process. The results of the experiment have been automatically evaluated against SemCor, the sense-tagged version of the Brown Corpus.
1  Introduction
Word sense disambiguation is a long-standing problem in Computational Linguistics. Much of recent work in lexical ambiguity resolution offers the prospect that a disambiguation system might be able to receive as input unrestricted text and tag each word with the most likely sense with fairly reasonable accuracy and efficiency. The most extended approach is to at tempt to use the context of the word to be disambiguated together with information about each of its word senses to solve this problem. Several interesting experiments in lexical ambiguity resolution have been performed in recent years using preexisting lexical knowledge resources. Cowie et al. (1992) and Guthrie et al. (1993) describe a method for lexical disambiguation of text using the definitions in the machine-readable version of the LDOCE dictionary as in the method described in Lesk (1986), but using simulated annealing for efficiency reasons. Yarowsky (1992) combines the use of the Grolier encyclopaedia as a training corpus with the categor ies of the Roget's International Thesaurus to create a statistical model for the word sense disambiguation problem with excellent results. Wilks et al. (1993) perform several interesting statistical disambiguation experiments 1 2
Eneko Agirre was supported by a grant from the Basque Government. German Rigau was supported by a grant from the Ministerio de Educación y Ciencia.
using co-occurrence data collected from LDOCE. Sussna (1993), Voorhees (1993) and Richarson et al. (1994) define disambiguation programs based on WordNet with the goal of improving precision and coverage during document indexing.
Although each of these techniques looks somewhat promising for disambiguation, they have been applied only to a small number of words, to a few sentences, or not to a public-domain corpus. For this reason we have tried to disambiguate all the nouns from real texts in the public-domain sense-tagged version of the Brown Corpus (Francis & Kucera 1967; Miller et al. 1993), also called the Semantic Concordance or SemCor for short. We also use a public-domain lexical knowledge source, WordNet (Miller 1990). The advantage of this approach is clear, as SemCor provides an appropriate environment for testing our procedures in a fully automatic way. It also defines, for the purpose of this study, word-sense as the sense present in WordNet.
This paper presents a general automatic decision procedure for lexical ambiguity resolution based on a formula of the conceptual distance among concepts: Conceptual Density. The system needs to know how words are clustered in semantic classes, and how semantic classes are hierarchically organised. For this purpose, we have used a broad semantic taxonomy for English, WordNet. Given a piece of text from the Brown Corpus, our system tries to resolve the lexical ambiguity of nouns by finding the combination of senses from a set of contiguous nouns that maximises the total Conceptual Density among senses.
Even if this technique is presented as stand-alone, it is our belief, following the ideas of McRoy (1992), that full-fledged lexical ambiguity resolution should combine several information sources. Conceptual Density might be only one piece of evidence for the plausibility of a certain word sense.
Following this introduction, Section 2 presents the semantic knowledge sources used by the system. Section 3 is devoted to the definition of Conceptual Density. Section 4 shows the disambiguation algorithm used in the experiment. In Section 5, we explain and evaluate the experiment performed. In the last section some conclusions are drawn.
2  WordNet and the semantic concordance
Sense is not a well-defined concept and often has subtle distinctions in topic, register, dialect, collocation, part of speech, etc. For the purpose of this study, we take as the senses of a word those present in WordNet
version 1.4. WordNet is an on-line lexicon based on psycholinguistic theories (Miller 1990). It comprises nouns, verbs, adjectives and adverbs, organised in terms of their meanings around semantic relations, which include, among others, synonymy and antonymy, hypernymy and hyponymy, meronymy and holonymy. Lexicalised concepts, represented as sets of synonyms called synsets, are the basic elements of WordNet. The senses of a word are represented by synsets, one for each word sense. The version used in this work, WordNet 1.4, contains 83,800 words, 63,300 synsets (word senses) and 87,600 links between concepts. The nominal part of WordNet can be viewed as a tangled hierarchy of hypo/hypernymy relations. Nominal relations also include three kinds of meronymic relations, which can be paraphrased as member-of, made-of and component-part-of.
SemCor (Miller et al. 1993) is a corpus where a single part-of-speech tag and a single word sense tag (which corresponds to a WordNet synset) have been included for all open-class words. SemCor is a subset taken from the Brown Corpus (Francis & Kucera 1967) which comprises approximately 250,000 words out of a total of 1 million words. The coverage in WordNet of the senses for open-class words in SemCor reaches 96% according to the authors. The tagging was done manually, and the error rate measured by the authors is around 10% for polysemous words.
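For readers who want to inspect the noun hierarchy themselves, the snippet below uses the NLTK interface to a modern WordNet release purely as an illustration; the authors worked directly with WordNet 1.4, so counts and sense inventories will differ.

```python
# Illustration only: requires the NLTK WordNet data to be installed.
from nltk.corpus import wordnet as wn

def noun_senses(word):
    """Each sense of a noun is a synset; hypernym links give the tangled hierarchy."""
    for synset in wn.synsets(word, pos=wn.NOUN):
        print(synset.name(), "->", [h.name() for h in synset.hypernyms()])

def descendants(synset):
    """Number of descendant senses under a synset (used below by Conceptual Density)."""
    return sum(1 for _ in synset.closure(lambda s: s.hyponyms())) + 1

noun_senses("operation")
```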
3  Conceptual density and word sense disambiguation
A measure of the relatedness among concepts can be a valuable prediction knowledge source for several decisions in Natural Language Processing. For example, the relatedness of a certain word-sense to the context allows us to select that sense over the others, and actually disambiguate the word. Relatedness can be measured by a fine-grained conceptual distance (Miller & Teibel 1991) among concepts in a hierarchical semantic net such as WordNet. This measure would allow us to discover reliably the lexical cohesion of a given set of words in English.
Conceptual distance tries to provide a basis for determining closeness in meaning among words, taking as reference a structured hierarchical net. Conceptual distance between two concepts is defined in Rada et al. (1989) as the length of the shortest path that connects the concepts in a hierarchical semantic net. In a similar approach, Sussna (1993) employs the notion of conceptual distance between network nodes in order to improve precision during document indexing. Following these ideas, Agirre et al. (1994)
describe a new conceptual distance formula for the automatic spelling correction problem, and Rigau (1994), using this conceptual distance formula, presents a methodology to enrich dictionary senses with semantic tags extracted from WordNet.
The measure of conceptual distance among concepts we are looking for should be sensitive to:
- the length of the shortest path that connects the concepts involved;
- the depth in the hierarchy: concepts in a deeper part of the hierarchy should be ranked closer;
- the density of concepts in the hierarchy: concepts in a dense part of the hierarchy are relatively closer than those in a more sparse region;
- and the measure should be independent of the number of concepts we are measuring.
We have experimented with several formulas that follow the four criteria presented above. Currently, we are working with the Conceptual Density formula, which compares areas of sub-hierarchies.
Fig. 1: Senses of a word in WordNet

As an example of how Conceptual Density can help to disambiguate a word, in Figure 1 the word W has four senses and several context words (w1, w2, w3, w4, ...). Each sense of the words belongs to a sub-hierarchy of WordNet. The dots in the sub-hierarchies represent the senses of either the word to be disambiguated (W) or the words in the context. Conceptual Density will yield the highest density for the sub-hierarchy containing more of those senses, relative to the total number of senses in the sub-hierarchy. The sense of W contained in the sub-hierarchy with highest Conceptual Density will be chosen as the
sense disambiguating W in the given context. In Figure 1, sense2 would be chosen.
Given a concept c at the top of a sub-hierarchy, and given nhyp and h (mean number of hyponyms per node and height of the sub-hierarchy, respectively), the Conceptual Density for c when its sub-hierarchy contains a number m (marks) of senses of the words to disambiguate is given by the formula below:

CD(c, m) = (sum_{i=0..m-1} nhyp^i) / descendants_c    (1)

The numerator expresses the expected area for a sub-hierarchy containing m marks (senses of the words to be disambiguated), while the divisor is the actual area; that is, the formula gives the ratio between weighted marks below c and the number of descendant senses of concept c. In this way, formula 1 captures the relation between the weighted marks in the sub-hierarchy and the total area of the sub-hierarchy below c. The weight given to the marks tries to express that the height and the number of marks should be proportional. nhyp is computed for each concept in WordNet in such a way as to satisfy equation 2, which expresses the relation among height, averaged number of hyponyms of each sense and total number of senses in a sub-hierarchy if it were homogeneous and regular:

descendants_c = sum_{i=0..h-1} nhyp^i    (2)

Thus, if we had a concept c with a sub-hierarchy of height 5 and 31 descendants, equation 2 would give nhyp = 2 for c. Conceptual Density weights the number of senses of the words to be disambiguated in order to make the density equal to 1 when the number m of senses below c is equal to the height of the hierarchy h, to make the density smaller than 1 if m is smaller than h, and to make the density bigger than 1 whenever m is bigger than h. The density can be kept constant for different m's provided a certain proportion between the number of marks m and the height h of the sub-hierarchy is maintained. Both hierarchies A and B in Figure 2, for instance, have Conceptual Density 1 (this follows from formulas 1 and 2).
In order to tune the Conceptual Density formula, we have made several experiments adding two parameters, α and β. The α parameter modifies the
strength of the exponential i in the numerator, because h ranges between 1 and 16 (the maximum number of levels in WordNet) while m ranges between 1 and the total number of senses in WordNet.

Fig. 2: Two hierarchies with CD = 1

Adding a constant β to nhyp, we tried to discover the role of the averaged number of hyponyms per concept. Formula 3 shows the resulting formula:

CD(c, m) = (sum_{i=0..m-1} (nhyp + β)^(i^α)) / descendants_c    (3)

After an extended number of runs which were automatically checked, the results showed that β does not affect the behaviour of the formula, a strong indication that the formula is not sensitive to constant variations in the number of hyponyms. On the contrary, different values of α affect the performance consistently, yielding the best results in those experiments with α near 0.20. The actual formula which was used in the experiments was thus the following:

CD(c, m) = (sum_{i=0..m-1} nhyp^(i^0.20)) / descendants_c    (4)
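For concreteness, the density computation can be written down directly. The sketch below is only an illustration that takes the per-concept statistics (nhyp and the number of descendant senses) as given; it makes no claim about the authors' implementation.

```python
def conceptual_density(nhyp, descendants, m, alpha=0.20):
    """CD(c, m) = sum_{i=0..m-1} nhyp**(i**alpha) / descendants_c  (formula 4);
    with alpha = 1 this reduces to the basic formula 1."""
    if m <= 0 or descendants <= 0:
        return 0.0
    return sum(nhyp ** (i ** alpha) for i in range(m)) / descendants

# With the basic formula (alpha = 1), a regular sub-hierarchy of height 5,
# nhyp = 2 and 31 descendant senses gets density exactly 1 when m = h = 5.
print(conceptual_density(nhyp=2.0, descendants=31, m=5, alpha=1.0))   # 1.0
```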
4  The disambiguation algorithm using conceptual density
Given a window size, the program moves the window one word at a time from the beginning of the document towards its end, disambiguating in each step the word in the middle of the window and considering the other words in the window as context. The algorithm to disambiguate a given word w in the middle of a window of words W roughly proceeds as follows. First, the algorithm represents in a lattice the nouns present in the window, their senses and hypernyms (step 1). Then, the program computes the Conceptual Density of each concept in WordNet according to the senses it contains in its sub-hierarchy (step 2). It selects the concept c with highest density (step 3) and selects the senses
below it as the correct senses for the respective words (step 4). If a word from W:
- has a single sense under c, it has already been disambiguated;
- has no such sense, it is still ambiguous;
- has more than one such sense, we can eliminate all the other senses of w, but have not yet completely disambiguated w.
The algorithm then proceeds to compute the density for the remaining senses in the lattice, and continues to disambiguate the words in W (back to steps 2, 3 and 4). When no further disambiguation is possible, the senses left for w are processed and the result is presented (step 5).

To illustrate the process, consider the text in Figure 3, extracted from SemCor.

The jury(2) praised the administration(3) and operation(8) of the Atlanta Police_Department(1), the Fulton_Tax_Commissioner_'s_Office, the Bellwood and Alpharetta prison_farms(1), Grady_Hospital and the Fulton_Health_Department.
Fig. 3: Sample sentence from SemCor

The underlined words are nouns represented in WordNet, with the number of senses between brackets. The noun to be disambiguated in our example is operation, and a window size of five will be used. Each step goes as follows:

Step 1. Figure 4 shows part of the lattice for the example sentence. As prison_farm appears in a different hierarchy, we do not show it in the figure. The concepts in WordNet are represented as lists of synonyms. Word senses to be disambiguated are shown in bold. Underlined concepts are those selected with highest Conceptual Density. Monosemous nouns have sense number 0.

Step 2. administrative_unit, for instance, has underneath it 3 senses to be disambiguated and a sub-hierarchy size of 96, and therefore gets a Conceptual Density of 0.256. Meanwhile, group, grouping, with 2 senses and a sub-hierarchy size of 86, gets 0.062.

Step 3. administrative_unit, being the concept with highest Conceptual Density, is selected.

Step 4. In the example, operation_3, police_department_0 and jury_1 are the senses chosen for operation, Police_Department and jury. All the other concepts below the selected concept are marked so that they are no longer selected. Other senses of those words are deleted from the lattice, e.g., jury_2. In the next loop of the algorithm, group, grouping will have only one disambiguation word below it, and therefore its density will be much lower.
police_department_0
local department, department of local government
government department
department
jury_1, panel
committee, commission
operation_3, function
division
administrative_unit
unit
organisation
social group
people
group
administration_1, governance ...
jury_2
body
people
group, grouping
Fig. 4: Partial lattice for the sample sentence

At this point the algorithm detects that further disambiguation is not possible, and quits the loop.

Step 5. The algorithm has disambiguated operation_3, police_department_0, jury_1 and prison_farm_0 (the latter because this word is monosemous in WordNet), but the word administration is still ambiguous. The output of the algorithm, thus, will be that the sense for operation in this context, i.e., for this window, is operation_3. The disambiguation window will then move rightwards, and the algorithm will try to disambiguate Police_Department, taking as context administration, operation, prison_farms and whichever noun comes first in the next sentence.

The disambiguation algorithm has an intermediate outcome between completely disambiguating a word and failing to do so: in some cases the algorithm returns several possible senses for a word. In this experiment we treat these cases as failures to disambiguate.
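The control flow of this section can be illustrated with a small, self-contained sketch. Everything below is our own toy rendering, not the authors' code: the taxonomy is a plain child-to-parent dictionary, nhyp is taken as a constant instead of being computed per concept with equation 2, and only a single pass of steps 1-4 for one target word is shown (the paper's algorithm loops until no further disambiguation is possible and then moves the window).

```python
from collections import defaultdict

def hypernym_chain(node, parent):
    """All ancestors of a node in the taxonomy, nearest first."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def subtree_sizes(parent):
    """Number of descendant nodes of every concept (the 'actual area' of formula 1)."""
    size = defaultdict(int)
    for node in parent:
        for anc in hypernym_chain(node, parent):
            size[anc] += 1
    return size

def conceptual_density(nhyp, descendants, m, alpha=0.20):
    return sum(nhyp ** (i ** alpha) for i in range(m)) / descendants

def disambiguate(word, context, senses_of, parent, nhyp=2.0):
    """One pass of steps 1-4: pick the densest concept, keep the target's senses under it."""
    size = subtree_sizes(parent)
    marks = defaultdict(set)                    # concept -> context senses below it (step 1)
    for w in [word] + context:
        for s in senses_of.get(w, []):
            for anc in hypernym_chain(s, parent):
                marks[anc].add(s)
    if not marks:
        return senses_of.get(word, [])
    best = max(marks, key=lambda c: conceptual_density(nhyp, size[c], len(marks[c])))  # steps 2-3
    return [s for s in senses_of.get(word, [])
            if best in hypernym_chain(s, parent)]                                      # step 4
```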
5  The experiment
We selected one text from SemCor at random: br-a01, from the genre "Press: Reportage". This text is 2079 words long and contains 564 nouns. Out of these, 100 were not found in WordNet. Of the 464 nouns in WordNet, 149 are monosemous (32%).
<s>
<wd>jury<sn>[noun.group.0]NN
<wd>administration<sn>[noun.act.0]NN
<wd>operation<sn>[noun.state.0]NN
<wd>Police_Department<sn>[noun.group.0]NN
<wd>prison_farms<mwd>prison_farm<msn>[noun.artifact.0]NN

Fig. 5: SemCor format

jury   administration   operation   Police_Department   prison_farm

Fig. 6: Input words

The text plays both the role of input file (without semantic tags) and of (tagged) test file. When it is treated as input file, we throw away all non-noun words, leaving only the lemmas of the nouns present in WordNet. The program does not face syntactic ambiguity, as the disambiguated part-of-speech information is in the input file. Multiple-word entries are also available in the input file, as long as they are present in WordNet. Proper nouns receive a similar treatment: we only consider those that can be found in WordNet.

Figure 5 shows the way the algorithm would input the example sentence in Figure 3 after stripping non-noun words. After erasing the irrelevant information we get the words shown in Figure 6.⁴ The algorithm then produces a file with sense tags that can be compared automatically with the original file (cf. Figure 5).

Deciding the optimum context size for disambiguating using Conceptual Density is an important issue. One could assume that the more context there is, the better the disambiguation results would be. Our experiment shows that precision⁵ increases for bigger windows, until it reaches window size 15, where it stabilises and then starts decreasing for sizes bigger than 25 (cf. Figure 7). Coverage over polysemous nouns behaves similarly, but with a more significant improvement. It tends to reach its maximum, over 80%, decreasing for window sizes bigger than 20. Precision is given in terms of polysemous nouns only. The graphs are drawn against the size of the context⁶ that was taken into account when disambiguating.

⁴ Note that we already have the knowledge that police department and prison farm are compound nouns, and that the lemma of prison farms is prison farm.
⁵ Precision is defined as the ratio between correctly disambiguated senses and the total number of answered senses. Coverage is given by the ratio between the total number of answered senses and the total number of senses.
⁶ Context size is given in terms of nouns.

Figure 7 also shows the guessing baseline, obtained by selecting senses at random. First, it was calculated analytically using the polysemy counts for
the file, which gave a precision of 30%. This result was checked experimentally by running the random-selection procedure ten times over the file, which confirmed the previous result.

We also compare the performance of our algorithm with that of the 'most frequent' heuristic. The frequency counts for each sense were collected using the rest of SemCor, and then applied to the text. While its precision is similar to that of our algorithm, its coverage is nearly 10% worse.

All the data for the best window size can be seen in Table 1. The precision and coverage shown in the preceding graph were for polysemous nouns only. If we also include monosemous nouns, precision rises from 47.3% to 66.4%, and coverage increases from 83.2% to 88.6%.

Fig. 7: Precision and coverage

% (w=25)     polysemic    overall
Cover.       83.2         88.6
Prec.        47.3         66.4
Recall       39.4         58.8

Table 1: Overall data for the best window size
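The three figures in each column of Table 1 are related in a simple way: recall is precision times coverage over the same set of senses. A few lines are enough to check this from the definitions in footnotes 5 and 6 (the counts below are placeholders; only the ratios matter):

```python
def scores(correct, answered, total):
    """Precision, coverage and recall as defined in footnotes 5-6."""
    precision = correct / answered
    coverage = answered / total
    recall = correct / total          # equals precision * coverage
    return precision, coverage, recall

# Polysemous-noun row of Table 1 (w=25):
# precision 0.473, coverage 0.832  ->  recall = 0.473 * 0.832 = 0.394 (39.4%)
# Overall row: 0.664 * 0.886 = 0.588 (58.8%)
```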
6  Conclusions
The automatic method for the disambiguation of nouns presented in this paper can be readily used on free-running text in any general domain, given part-of-speech tags. It does not need any training and uses word sense tags from WordNet, an extensively used lexical database. The algorithm is theoretically motivated and founded, and offers a general measure of the
semantic relatedness for any number of nouns in a text. In the experiment, the algorithm disambiguated one text (2079 words long) of SemCor, a subset of the Brown corpus. The results were obtained by automatically comparing the tags in SemCor with those computed by the algorithm, which allows a comparison with other disambiguation methods. The results are promising, considering the difficulty of the task (free-running text, large number of senses per word in WordNet) and the lack of any discourse structure of the texts.

More extensive experiments on additional SemCor texts, including among others the use of meronymic links, testing of homograph-level disambiguation and direct comparison with other approaches, are reported in Agirre et al. (1996). This methodology has also been used for disambiguating nominal entries of bilingual MRDs against WordNet (Rigau & Agirre 1995).

Acknowledgements. We wish to thank all the staff of the CRL and especially Jim Cowie, Joe Guthrie, Louise Guthrie and David Farwell. We would also like to thank Ander Murua for mathematical assistance, Xabier Arregi, Jose Mari Arriola, Xabier Artola, Arantxa Diaz de Ilarraza, Kepa Sarasola and Aitor Soroa from the Computer Science Department of EHU, and Francesc Ribas, Horacio Rodriguez and Alicia Ageno from the Computer Science Department of UPC.

REFERENCES

Agirre, Eneko, Xabier Arregi, Arantza Diaz de Ilarraza & Kepa Sarasola. 1994. "Conceptual Distance and Automatic Spelling Correction". Workshop on Speech Recognition and Handwriting, 1-8. Leeds, U.K.
& German Rigau. 1996. An Experiment in Word Sense Disambiguation of the Brown Corpus Using WordNet. Technical Report (MCCS-96-291). Las Cruces, New Mexico: Computing Research Laboratory, New Mexico State University.
Cowie, Jim, Joe Guthrie & Louise Guthrie. 1992. "Lexical Disambiguation Using Simulated Annealing". Proceedings of the DARPA Workshop on Speech and Natural Language, 238-242.
Francis, Nelson & Henry Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Boston, Mass.: Houghton-Mifflin.
Guthrie, Louise, Joe Guthrie & Jim Cowie. 1993. Resolving Lexical Ambiguity. Technical Report (MCCS-93-260). Las Cruces, New Mexico: Computing Research Laboratory, New Mexico State University.
Lesk, Michael. 1986. "Automatic Sense Disambiguation: How to Tell a Pine Cone from an Ice Cream Cone". Proceedings of the 1986 SIGDOC Conference, Association for Computing Machinery, 24-26.
McRoy, Susan W. 1992. "Using Multiple Knowledge Sources for Word Sense Discrimination". Computational Linguistics 18:1.1-30.
Miller, George A. 1990. "Five Papers on WordNet". Special Issue of the International Journal of Lexicography 3:4.
& Daniel A. Teibel. 1991. "A Proposal for Lexical Disambiguation". Proceedings of the DARPA Workshop on Speech and Natural Language, 395-399.
, Claudia Leacock, Randee Tengi & Ross T. Bunker. 1993. "A Semantic Concordance". Proceedings of the DARPA Workshop on Human Language Technology, 303-308.
Rada, Roy, Hafedh Mili, Ellen Bicknell & Maria Blettner. 1989. "Development and Application of a Metric on Semantic Nets". IEEE Transactions on Systems, Man and Cybernetics 19:1.17-30.
Richardson, Ray, Allan F. Smeaton & John Murphy. 1994. Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. Technical Report (CA-1294). Dublin, Ireland: School of Computer Applications, Dublin City University.
Rigau, German. 1995. "An Experiment on Semantic Tagging of Dictionary Definitions". Workshop "The Future of the Dictionary". Uriage-les-Bains, France.
& Eneko Agirre. 1995. "Disambiguating Bilingual Nominal Entries against WordNet". Proceedings of the Computational Lexicon Workshop, 7th European Summer School in Logic, Language and Information, 71-82. Barcelona, Spain.
Sussna, Michael. 1993. "Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network". Proceedings of the 2nd International Conference on Information and Knowledge Management, 67-74. Arlington, Virginia, U.S.A.
Voorhees, Ellen. 1993. "Using WordNet to Disambiguate Word Senses for Text Retrieval". Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 171-180.
Wilks, Yorick et al. 1993. "Providing Machine Tractable Dictionary Tools". Semantics and the Lexicon ed. by James Pustejovsky, 341-401.
Yarowsky, David. 1992. "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora". Proceedings of the ARPA Workshop on Human Language Technology, 266-271.
An Episodic Memory for Understanding and Learning

OLIVIER FERRET* & BRIGITTE GRAU* **
*LIMSI-CNRS
**IIE-CNAM
Abstract

In this article we examine the incorporation of pragmatic knowledge learning in natural language understanding systems. We argue that this kind of learning can and should be done incrementally. In order to do so we present a model that is able simultaneously to build a case library and to prepare the abstraction of schemata which represent general situations. Learning takes place on the basis of narratives whose representations are collected in an episodic memory.
1  Introduction
Text understanding requires pragmatic knowledge about stereotypical situations. One must go beyond the information given so that inferences can be performed to make explicit the links between utterances. By determining the relations between individual utterances, the global representation of the entire text can be computed. Unless one is dealing with specific domains, it is not reasonable to assume that a system has a priori all the information needed. In most cases texts are made of known and unknown bits and pieces of information. Text analysis is therefore best viewed as a complex process in which understanding and learning take place, and which must improve itself (Schank 1982).

Methods of reasoning that are exclusively analytic are no longer sufficient to assure the understanding of texts, as these typically include new situations. Hence alternatives such as synthetic and analogical reasoning, which use more contextualised knowledge, are also needed. Thus, a memory model dedicated to general knowledge must be extended with an episodic component that organises specific situations, and must be able to take into account the constraints coming from combining the understanding and learning processes.

In the domain of learning pragmatic knowledge from texts, the shortcomings of one-dimensional approaches such as Similarity-Based Learning — IPP (Lebowitz 1983) — or Explanation-Based Learning — GENESIS (Mooney & DeJong 1985) — have become apparent and have given way to a multistrategy approach. OCCAM (Pazzani 1988) is an attempt in this
direction as it uses Similarity-Based Learning techniques in order to complete a domain theory for an Explanation-Based Learning process. Despite their differences, all these approaches share the same goal or means: each new causal representation constructed by the system is generalised as soon as possible in order to classify it on the basis of the system's background knowledge.

However, learning is not an all-or-nothing process. We follow Vygotsky's (Vygotsky 1962) views on learning, namely that learning is an incremental process whereby general knowledge is abstracted on the basis of cumulative, successive experiences (in our case, the representations of texts). In this perspective, generalisations should not occur every time a new situation is encountered. Rather, we suggest storing them in a buffer, the episodic memory, where the abstraction takes place at a later stage. The result of this abstraction process is a graph of schemata, akin to the MOPs introduced by Schank (Schank 1982).

Before we became interested in this topic, other researchers made proposals. Case-Based Reasoning (CBR) systems such as SWALE (Schank & Leake 1989) and AQUA (Ram 1993) have been designed in order to exploit the kind of representations we are talking about. However, these systems start out with a lot of knowledge. They do not model the incremental aspect we are proposing, that is, that an abstraction must be performed only when sufficiently reinforced information has been accumulated. Furthermore, the memory structure of these systems is fixed a priori. Thus, the criteria for determining whether a case can be considered as representative cannot be dynamically determined. Despite these shortcomings, CBR systems remain a very good model in the context of learning and must be taken into account when specifying a dynamic episodic memory.
2  Structure of the episodic memory
2.1  Text representation
Before examining the structure of the episodic memory, we will consider the form of its basic component: the text representations. In our case these representations come from short narratives such as the following.

A few years ago, [I was in a department store in Harlem] (1) [with a few hundred people around me] (2). [I was signing copies of my book "Stride toward Freedom"] (3) [which relates the boycott of buses in Montgomery in 1955-56] (4). Suddenly, while [I was appending my signature to a page] (5), [I felt a pointed thing sinking brutally into my chest] (6). [I had just been stabbed with a paper knife by
a woman] (7) [who was acknowledged as mad afterwards] (8). [I was taken immediately to the Harlem Hospital] (9) [where I stayed on a bed during long hours] (10) while [many preparations were made] (11) [in order to remove the weapon from my body] (12).

Revolution Non-Violente by Martin Luther King (based on a French version of the original text)
The texts' underlying meanings are expressed in terms of conceptual graphs (Sowa 1984). The clauses are organised according to the situations mentioned in the texts (see Figure 1¹). Hence, each of these situations (a dedication meeting in a department store, a murder attempt and a stay in hospital in our example) corresponds to a Thematic Unit (TU).
Fig. 1: The representation of the text about Martin Luther King

A text representation, which we call an episode, is a structured set of TUs which are thematically linked in either one of two ways:
• thematic deviation: this relation means that a part of a situation is elaborated. In our example, the hospital situation is considered to be a deviation from the murder attempt because these two situations are thematically related to Martin Luther King's wound. More precisely, a deviation is attached to one of the graphs of a TU. Here, the Hospital TU is connected to the turning graph (9) expressing that Martin Luther King is taken to the hospital.
• thematic shift: this relation characterises the introduction of a new situation. In the example below, there is a thematic shift between the dedication meeting situation and the murder attempt one because they are not intrinsically tied together, fortunately for the book writers.
Among all the TUs of an episode, at least one has the status of being the main topic (MT). In the Martin Luther King text, the Murder Attempt TU plays this role. More generally, a main topic is determined by applying heuristics based on the type of the links between the TUs (Grau 1984).
TUs have a structure. Depending on the aspect of the situation they describe, graphs are distributed among three slots:
¹ Propositions 6 and 7, and also 3 and 5, are joined together in one conceptual graph. This is possible through the definition graphs associated with the concept types.
• circumstances (C): states under which the situation occurs;
• description (D): actions which characterise the situation;
• outcomes (O): states resulting from the situation.
A TU is valid only if its description slot is not empty. Nevertheless, as shown in the example below, certain slots may remain empty if the corresponding information is not present in the text. Inside the description slot, graphs may be linked by temporal and causal relations. For example, in the Hospital TU graphically represented in Figure 1, graphs (10) and (11) are causally tied to graph (12).

Text representations have so far been built manually. However, preliminary studies show that this analysis could be done automatically without using any particular abstract schemata. A CBR mechanism using both text representations and linguistic clues (such as connectives, temporal markers or other cohesive devices) is under study.
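As a concrete, purely illustrative reading of this section, the structures can be sketched as plain Python data classes. The field names are ours, and the conceptual-graph type is left abstract:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Graph = dict   # stand-in for a conceptual graph (Sowa 1984)

@dataclass
class ThematicUnit:
    circumstances: List[Graph] = field(default_factory=list)   # states under which the situation occurs
    description:   List[Graph] = field(default_factory=list)   # actions characterising the situation
    outcomes:      List[Graph] = field(default_factory=list)   # states resulting from the situation

    def is_valid(self) -> bool:
        # a TU is valid only if its description slot is not empty
        return bool(self.description)

@dataclass
class Episode:
    units: List[ThematicUnit]
    links: List[Tuple[int, int, str]]   # (from TU, to TU, 'deviation' or 'shift')
    main_topics: List[int]              # indices of the main-topic TUs
```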
2.2  The episodic memory
The structure of the episodic memory is governed by one major principle: all similar elements are stored in the same structure. As a result, accumulation occurs and implicit generalisations are made by reinforcing the recurrent features of the episodes or the situations. This principle is applied to the episodes and the TUs, and the memory is organised by storing this information accordingly. That is, similar episodes and similar TUs are grouped so as to build aggregated episodes in one case and aggregated TUs in the other.

We show an example of the memory in Figure 2. Episode 1 and episode 2, which talk about the same topic, a murder attempt with a knife, have been grouped together in one aggregated episode. In this episode, the TUs that describe the murder attempt more specifically have been gathered in the same aggregated TU. It should be noted that TUs coming from different episodes can still be grouped in the same aggregated TU without being their episodes' main topic (see the Scuffle TU or the Speech TU in Figure 2).

The principle of aggregation is not applied at the memory scale for smaller elements such as concepts or graphs. Aggregated graphs exist in the memory, but their scope is limited to the slot of the aggregated TU containing them. An aggregated graph gathers only those similar graphs that belong to the same slot of similar TUs coming from different episodes. Similarly, an aggregated concept makes no sense in isolation from the aggregated graph of which it is part; hence, it cannot be found in another graph.
It is in fact the product of a generalisation applied to concepts which resemble each other in the context of graphs which are also considered to be similar. This explains why the accumulation process can be viewed as the first step of a generalisation process.
Fig. 2: The episodic memory

For instance, in the aggregated graph (a) of the description slot below (see Figure 3), Stab has Man for agent, because the type Man is the result of the aggregation of the more specific types Soldier and Young-man. On the other hand, we have no aggregated concept for recipient because the aggregation was unsuccessful for Arm and Stomach.

The accumulation process has been designed in such a way as to make apparent the most relevant features of the situations by reinforcing them. This is done by storing similar elements in the same structure and by assigning them a weight. This weight quantifies the degree of recurrence of an element.

Figure 3 shows these weights for aggregated graphs and aggregated concepts. These weights characterise the relative importance of aggregated graphs with regard to the aggregated TU and the relative importance of aggregated concepts with regard to the aggregated graph. This principle of cumulation holds also for the relations between the entities. This is shown in Figure 3 for the case relations in the aggregated graphs. In a description slot, temporal and causal relations coming from different episodes are also aggregated, and similarly for the thematic relations between the TUs of an episode.

This example illustrates not only the accumulative dimension of our memory model but also its potential for being a case library. Even though aggregated concepts are generalisations, they still maintain a link to the
Circumstances
(a) [Located] (0.5)
    (experiencer) (1.0) -> [event] (1.0);         (experiencer) [1]: event [1]
    (location) (1.0) -> [airport] (1.0);          (location) [1]: airport [1]
(b) [Quarrel] (0.5)
    (agent) (1.0) -> [young-man] (1.0);           (agent) [2]: young-man [2]
    (object) (1.0) -> [money] (1.0);              (object) [2]: money [2]
    (accomp.) (1.0) -> [young-man] (1.0);         (accomp.) [2]: young-man [2]

Description
(a) [Stab] (1.0)
    (agent) (1.0) -> [man] (1.0);                 (agent) [1,2]: soldier [1], young-man [2]
    (recipient) (1.0) -> [ ];                     (recipient) [1,2]: arm (0.5) [1], stomach (0.5) [2]
    (part) (1.0) -> [man] (1.0);                  (part) [1,2]: head-of-state [1], young-man [2]
    (instrument) (1.0) -> [knife] (1.0);          (instrument) [1,2]: bayonet [1], flick knife [2]
(b) [Arrest] (1.0)
    (agent) (1.0) -> [human] (1.0);               (agent) [1,2]: policeman [1], human [2]
    (object) (1.0) -> [man] (1.0);                (object) [1,2]: soldier [1], young-man [2]
(c) [Attack] (0.5)
    (agent) (1.0) -> [soldier] (1.0);             (agent) [1]: soldier [1]
    (object) (1.0) -> [head-of-state] (1.0);      (object) [1]: head-of-state [1]
    (manner) (1.0) -> [suddenly] (1.0);           (manner) [1]: suddenly [1]
(d) [Stumble] (0.5)
    (agent) (1.0) -> [soldier] (1.0);             (agent) [1]: soldier [1]
(e) [Hit] (0.5)
    (agent) (1.0) -> [young-man] (1.0);           (agent) [2]: young-man [2]
    (recipient) (1.0) -> [young-man] (1.0);       (recipient) [2]: young-man [2]

Outcomes
(a) [Located] (1.0)
    (experiencer) (1.0) -> [man] (1.0);           (experiencer) [1,2]: soldier [1], young-man [2]
    (location) (1.0) -> [prison] (1.0);           (location) [1,2]: prison [1,2]
(b) [Wounded] (0.5)
    (experiencer) (1.0) -> [head-of-state] (1.0); (experiencer) [1]: head-of-state [1]
    (manner) (1.0) -> [light] (1.0);              (manner) [1]: light [1]
(c) [Dead] (0.5)
    (experiencer) (1.0) -> [young-man] (1.0);     (experiencer) [2]: young-man [2]
[Stab]: predicate of an aggregated graph. (1.0) : weight value, [man] : aggregated concept, (agent) : aggregated relation, soldier [1]: a concept, i.e. an instance, occurring in episode 1. It is linked to the aggregated concept above it. (recipient) [1,2]: a relation which occurs in episodes 1 and 2. It is linked to the aggregated relation above it.
Fig. 3: An aggregated TU (the Murder Attempt TU of Figure 2)

concepts from which they have been built.² Thus, following the references to the episodes, we know that the agent of the Stab predicate in episode 1 is a Soldier. Hence, a Case-Based Reasoner will be able to use this fact in order to exploit the specific situations stored in the aggregates and improve an automatic comprehension process. Such a reasoner could use the aggregated information and the specific information simultaneously. The former would be used to evaluate the relative importance of a piece of data, and the latter to reason more precisely on the basis of similarities and differences.

The multidimensional aspect of this model also has implications for the way information is retrieved from the memory when it is used as a case
² Unlike the aggregated concepts, concepts in texts, i.e., instances, may belong to several graphs and are therefore starting points for roles.
library. Unlike most CBR systems, the library here has a relatively flat structure: similar episodes and similar TUs are simply grouped together. Aggregated episodes can be considered as typical contexts for the aggregated TUs, which are the central elements, but there is no structural means (for instance, a hierarchical structure of relevant features) for searching for a case. This operation is achieved in an associative way by a spreading activation mechanism which works on all the different knowledge levels. The interaction between the concepts and the structures of the memory (aggregated episodes, aggregated TUs or schemata) leads to a stabilised activation configuration from which the cases with the highest activation level are selected. This process is akin to what Lange and Dyer (Lange & Dyer 1989) call evidential activation. In our case, the weights upon which the propagation is based are those that characterise an element's relative importance in our memory model.

This mechanism presents two major advantages from the search-phase point of view. First of all, no a priori indexing is necessary. This is useful in a learning situation where the setting is not stable. Secondly, a syntactic match is performed at the same time.
3  Episode matching and memorisation
When the building of the text's underlying meaning representation is completed, one, or possibly several, memorised episodes have been selected by the spreading activation mechanism. They are related to either the text's main situation, the main TU, or a secondary one. Matching episodes thus amounts to comparing memorised TUs with TUs of the text. In this section we examine under what conditions TUs are similar.
3.1  Similarity of TUs
The relative similarity between two TUs depends on the degree of their slot matching. We proceed in two steps. At first we compute two ratios obtained from the number of similar graphs, in relation to the number of graphs present in the memorised slot as well as to the number of graphs in the text slot. Thus, we first evaluate each slot as a whole by comparing these ratios with an interval of arbitrary thresholds [t1, t2] we have established. When the two ratios are under the lower limit, the similarity is rejected: neither the memorised slot nor the text slot contains a sufficient number of common points with regard to their differences. If one of these two ratios is above
the upper limit, the proportion of common points of one slot or the other is sufficient to consider the slots as highly similar. If both ratios happen to be within the interval, we conclude in favour of a moderate similarity that has to be evaluated by another, more precise method. In this case, we compute a score based on the importance of the graphs inside the slots. This computation is described in detail in the next section. When this score is above another given threshold t3, we conclude that there is a high similarity. Thus, two slots sharing an average number of graphs can be very similar if these graphs are important for the slot. The thresholds are parameters of the system. In the current version, t1 = 0.5, t2 = 0.8 and t3 = 0.7.

Finally, two TUs are similar if they correspond to any of the following rules:
R1: highly similar circumstances and moderately similar description;
R2: similar circumstances and similar outcomes, with at least one of the two dimensions highly similar;
R3: moderately similar description and highly similar outcomes;
R4: highly similar description.
3.2  Similarity of slots and similarity of graphs
The score of a slot is based on the scores of its similar graphs, weighted by their relative importance in the slot. We compute the score of two graphs only when they contain the same predicate and at least one similar concept related by an equivalent case relation. Two concepts are similar if the most specific abstraction of their types is less than the concept type of the canonical graph. By definition, the graphs we compare are derived from the same canonical graph and, for each relation, their concept types are restrictions of the same type inside this canonical graph. In the comparison of two concepts, if the aggregated one does not exist, the resulting type is the one which abstracts the maximum number of concept occurrences. Thus, the evaluation function of the similarity of two graphs containing the same predicate is the following:
    SimGraph(g, g') = ( \sum_{i} wc_i · SimConcept(c_i, c'_i) ) / ( \sum_{i} wc_i )

with SimConcept(c_i, c'_i) = 1 when the concepts are similar and 0 otherwise, where wc_i is the weight of the concept inside the memorised graph and the c_i are the concepts other than the predicate.
Two graphs, g and g', are similar if SimGraph(g, g') > 0. The weight wc_i is either the weight of the aggregated concept or the sum of the weights of the regrouped occurrences.

The following illustrates the computation of the similarity between the graph (a) of the description slot in Figure 3 and the graph of the Martin Luther King text which has the same predicate (it corresponds to clauses 6 and 7):

[Stab] - (agent) -> [woman]
         (recipient) -> [chest]
         (part) -> [man]
         (instrument) -> [paper-knife]
         (manner) -> [brutally]

SimGraph = (1.0 · SimConcept(man, woman) + 0.5 · SimConcept(chest, stomach or arm)
            + 1.0 · SimConcept(man, man) + 1.0 · SimConcept(knife, paper-knife)) / 3.5
         = (1.0 + 0.0 + 1.0 + 1.0) / 3.5 ≈ 0.86
We can now define the evaluation function of two identically named slots in the same spirit: the scores of the matched graphs are combined, weighted by wp_i, where wp_i is the weight of the aggregated predicate and only pairs with SimGraph(txtg_i, memg_i) > 0 are taken into account.
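The graph-level score can be written down directly from the worked example above; the slot-level score below simply reuses it as a weighted combination over the matched graphs, which is our reading of the definition rather than a formula taken from the paper. All names are illustrative:

```python
def sim_graph(mem_graph, txt_graph, similar):
    """Similarity of two graphs sharing the same predicate (weight-normalised overlap).

    mem_graph -- {relation: (weight, memorised concept)}
    txt_graph -- {relation: concept found in the text}
    similar   -- function deciding whether two concepts are similar
    """
    num = den = 0.0
    for rel, (w, mem_c) in mem_graph.items():
        den += w
        txt_c = txt_graph.get(rel)
        if txt_c is not None and similar(mem_c, txt_c):
            num += w
    return num / den if den else 0.0

def sim_slot(mem_slot, txt_slot, similar):
    """Slot score: graph scores weighted by the aggregated predicates' weights (our reading).

    mem_slot -- {predicate: (wp, mem_graph)}; txt_slot -- {predicate: txt_graph}.
    """
    num = den = 0.0
    for pred, (wp, mem_g) in mem_slot.items():
        if pred in txt_slot:
            s = sim_graph(mem_g, txt_slot[pred], similar)
            if s > 0:
                num += wp * s
                den += wp
    return num / den if den else 0.0

# On the Stab example above, sim_graph gives (1.0 + 0 + 1.0 + 1.0) / 3.5, i.e. about 0.86,
# with the concept-similarity judgements quoted in the text.
```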
The eventual presence of a chronological order between graphs in the description slots does not intervene in the similarity evaluation. We do not want to favour one unfolding of events over another, the various combinations having actually occurred in the original situations.

More generally, the way in which the similarity between structures is computed resembles Kolodner and Simpson's (Kolodner & Simpson 1989) method, with the computation of an aggregate match score. There are however two big differences: first of all, the similarity is context dependent, because the relative importance of any element is always evaluated within the context of the component to which it belongs. Second, this importance can change, since it is represented by the recurrence of the element and not by a hierarchy of features established on a priori grounds.

Because situations are not related in the same way, nor with the same level of precision, the structure of episodes may be different even if they deal with the same topic. For instance, a TU may be detailed by another TU in one episode and not in another one. Hence, graphs that could be matched may be found in two different TUs, as we can see in Figure 4. This peculiarity must be taken into account when we compare two slots. We do so by first recognising similar graphs in identically named slots; then we try to find the remaining graphs in the appropriate slots of an eventually
C: Circumstances D : Description O : Outcomes
memorized TUs: TU2 gives details concerning the circumstances of TU1
Fig. 4: Matching two different structures

detailed TU. For example, when examining the similarity of the circumstance slots of the text TU and TU1 in Figure 4, the remaining states (g2) are searched for either in the outcomes slot of an associated TU (TU2), or in the resulting states of the actions in its description slot. This process will be applied to the remaining graphs of the text and to those of the memorised TU. The difference of structure is bypassed during the computation of the similarity measure, but it will not be neglected during the aggregation process. In such cases, the aggregation of the first similar graphs will take place while the other similar graphs will be represented in their respective TUs. No strengthening of the structure between the concerned TUs will occur.
3.3  Memorisation of an episode: the aggregation process
The spreading activation process leads to the selection of memorised episodes, which are ordered according to their activation level. To decide whether one of these is a good candidate for an aggregation with the incoming episode, even if this aggregation is only a partial one, we have to find similar TUs between them. Episodes can be aggregated only if their principal TUs are similar. If this similarity is rejected, we are brought back to the sole aggregation of TUs, and the incoming episode leads to the creation of a new aggregated episode. Otherwise, the process continues in order to decide whether the topic structuring of the studied text is similar to the structuring of the held episode. If similar secondary TUs are found in the same relation network, their links will be reinforced accordingly. This last part of the process will be applied even if no match is found at the episode level. The reinforcement of such links means that a context more general than a single TU is recurrent.

Whatever level of matching is recognised, TUs are aggregated. In doing so, the graphs of the text TU are memorised according to the slot they belong to and to the result of the similarity process. If new predicates appear,
the corresponding graphs are added to the memorised slot with a weight equal to 1 divided by the number of times the TU has been aggregated. In the case of graphs which contain an existing predicate and whose similarity has been rejected, they are joined with no strengthening of the predicate. New concepts related to existing case relations are related to the corresponding aggregated concept. Existing aggregated concepts, which are the abstraction amalgamating the maximum number of occurrences, may be questioned when a new concept is added to a graph. If any of them no longer fulfils this definitional constraint, it is suppressed.

Pre-generalisation and reinforcement occur when the graphs are similar. As a result, the weight of the predicate increases. According to the results of the similarity process, aggregated concepts may evolve and become more abstract. The weights of the modified concepts inside the graphs are computed so that they are always equal to the number of times the concept has been strengthened, divided by the number of the predicate's aggregations. The result of the aggregation of the Stab graph (see 3.2) coming from the Martin Luther King text (episode 5) with the Stab aggregated graph of the Murder Attempt aggregated TU (see Figure 3) is shown below:
[human] (1.0), soldier[l] young-man[2] woman[5] ( i n s t r u m e n t ) ( 1 . 0 ) — > [knife](1.0), (instrument) [1,2,5] bayonet[l] flick knife[2] paper-knife[5]
4
(recipient) ( 1 . 0 ) — > (recipient) [1,2,5]
(part)(1.0)—> (part)[l,2,5]
[] — arm(0.33)[l] stomach(0.33)[2] chest (0.33) [5] [man](1.0) head-of-state[l] young-man [2] man[5]
Conclusion
Natural Language Understanding systems must be conceived in a learning perspective if they are not designed for a specific purpose. Within this approach, we argue that learning is an incremental process based on the memorisation of its past experiences. That is why we have focused our work on the elaboration and the implementation of an episodic memory that is able to account for progressive generalisations by aggregating similar situations and reinforcing recurrent structures. This memory model also constitutes a case library for analogical reasoning. It is characterised by the two levels of cases it provides. These cases give different sorts of information: on one hand, specific cases can be used as sources given their richness coming from the situations they represent. On another hand, the aggregated cases,
184
OLIVIER FERRET & BRIGITTE GRAU
being a more reliable source of knowledge, guide and validate the retrieval and the use of the specific cases. More generally, our approach prepares the induction of schemata and the selection of their general features, a step which is still necessary to stabilise and organise abstract knowledge. This approach also provides a robust model of learning insofar as it allows for a weak text understanding. Even misunderstandings resulting from an incomplete domain theory will be compensated on the basis of the treatment of lots of texts involving analogous subjects. REFERENCES Grau, Brigitte. 1984. "Stalking Coherence in the Topical Jungle". Proceedings of the 5th Generation Computer System (FGCS'84), Tokyo, Japan. Kolodner, Janet L. & R.L. Simpson. 1989. "The MEDIATOR: Analysis of an Early Case-Based Problem Solver". Cognitive Science 13:4.507-549. Lange, Trent E. & Michael G. Dyer. 1989. "High-level Inferencing in a Connectionist Network". Connection Science 1:2.181-217. Lebowitz, Michael. 1983. "Generalization from Natural Language Text". Cog nitive Science 7.1-40. Mooney, Raymond & Gerald De Jong. 1985. "Learning Schemata for Natural Language Processing". Proceedings of the 9th International Joint Conference on Artificial Intelligence (IJCAF85), Los Angeles, 681-687. Pazzani, Michael J. 1988. "Integrating Explanation-based and Empirical Learn ing Methods in OCCAM". Third European Working Session on Learning (EWSL'88) ed. by Derek Sleeman, 147-165. Ram, Ahswin. 1993. "Indexing, Elaboration and Refinement: Incremental Learn ing of Explanatory Cases". Machine Learning (Special Issue on Case-Based Reasoning) ed. by Janet L. Kolodner, 10:3.201-248. Schank, Roger C. 1982. Dynamic Memory: A Theory of Reminding and Learning in Computers and People. New York: Cambridge University Press. & David B. Leake. 1989. "Creativity and Learning in a Case-Based Ex plainer". Artificial Intelligence (Special Volume on Machine Learning) ed. by Jaime G. Carbonell, 40:1-3.353-385. Sowa, John F. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading: Addison Wesley. Vygotsky, Lev S. 1962. Thought and Language. Cambridge, Mass.: MIT Press.
Ambiguities & Ambiguity Labelling: Towards Ambiguity D a t a Bases CHRISTIAN BOITET* & MUTSUKO
*GETA, CLIPS, IMAG **ATR Interpreting
TOMOKIYO**
(UJF, CNRS & INPG) Telecommunications
Abstract This paper has been prepared in the context of the MID DIM project (ATR-CNRS). It introduces the concept of'ambiguity labelling', and proposes a precise text processor oriented format for labelling 'pieces' such as dialogues and texts. Several notions concerning ambiguities are made precise, and many examples are given. The ambiguities labelled are meant to be those which state-of-the-art speech analysers are believed not to be able to solve, and which would have to be solved interactively to produce the correct analysis. The proposed labelling has been specified with a view to store the labelled pieces in a data base, in order to estimate the frequency of various types of ambiguities, the importance to solve them in the envisaged contexts, the scope of disambiguation decisions, and the knowledge needed for disambiguation. A complete example is given. Finally, an equivalent data base oriented format is sketched. 1
Introduction
As has been argued in detail in (Boitet 1993; Boitet 1993; Boitet & LokenKim 1993), interactive disambiguation technology must be developed in the context of research towards practical Interpreting Telecommunications sys tems as well as high-quality multi-target text translation systems. In t h e case of speech translation, this is because the state of the art in the foresee able future is such t h a t a black box approach to spoken language analysis (speech recognition plus linguistic parsing) is likely to give a correct o u t p u t for no more t h a n 50 to 60% of the utterances ('Viterbi consistency'(Black, Garside & Leech 1993)) 1 , while users would presumably require an overall success rate of at least 90% to be able to use such systems at all. However, the same spoken language analysers may be able to produce 1
According to a study by Cohen & Oviatt, the combined success rate is bigger than the product of the individual success rates by about 10% in the middle range. Using a formula such as S2 = S1*S1 + (1-S1)*A with A=20%, we get:
186
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
sets of outputs containing the correct one in about 90% of the cases ('struc tural consistency' (Black, Garside & Leech 1993) ) 2 . In the remaining cases, the system would be unable to analyse the input, or no output would be correct. Interactive disambiguation by the users of the interpretation or translation systems is then seen as a practical way to reach the necessary success rate. It must be stressed that interactive disambiguation is not to be used to solve all ambiguities. On the contrary, as many ambiguities as pos sible should be reduced automatically. The remaining ones should be solved by interaction as far as practically possible. What is left would have to be reduced automatically again, by using preferences and defaults. In other words, this research is complementary to the research in auto matic disambiguation. Our stand is simply that, given the best automatic methods currently available, which use syntactic and semantic restrictions, limitations of lex icon and word senses by the generic task at hand, as well as prosodic and pragmatic cues, too many ambiguities will remain after automatic analysis, and the 'best' result will not be the correct one in too many cases. We suppose that the system will use a state-of-the-art language-based speech recogniser and multilevel analyser, producing syntactic, semantic and pragmatic information. We leave open two possibilities: • an expert system specialised in the task at hand may be available. • an expert human interpreter/translator may be called for help over the network. The questions we want to address in this context are the following: • what kinds of ambiguities (unsolvable by state-of- the-art speech ana lysers) are there in dialogues and texts to be handled by the envisaged systems ? • what are the possible methods of interactive disambiguation, for each ambiguity type? • how can a system determine whether it is important or not for the overall communication goal to disambiguate a given ambiguity?
2
SR of 1 component (S1) 40% 45% 50% 55% 60% SR of combination (S2) 28% 31% 35% 39% 44% S1 65% 70% 75% 80% 85% 90% 95% 100% S2 49% 55% 61% 68% 75% 83% 91% 100% 50~60% overall Viterbi consistency corresponds then to 65~75% individual success rate, which is already optimistic. According to the preceding table, this corresponds to a structural consistency of 95% for each component, which seems impossible to attain by strictly automatic means in practical applications involving general users.
AMBIGUITIES & AMBIGUITY LABELLING
187
• what kind of knowledge is necessary to solve a given ambiguity, or, in other word, whom should the system ask: the user, the interpreter, or the expert system, if any? • in a given dialogue or document, how far do solutions to ambiguities carry over: to the end of the piece, to a limited distance, or not at all? In order to answer these questions, it seems necessary to build a data base of ambiguities occurring in the intended contexts. In this report, we are not interested in any specific data base management software, but in the collection of data, that is, in 'ambiguity labelling'. First, we make more precise several notions, such as ambiguous repres entation, ambiguity, ambiguity kernel , ambiguity type, etc. Second, we specify the attributes and values used for manual labelling, and give a text processor oriented format. Third, we give a complete example of ambiguity labelling of a short dialogue, with comments. Finally, we define a data-base oriented exchange format. 2
A formal view of ambiguities
2.1 2.1.1
Levels and contexts of ambiguities Three levels of granularity for ambiguity labelling
First, we distinguish three levels of granularity for considering ambiguities. There is an ambiguity at the level of a dialogue (resp. a text) if it can be segmented in at least two different ways into turns (resp. paragraphs). We speak of ambiguity of segmentation into turns or into paragraphs. There is an ambiguity at the level of a turn (resp. a paragraph) if it can be segmented in at least two different ways into utterances (We use the term 'utterance' for dialogues and texts, to stress that the 'units of analysis' are not always sentences, but may be titles, interjections, etc.). We speak of ambiguity of\
segmentation into utterances. There is an ambiguity at the level of an utterance if it can be analysed in at least two different ways, whereby the analysis is performed in view of translation into one or several languages in the context of a certain generic task. There are various types of utterance-level ambiguities. Ambiguities of segmentation into paragraphs may occur in written texts, if, for example, there is a separation by a (new-line) character only, without or (paragraph). They are much more frequent and problematic in dialogues.
188
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
For example, in ATR's transcriptions of Wizard of Oz interpretations dia logues (Park, Loken-KIM, Mizunashi & Fais 1995), there are an agent (A), a client (C), and an interpreter (I). In many cases, there are two success ive turns of I, one in Japanese and one in English. Sometimes, there are even three in a row (ATR-ITL 1994: J-E-J-32, E-J-J-33). If I does not help the system by pressing a button, this ambiguity will force the system to do language identification every time there may be a change of language. There are also cases of two successive turns by C (ATR-ITL 1994: E-27), and even three by A (ATR-ITL 1994: J-52) and I (ATR-ITL 1994: J-E-J-55, E-E-J-80) or four (ATR-ITL 1994: I,E-J-E-J-99). Studying these ambigu ities is important for discourse analysis, which assumes a correct analysis in terms of turns. Also, if successive turns in the same language are collapsed, this may add ambiguities of segmentation into utterances, leading in turn to more utterance-level ambiguities. Ambiguities of segmentation into utterances are very frequent, and most annoying, as we assume that the analysers will work utterance by utterance, even if they have access to the result of processing of the preceding context. There are for instance several examples of "right |? now |? turn left...". Or (Park, Loken-KIM, Mizunashi & Fais 1995:50):"OK |? so go back and is this number three |? right there |? shall I wait here for the bus?". An utterance may be spoken or written, may be a sentence, a phrase, a sequence of words, syllables, etc. In the usual sense, there is an ambiguity in an utterance if there are at least two ways of understanding it. This, however, does not give us a precise criterion for defining ambiguities, and even less so for labelling them and storing them as objects in a data base. Because human understanding heavily depends on the context and the com municative situation, it is indeed a very common experience that something is ambiguous for one person and not for another. Hence, we say that an utterance is ambiguous if it has an ambiguousl representation in some formal representation system. We return to that later. 2.1.2
Task-derived limitations on utterance-level ambiguities
As far as utterance-level ambiguities are concerned, we will consider only those which we feel should be produced by any state-of-the-art analyser constrained by the task. For instance, we should not consider that "good morning" is ambiguous with "good mourning", in a conference registration task. It could be different in the case of funeral arrangements.
AMBIGUITIES & AMBIGUITY LABELLING
189
Because the analyser is supposed to be state-of-the-art, "help" should not give rise to the possible meaning "help oneself" in "can I help you". Know ledge of the valencies and semantic restrictions on arguments of the verb "help" should eliminate this possibility. In the same way, "Please state your phone number" should not be deemed ambiguous, as no complete analysis should allow "state" to be a noun, or "phone" to be a verb. That could be different in a context where "state" could be construed as a proper noun, "State", for example in a dialogue where the State Department is involved. However, we should consider as ambiguous such cases as: "Please state (N/V) office phone number" (ATR-ITL 1994:33), where "phone" as a verb could be eliminated on grammatical grounds, but not "state office phone" as a noun, with "number" as a verb in the imperative form. The case would of course be different if the transcription would contain prosodic marks, but the point would continue to hold in general. 2.1.3
Necessity to consider utterance-level ambiguities in the context of full utterances
Let us take another example. Consider the utterance: (1) Do you know where the international telephone services are located? The underlined fragment has an ambiguity of attachment, because it has two different 'skeletons' (Black, Garside & Leech 1993) representations: [ i n t e r n a t i o n a l telephone] services / i n t e r n a t i o n a l [telephone services] As a title, this sequence presents the same ambiguity. However, it is not enough to consider it in isolation. Take for example: (2) The international telephone services many countries. The ambiguity has disappeared! It is indeed frequent that an ambiguity relative to a fragment appears, disappears and reappears as one broadens its context in an utterance. For example, in (3) The international telephone services many countries have established are very reliable. the ambiguity has reappeared. From the examples above, we see that, in order to define properly what an ambiguity is, we must consider the fragment within an utterance, and clarify the idea that the fragment is the smallest (within the utterance) where the ambiguity can be observed.
190 2.2 2.2.1
CHRISTIAN BOITET & MUTSUKO TOMOKIYO Representation
systems
Types of formal representation systems
Classical representation systems are based on lists of binary features, flat or complex attribute structures (property lists), labeled or decorated trees, various types of feature-structures, graphs or networks, and logical formulae. What is an 'ambiguous representation'? This question is not as trivial as it seems, because it is often not clear what we exactly mean by 'the' rep resentation of an utterance. In the case of a classical context-free grammar G, shall we say that a representation of U is any tree T associated to U via G, or that it is the set of all such trees? Usually, linguists say that U has several representations with reference to G. But if we use f-structures with disjunctions, U will always have one (or zero!) associated structure S. Then, we would like to say that S is ambiguous if it contains at least one disjunction. Returning to G, we might then say that 'the' representation of U is a disjunction of trees T. In practice, however, developers prefer to use hybrid data structures to represent utterances. Trees decorated with various types of structures are very popular. For speech and language processing, lattices bearing such trees are also used, which means at least 3 levels at which a representation may be ambiguous. 2.2.2
Computable representations and 'reasonable' analysers
Now, we are still left with two questions: 1. which representation system(s) do we choose? 2. how do we determine the representation or representations of a par ticular utterance in a specific representation system? The answer to the first question is a practical one. The representation system(s) must be fine-grained enough to allow the intended operations. For instance, text-to-speech requires less detail than translation. On the other hand, it is counter-productive to make too many distinctions. For example, what is the use of defining a system of 1000 semantic features if no system and no lexicographers may assign them to terms in an efficient and reliable way? There is also a matter of taste and consensus. Although different representation systems may be formally equivalent, researchers and developers have their preferences. Finally, we should prefer representations amenable to efficient computer processing. As far as the second question is concerned, two aspects should be dis tinguished. First, the consensus on a representation system goes with a
AMBIGUITIES & AMBIGUITY LABELLING
191
consensus on its semantics. This means that people using a particular rep resentation system should develop guidelines enabling them to decide which representations an utterance should have, at each level, and to create them by hand if challenged to do so. Second, these guidelines should be refined to the point where they may be used to specify and implement a parser producing all and only the intended representations for any utterance in the intended domain of discourse. A 'computable' representation system is a representation system for which a 'reasonable' parser can be developed. A 'reasonable' parser is a parser such as: • its size and time complexity are tractable over the class of intended utterances; • if it is not yet completed, assumptions about its ultimate capabilities, especially about its disambiguation capabilities, are realistic given the state of the art. _J Suppose, then, that we have defined a computable representation. We may not have the resources to build an adequate parser for it, or the one we have built may not yet be adequate. In that case, given the fact that we are specifying what the parser should and could produce, we may anticipate and say that an utterance presents an ambiguity of such and such types. This only means that we expect that an adequate parser will produce an ambiguous representation for the utterance at the considered level. 2.2.3
Expectations for a system of manual labelling
Our manual labelling should be such that: • it is compatible with the representation systems used by the actual or intended analysers. • it is clear and simple enough for linguists to do the labelling in a reliable way and in a reasonable amount of time. Representation systems may concern one or several levels of linguistic ana lysis. We will hence say that an utterance is phonetically ambiguous if it has an ambiguous phonetic representation, or if the phonetic part of its de scription in a 'multilevel' representation system is ambiguous, and so forth for all the levels of linguistic analysis, from phonetic to orthographic, mor phological, morphosyntactic, syntagmatic, functional, logical, semantic, and pragmatic.
192
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
In the labelling, we should only be concerned with the final result of analysis, not in any intermediate stage, because we want to retain only ambiguities which would remain unsolved after the complete automatic analysis process has been performed. 2.3
Ambiguous representations
A representation will be said to be ambiguous if it is multiple or underspecified. 2.3.1
Proper representations
In all known representation systems, it is possible to define 'proper repres entations', extracted from the usual representations, and ambiguity-free. For example, if we represent "We read books" by the unique decorated dependency tree: [["We" .
((lex "I-Pro") (cat pronoun) (person i) (number plur)...)] " r e a d " ((lex "read-V") (cat verb) (person 1) (number plur) (tense (\{pres past\}))...) ["books" ((lex "book-N") (cat noun)...)]]
there would be 2 proper representations, one with (tense pres), and the other with (tense past). For defining the proper representations of a representation system, it is necessary to specify which disjunctions are exclusive, and which are inclus ive. Proper and multiple representations A representation in a formal representation system is proper if it contains no exclusive disjunction. The set of proper representations associated to a representation R, is obtained by expanding all exclusive disjunctions of R (and eliminating duplicates). It is denoted here by Proper(R). R is multiple if |Proper(R)| > 1. R is multiple if (and only if) it is not proper. 2.3.2
Underspecified representations
A proper representation P is underspecified if it is undefined with respect to some necessary information.
AMBIGUITIES & AMBIGUITY LABELLING
193
There are two cases: the information may be specified, but its value is unknown, or it is missing altogether. The first case often happens in the case of anaphoras: (ref ?), or in the case where some information has not been exactly computed, e.g. (task_domain ?), (decade.of .month ?), but is necessary for translating in at least one of the considered target languages. It is quite natural to consider this as ambiguous. For example, an ana phoric reference should be said to be ambiguous • if several possible referents appear in the representation, which will give rise to several proper representations, • and also if the referent is simply marked as unknown, which causes no disjunction. The second case may never occur in representations such as Ariane-G5 decorated trees, where all attributes are always present in each decoration. But, in a standard f- structure, there is no way to force the presence of an attribute, so that a necessary attribute may be missing: then, (ref ?) is equivalent to the absence of the attribute ref. For any formal representation system, then, we must specify what the 'necessary information' is. Contrary to what is needed for defining Proper(R), this may vary with the intended application. 2.3.3
Ambiguous representations
Our final definition is now simple to state. A representation R is ambiguous if it is multiple or if Proper(R) contains an underspecified P. 2.4 2.4.1
Scope, occurrence, kernel and type of ambiguity Informal presentation
Although we have said that ambiguities have to be considered in the context of the utterances, it is clear that a sequence like "international telephone services" is ambiguous in the same way in utterances (1) and (3) above. We will call this an 'ambiguity kernel', and reserve the term of 'ambiguities' for what we will label, that is, occurrences of ambiguities. The distinction is the same as that between dictionary words and text words. It also clear that another sequence, such as "important business ad dresses" , would present the same sort of ambiguity in analogous contexts. This we want to define as 'ambiguity type'. In this case, linguists speak of
194
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
'ambiguity of attachment', or 'structural ambiguity'. Other types concern the acceptions (word senses), the functions (syntactic or semantic), etc. Our list will be given with the specification of the labelling conventions. Ambiguity patterns are more specific kinds of ambiguity types, usable to trigger disambiguation actions, such as the production of a certain kind of disambiguating dialogue. For example, there may be various patterns of structural ambiguities. 2.4.2
Scope of an ambiguity
We take it for granted that, for each considered representation system, we know how to define, for each fragment V of an utterance U having a proper representation P, the part of P which represents V. For example, given a context-free grammar and an associated tree struc ture P for U, the part of P representing a substring V of U is the smallest sub-tree Q containing all leaves corresponding to V. Q is not necessarily the whole subtree of P rooted at the root of Q. Conversely, for each part Q of P, we suppose that we know how to define the fragment V of U represented by Q. a. Scope of an ambiguity of underspecification Let P be a proper representation of U. Q is a minimal underspecified parti of P if it does not contain any strictly smaller underspecified part Q'. Let P be a proper representation of U and Q be a minimal underspecified part of P. The scope of the ambiguity of underspecification exhibited by Q is the fragment V represented by Q. In the case of an anaphoric element, Q will presumably correspond to one word or term V. In the case of an indeterminacy of semantic relation (deep case), e.g. on some argument of a predicate, Q would correspond to a whole phrase V. b. Scope of an ambiguity of multiplicity A fragment V presents an ambiguity of multiplicity n (n>2) in an utter ance U if it has n different proper representations which are part of n or more proper representations of U. V is an ambiguity scope if it is minimal relative to that ambiguity. This means that any strictly smaller fragment W of U will have strictly less than n associated subrepresentations (at least two of the representations of V are be equal with respect to W).
AMBIGUITIES & AMBIGUITY LABELLING
195
In example (1) above, then, the fragment "the international telephone ser vices", together with the two skeleton representations the [international telephone] services / the international [telephone services]
is not minimal, because it and its two representations can be reduced to the subfragment "international telephone services" and its two representations (which are minimal). This leads us to consider that, in syntactic trees, the representation of a fragment is not necessarily a 'horizontally complete' subtree (diagram on the right).
Fig. 1: Caption for the figure In the case above, for example, we might have the configurations given in the figure below. In the first pair (constituent structures), "international telephone services" is represented by a complete subtree. In the second pair (dependency structures), the representing subtrees are not complete subtrees of the whole tree.
196 2.4.3
CHRISTIAN BOITET & MUTSUKO TOMOKIYO Occurrence and kernel of an ambiguity a. Ambiguity (occurrence)
An ambiguity occurrence, or simply ambiguity, A of multiplicity n (n>2) relative to a representation system R, may be formally defined as: A (U, V, (Pl,P2...Pm), (pl,p2...pn)), where m>n and: • U is a complete utterance, called the context of the ambiguity. • V is a fragment of U, usually, but not necessarily connex, the scope of the ambiguity. • Pl,P2...Pm are all proper representations of U in R, and pl,p2...pn are the parts of them which represent V. • For any fragment W of U strictly contained in V, if ql,q2 ... qn are the parts of pl,p2 ... pn corresponding to W, there is at least one pair qi,qj (i≠j) such that qi = qj. This may be illustrated by the following diagram, where we take the rep resentations to be tree structures represented by triangles (see Figure 2). Here, P2 and P3 have the same part p2 representing V, so that m > n.
Fig. 2: Caption for the figure b. Ambiguity kernel The kernel of an ambiguity A = (U, V, (P1, P2...Pm), (p1, p2...pn)) is the scope of A and its (proper) representations: K(A) = (V, (p1, p2...pn)). In a data base, it will be enough to store only the kernels, and references to the kernels from the utterances.
AMBIGUITIES & AMBIGUITY LABELLING 2.4.4
197
Ambiguity type and ambiguity pattern a. Ambiguity type
The type of A is the way in which the pi differ, and must be defined relative to each particular R. J If the representations are complex, the difference between 2 representations is defined recursively. For example, 2 decorated trees may differ in their geometry or not. If not, at least 2 corresponding nodes must differ in their decorations. Further refinements can be made only with respect to the intended in terpretation of the representations. For example, anaphoric references and syntactic functions may be coded by the same formal kind of attribute-value pairs, but linguists usually consider them as different ambiguity types. When we define ambiguity types, the linguistic intuition should be the main factor to consider, because it is the basis for any disambiguation method. For example, syntactic dependencies may be coded geometrically in one representation system, and with features in another, but disambigu ating questions should be the same. b. Ambiguity pattern An ambiguity pattern is a schema with variables which can be instantiated to a (usually unbounded) set of ambiguity kernels. Here is an ambiguity pattern of multiplicity 2 corresponding to the example above. NP[ x l NP[ x2 x3 ] ] , NP[ NP[ x l x2] x3 ] .
We don't elaborate, as ambiguity patterns are specific to a particular rep resentation system and a particular analyser. 3
Attributes and values used in manual labelling
The proposed text processor oriented format for ambiguity labelling is a first version, resulting from several attempts by the second author to label transcriptions or spoken and multimodal dialogues. We describe this format with the help of a classical context-free gram mar, written in the font used here for our examples, and insert comments and explanations in the usual font.
198 3.1
CHRISTIAN BOITET & MUTSUKO TOMOKIYO Top level (piece)
::= | ::= ::= 'LABELLED TEXT:' ::= ::= '"' "" ::= <paragraph> [<parag_sep> <paragraph>]* <paragraph> ::= [ ]* ::= 'II?' ::= ::= 'LABELLED DIALOGUE:' ::= ::= [ ]* ::= [ ]* ::= <speaker_code> ':'
This means that the labelling begins by listing the text or the transcrip tion of the dialogue, thereby indicating segmentation problems with the mark "||?". 3.2 3.2.1
Paragraph or turn level Structure of the list and associated separators
The labelling continues with the next level of granularity, paragraphs or turns. The difference is that a turn begins with a speaker's code. ::= + ::= <parag_text> I'PARAG' <parag_text> C'/PARAG'] <parag_text> ::= [ ]*
The mark PARAG must be used if there is more than one utterance. /PARAG is optional and should be inserted to close the list of utterances, that is if the next paragraph contains only one utterance and does not begin with PARAG. This kind of convention is inspired by SGML, and it might actually be a good idea in the future to write down this grammar in the SGML format.
::= [ ]* ::= '|?' ::= + ::= I'TURN5 ['/TURN']
AMBIGUITIES & AMBIGUITY LABELLING
199
We use the same convention for TURN and /TURN as for PARAG and /PARAG.
3.2.2
::= <speaker_code> ':' <parag_text>
Representation of ambiguities of segmentation
If there is an ambiguity of segmentation in paragraphs or turns, there may be more labelled paragraphs or turns than in the source. For example, A ||? B ||? C may give rise to A-B||C and A||B-C, and not to A-B-C and A||B||C. Which combinations are possible should be determined by the person doing the labelling. The same remark applies to utterances. Take one of the examples given at the beginning of this paper: OK |? so go back and is this number three |? right there |? shall I wait here for the bus?
This is an A | ? B | ? C |? D pattern, giving rise to 10 utterance possibilities. If the labeller considers only the 4 possibilities A|B|C-D, A|B|C|D, A|B-C|D, and A-B-C|D, the following 7 utterances will be labelled: A A-B-C B B-C C C-D D
3.3 3.3.1
OK OK so so go so go right right shall
go back and back and is back and is there there shall I wait here
is this number three right there this number three this number three right there I wait here for the bus? for the bus?
Utterance level Structure of the lists and associated separators
::= I ['UTTERANCES'] + ::=
(I-text) means 'indexed text': at the end of the scope of an ambiguity, we insert a reference to the corresponding ambiguity kernel, exactly as one inserts citation marks in a text. 3.3.2
Headers of ambiguity kernels
::= *
There may be no ambiguity in the utterance, hence the use of "*" instead of ".+ " as above.
200
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
::= ' ( ' ' ) ' ::= 'ambiguity' ['-' ] ::= ' - ' [ ' ] *
For example, a kernel header may be: "ambiguity EMMI10a-2'-5.1 ". This is ambiguity kernel number 2' in dialogue EMMI 10a, noted here EMMI 10a, and 5.1 is M. Tomokiyo's hierarchical code.
3.3.3
::=
Obligatory labels
::= <scope> \{<status> \}
By { A B C }, we mean any permutation of ABC : we don't insist that the labeller follows a specific order, only that the obligatory labels come first, with the scope as very first. a. Scope <scope> b. Status <status> <status_value>
::= '(scope' ' ) ' ::= '(status' <status_value> ' ) ' ::= 'expert_system'|'interpreter'I'user'
The status expresses the kind of supplementary knowledge needed to re liably solve the considered ambiguity. If 'expert_system' is given, and if a disambiguation strategy decides to solve this ambiguity interactively, it may ask: the expert system, if any; the interpreter, if any; or the user (speaker). If I is given, it means that an expert system of the generic task at hand could not be expected to solve the ambiguity. c. Importance ::= '(importance' ' ) ' ::= 'crucial' | 'important' | 'not-important' | 'negligible'
This expresses the impact of solving the ambiguity in the context of the intended task. An ambiguity of negation scope is often crucial, because it may lead to two opposed understanding, as in "A did not push B to annoy C" (did A push B or not?). An ambiguity of attachment is often only important, as the correspond ing meanings are not so different, and users may correct a wrong decision themselves. That is the case in the famous example "John saw Mary in the park with a telescope". From Japanese into English, although the number is very often am biguous, we may also very often consider it as 'not-important'. 'Negligible'
AMBIGUITIES & AMBIGUITY LABELLING
201
ambiguities don't really put obstacles to the communication. For example, "bus" in English may be "autobus" (intra-town bus) or "autocar" (intertown bus) in French, but either translation will almost always be perfectly understandable given the situation. d. Type ::= '(type' ' ) ' : := ('structure' | 'attachment') '(' <structure>+ ' ) ' I ('communication_act' | 'CA') '(' + ' ) ' | ('class' | 'cat') '(' <morpho_syntactic_class>+ ' ) ' | 'meaning' '(' <definition>+ ' ) ' | '(' + ' ) ' | 'reference' | 'address' '(' + ' ) ' | 'situation' <situation> | 'mode' <mode> | ...
The linguists may define more types. <structure>
::= '' ::= 'yes' | 'acknowledge' | 'yn-question' | 'inform' | 'confirmation-question'
<morpho_syntactic_class> <definition>
::= ::= ::= ::=
<defined_ref_value>
::= ::= ::=
<situation> <mode>
::= ::=
ι ...
3.3.4
'N' 1 'V' | 'Adj' | 'Adv' | ... | '(' (<defined_ref_value> | )+ ' )' '*somebody' | '*something' '*speaker' | '*hearer' | '*client' | '*agent' | '*interpreter' 'infinitive' | 'indicative' | 'conjunctive' | 'imperative' | 'gerund'
Other labels
Other labels are not obligatory. Their list is to be completed in the future as more ambiguity labelling is performed.
202
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
::= [ | <multimodality>...J* ::= 'definitive' I 'long_term' | 'short_term' | 'local' <multimodality> ::= 'multimodal' (<multimodal_help> I '(' <multimodal_help>+ ' ) ' <multimodal_help> ::= 'prosody' | 'pause' | 'pointing' | 'gesture' | 'facial_expression' |...
4
Conclusions
Although many studies on ambiguities have been published, the specific goal of studying ambiguities in the perspective of interactive disambiguation in automated text and speech translation systems has led us to explore some new ground and to propose the new concept of 'ambiguity labelling'. Several dialogues from EMMI-1(ATR-ITL 1994) and EMMI-2(Park &· Loken-KIM 1994) have already labelled (in Japanese and English). Attempts have also been made on French texts and dialogues. In the near future, we hope to refine our ambiguity labelling, and to label WOZ dialogues from EMMI3(Park, Loken-KIM, Mizunashi & Fais 1995). In parallel, the specification of MIDDIM-DB, a HyperCard based support for the ambiguity data base under construction, is being reshaped to implement the new notions intro duced here: ambiguity kernels, occurrences, and types. Acknowledgements. We are very grateful to Dr. Y. Yamazaki, president of ATR-ITL, Mr. T.Morimoto, head of Department 4, and Dr. Loken-Kim K-H., for their constant support to this project, which one of the projects funded by CNRS and ATR in the context of a memorandum of understanding on scientific cooperation. Thanks should also go to M. Axtmeyer, L.Fais and H.Blanchon, who have contributed to the study of ambiguities in real texts and dialogues, and to M.Kurihara, for his programming skills.
REFERENCES ATR-ITL. 1994. "Transcriptions of English Oral Dialogues Collected by ATRITL using EMMI (from TR-IT-0029, ATR-ITL)" ed. by GETA. EMMI re port. Grenoble & Kyoto. Axtmeyer, Monique. 1994. "Analysis of Ambiguities in a Written Abstract (MIDDIM project)". Internal Report. Grenoble, France: GETA, IMAG (UJF & CNRS).
AMBIGUITIES & AMBIGUITY LABELLING
203
Black, Ezra, R. Garside & G. Leech. 1993. Statistically-Driven Grammars of English: The IBM/ Lancaster Approach ed. by J. Aarts & W. Mejs, (= Language and Computers: Studies in Practical Linguistics, 8). Amsterdam: Rodopi. Blanchon, Hervé. 1993. "Report on a stay at ATR". Project Report (MIDDIM), Grenoble & Kyoto: GETA & ATR-ITL. 1994. "Perspectives of DBMT for Monolingual Authors on the Basis of LIDIA-1, an Implemented Mockup". Proceedings of 15th International Con ference on Computational Linguistics(COLING-94)', vol.1, 115-119. Kyoto, Japan. 1994. "Pattern-Based Approach to Interactive Disambiguation: First Definition and Experimentation". Technical Report 0073. Kyoto, Japan: ATR-ITL. Boitet, Christian. 1989. "Speech Synthesis and Dialogue Based Machine Trans lation". Proceedings of ATR Symposium on Basic Research for Telephone Interpretation, 22-22. Kyoto, Japan. & H. Blanchon. 1993. "Dialogue-based MT for Monolingual Authors and the LIDIA Project". Rapport de Recherche (RR-918-I). Grenoble: IMAG. GETA, UJF & CNRS. 1993. "Practical Speech Translation Systems will Integrate Human Expert ise, Multimodal Communication, and Interactive Disambiguation". Proceed ings of the 4th Machine Translation Summit, 173-176. Kobe, Japan. 1993. "Human-Oriented Design and Human-Machine-Human Interactions in Machine Interpretation". Technical Report 0013. Kyoto: ATR-ITL. _. 1993. "Multimodal Interactive Disambiguation: First Report on the MIDDIM Project". Technical Report 0014. Kyoto: ATR-ITL. & K-H. Loken-Kim. 1993. "Human-Machine-Human Interactions in Inter preting Telecommunications". Proceedings of International Symposium on Spoken Dialogue. Tokyo, Japan. & M. Axtmeyer. 1994. "Documents Prepared for Inclusion in MIDDIMDB". Internal Report. Grenoble: GETA, IMAG (UJF & CNRS). 1994. "On the design of MIDDIM-DB, a Data Base of Ambiguities and Dis ambiguation Methods". Technical Report 0072. Kyoto & Grenoble: ATRITL & GETA-IMAG. & H. Blanchon. 1995. "Multilingual Dialogue-Based MT for monolingual authors: the LIDIA project and a first mockup". Seminor Report on Machine Translation. Grenoble.
204
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
Maruyama, Hiroshi, H. Watanabe & S. Ogino. 1990. "An Interactive Japan ese Parser for Machine Translation" ed. by H. Karlgren, Proceedings of 15th International Conference on Computational Linguistics (COLING-90), vol.II/III, 257-262. Helsinki, Finland. Tomokiyo, Mutsuko & K-H.Loken-Kim. 1994. "Ambiguity Analysis and MIDDIMDB". Technical Report 0064. Kyoto & Grenoble: ATR-ITL & GETA-IMAG. . 1994. "Ambiguity Classification and Representation". Proceedings of Nat ural Language Understanding and Models of Communication (NLC-94 work shop). Tokyo. Park Young Dok & K-H.Loken-Kim. 1994. "Text Database of the Telephone and Multimedia Multimodal Interpretation Experiment". Technical Report 0086. Kyoto: ATR-ITL. , K-H. Loken-Kim & L. Fais. 1994. "An Experiment for Telephone versus Multimedia Multimodal Interpretation: Methods and Subject's Behavior". Technical Report 0087. Kyoto: ATR-ITL. , K-H.Loken-Kim, S.Mizunashi & L.Fais. 1995. "Transcription of the Col lected Dialogue in a Telephone and Multimedia/ Multimodal WOZ Experi ment". Technical Report 0091. Kyoto: ATR-ITL. Winship, Joe. 1994. "Building MIDDIM-DB, a HyperCard data-base of ambigu ities and disambiguation methods". ERASMUS Project Report. Grenoble Brighton: GETA, IMAG (UJF CNRS) University of Sussex at Brighton.
AMBIGUITIES & AMBIGUITY LABELLING E x a m p l e of a short dialogue I. C o m p l e t e l a b e l l i n g in t e x t p r o c e s sor o r i e n t e d f o r m a t The numbers in square brackets are not part of the labelling format and are only given for convenience.
205
[15] A:and y o u ' l l t a k e t h e subway n o r t h t o Sanjo s t a t i o n [16]AA:0K [17] A : / I s / a t Sanjo s t a t i o n y o u ' l l g e t off and change t r a i n s t o t h i Keihan Kyotsu l i n e [18]AA: [hmm] [19] A:OK I.2 Turns
I.1 Text of the dialogue LABELLED DIALOGUE:" EMMI 10a"
LABELLED TURNS OF DIALOGUE "EMMI 10a"
[1] A:Good morning conference office how can I help you TURN [2] AA:[ah] yes good morning [1] AA:Good morning, c o n f e r e n c e o f f i c e , could you tell me please | ? How can I h e l p you? how to get from Kyoto UTTERANCES station to your conference center AA:Good morning, c o n f e r e n c e [3] A : / I s / [ah] yes (can you t e l l office(l) me) [ah](you) y o u ' r e going t o t h e conference c e n t e r (ambiguity EMMI10a-l-2.2.8.3 today ((scope ''conference o f f i c e ' ' ) [4] AA:yes I am t o a t t e n d t h i [uh] (status expert_system) Second I n t e r n a t i o n a l ( a d d r e s s (*speaker * h e a r e r ) ) Symposium { o n } I n t e r p r e t i n g (importance not-important) Telecommunications (multimodal facial-expression) [5] A : { [ o ? ] } OK n ' where a r e you (desambiguation_scope d e f i n i t i v e ) ) ) c a l l i n g from r i g h t now [6] A A : c a l l i n g from Kyoto s t a t i o n AA:How can I h e l p you? [7] A : / I s / OK, y o u ' r e a t Kyoto /TURN is not necessary here because an s t a t i o n r i g h t now other TURN appears. [8] AA:{yes} [9] A : { / b r e a t h / } and t o g e t t o t h e TURN I n t e r n a t i o n a l Conference Center you can e i t h e r t r a v e l [2] AA:[ah] y e s , good morning. | by t a x i bus or subway how Could you t e l l me p l e a s e would you l i k e t o go how t o g e t from Kyoto [10]AA:I t h i n k subway sounds l i k e s t a t i o n t o your t h e b e s t way t o me conference center? [11] A:OK [ah] you wanna go by The labeller distinguishes here a sure seg subway and y o u ' r e a t t h e mentation into 2 utterances. s t a t i o n r i g h t now [12]AA:yes UTTERANCES [13] A:OK so [ah] y o u ' l l want t o g e t A A : [ a h ] y e s ( 2 ) , good morning. back on t h i subway going n o r t h [14]AA:[hmm]
206
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
(ambiguity EMMI10a-2-5.1 ((scope "yes") (status user) (type CA (yes acknowledge)) (importance crucial) (multimodal prosody))) AA:Could you tell me please how to get from Kyoto station to your conference center(3)? (ambiguity EMMI10a-3-2.2.2 ((scope "your conference center") (status user) (type structure («your conferenceXcenter» «yourXconference center»)) (importance negligible) (multimodal prosody)))
/TURN
(type
Japanese
(importance
important)))
[6] AA:calling from Kyoto station [7] A A : / I s / OK, you're at Kyoto station(8) right now. (ambiguity EMMI10a-8-5.1 ((scope "you're at Kyoto station") (status expert_system) (type CA (yn-question inform)) (importance crucial) (multimodal prosody))) [8] AA :
{yes}
TURN [9] A:{/breath/} and to get to the International Conference Center you can either travel by taxi bus or subway. | how would you like to go
TURN is not necessary if there is only one utterance with no ambiguity of segmenta U T T E R A N C E tion. A:{/breath/} and to get to the [3] A:/Is/[ah] yes (can you tell me) [ah] (you) you're going to the conference center today(4) (ambiguity EMMI10a-4-5.2 ((scope "today") (status expert_system) (situation "the day they are speaking") (importance negligible) (multimodal "built-in calendar on screen"))) [4] AA:yes I am to(5) attend thi [uh] Second International Symposium {on} Interpreting Telecommunications (ambiguity EMMIlOa-5-3.1.2 ((scope "am to") (status user)
International Conference Center you can(9) either travel(9', 9") by taxi bus or subway(10). (ambiguity EMMIiOa-9-2.1 ((scope "can") (status expert_system) (type class(verb modal_verb)) (importance crucial))) (ambiguity EMMI10a-9'-2.1 ((scope "the International Conference Center you can either travel") (status expert_system) (type structure (
AMBIGUITIES & AMBIGUITY LABELLING
207
subway and you're at the s t a t i o n right now") (status expert-system) (type CA (yn-question inform)) (importance crucial) (multimodal prosody)))
>>>) (importance crucial) (multimodal prosody))) (ambiguity EMMI10a-9"-2.1 ((scope "travel") (status expert_system) (mode (infinitive imperative)) (importance crucial)))
[12]AA:yes [13] A:OK so [ah] you'll want to(13) get back on thi subway going north(14)
(ambiguity EMMI10a-10-2.2.2 ((scope "taxi bus or subway") (status expert_system) (type structure ( )) (importance important) (mult imodal prosody)))
(ambiguity EMMIlOa-13-3.1.2 ((scope "want to") (status interpreter) (type Japanese (type French ("vouloir" "devoir")) (importance important)))
A:How would you like to go /TURN
(ambiguity EMMI10a-14-2.2.2 This example is of the same kind as the very ( ( S C O p e " g e t back on t h i subway famous one:" Time flies like an arrow" !" Linguist's going n o r t h " ) examples" are often derided, but they really (status user) appear in texts and dialogues. However, as (type s t r u c t u r e (>" to hold must be at least 2. 4.2
Building on orders of constituents
The Preference Table ?? presents some of the main PREFERENCES for gener ating orders of Polish constituents depending on specific CONDITIONS. Each line of the table can be treated as an independent if-then rule co-specifying (certain aspects of) an order. Different rules can be applied independ ently thus possibly better determining a given order4. The JUSTIFICATION column provides some explanation of the validity of each rule. It might be the case that as a result of applying the Preference table, we obtain too many orders. The Discrimination Table 4 provides some ra tionale for excluding those matching ORDERS for which one of their DISCRI MINATION conditions fails. If the building stage left us with no possible or ders at all, we could allow any order and pick only those which successfully pass all their discrimination tests. It is purposeful that all orders apart from the canonical SVO have some discrimination conditions attached to them. The rarer the order tends to be the more strict the condition. Therefore, SVO is expected to prevail. Both the Preference table and the Discrim ination table are mostly based on statistical data described in (Siewierska 1987; 1993a,b). There remains a number of cases which escape simple characterisation in terms of 'preferred and not-discriminated'. The Preprocessing Table 5 4
Orders derived by co-operation of several rules could be preferred in some way.
220
MALGORZATA STYS & STEFAN ZEMKE
Pref.
CONDITIONS
i ii iii iiib
Orderings implied by center information center (Any) < 0 -Any Final position of new center(Cl) >> center(A2) -Anyl-Any2Given-new principle center(X) > 1 XAdjunct topic fronted discrete_center(Prim) (X-)(V-)Prim- Primary center fronted
iv ν vi vii viii ix X
xi xii xiii
PREFERENCE
JUSTIFICATION
Statistical positioning preferences Statistical XV-S-O-V-S-O- & -XStatistical XV-O-S-0-S- & XStatistical XV-O-S-V-O-S- & -XStatistical XS-V-O-S-V-O- & -XS-V-OX Statistical -S-V-O- & -XStatistical O-V-SX -O-V-S- & -XO-VXS Statistical -O-V-S- & -XPron(S) (& center_shift(Un)) Stylistic -vsGeneral Statistical -v-oStatistical preferences -s-o-
(66%) (53%) (32%) (30%) (29%) (26%)
(89%+) (81%)
Table 3: Center values for example clauses offers some solutions under such circumstances. It is to be checked for its conditions before any of the previous tables are involved. If a condition holds, its result (e.g., 0-anaphora) should be noted and only then the other tables applied to co-specify features of the translation as described above. The Preprocessing table can yield erroneous results when applied repeatedly for the same clause. Therefore, unlike the other tables, it should be used only once per utterance. In Table we continue the example from Table 3. The orderings built on by a cooperation of the Preprocessing/Preference and not refused by the Discrimination table appear in the last column. 5
Conclusion
One of the aims of this research was to exploit the notion of center in Polish and put it forward in context of machine translation. The fact that centers are conceptualised and coded differently in Polish and English has clear repercussions in the process of translation. Through exploring the pragmatic, semantic and syntactic conditions underlying the organisation
DISCOURSE ASPECTS IN ENGLISH - POLISH MT Discr,. i ii iii iv V
vi vii viii ix X
ORDER
DISCRIMINATION
221
JUSTIFICATION
-V-S-O- length(S) < length(O) -V-S-O- -V-S-0 -V-S-O- Pron(S) -V-O-S- length(O) < length(S) -V-O-S- -X- present -S-O-V- SOV -S-O-V- center(S,[Un+1) > 0 -O-S-V-O-S-V- length(O) > length(S) -O-V-S length(O) > length(S)
osvx
Statistical Statistical Stylistic Statistical Statistical Statistical Statistical Statistical Statistical Statistical
(99%) (87%) (96%) (89%) (50%+) (79%) (100%) (64%)
Table 4: Discrimination table RESULT
JUSTIFICATION
Pre.
CONDITIONS
i ii iii iv
0-anaphora S='we' S=[ ] pron(O) & pron(S) S=[ ] Sub(Un) = S u b ( U n ) (& pron(S)) S=[ ] center_continuing(Un) S=[ ]
Rhythmic Stylistic Stylistic Stylistic
v vi
Special constructions -'only' SV- & pron(S) -'tylko' SVX=[ ] & pron(O) SOV
Focus binding expr. Special: S,0,V only
Table 5: Preprocessing table of utterances in both languages, we have been able to devise a set of rules for communicatively motivated ordering of Polish constituents. Among the main factors determining this positioning are pronominalisation, lexical reiteration, definiteness, grammatical function and special centered constructions in the source language. Their degree of topicality is coded by the derived center values. Those along with additional factors, such as the length of the originating Polish constituents and the presence of adjuncts, are used to determine justifiable constituent order in the resulting Polish clauses.
222
MALGORZATA STYS & STEFAN ZEMKE
1
2
PREFERENCE
PARTIAL
DISCRIMINATION
CRITERIA
ORDERS
(FAILING)
Pref.xii Pref.xiii
VSO
SVO (Discr.iii)
No rules apply, order unchanged
3
Pref.iiib (Pref.xii)
OVS VOS OSV
4
Pre.iii Pref.xi
S=[] -VS-
5
Pref.iiib (Pref.xii)
SVO VSO
Discr.x (Discr.v) (Discr.viii)
RESULTING ORDER(S)
SVO
SVX OVS
V[S]X
SVO (Discr.i)
Table 6: Example continued: Deriving constituent orders In further research, we wish to extend the scope of translated constructions to di-transitives and passives. We shall also give due attention to relative clauses. Centering in English can be further refined by allowing verbal and adjectival centers as well as by determining anti-center constructs. We have thus tackled the question of information distribution in terms of communicative functions and examined its influence on the syntactic structure of the source and target utterances. How and why intersentential relations are to be transmitted across the two languages remains an intricate question, but we believe to have partially contributed to the solution of this problem. REFERENCES Brennan, Susan E., Marilyn W. Friedman & Carl J. Pollard. 1987. "A Centering Approach to Pronouns". Proceedings of the Annual Conference of the Asso ciation for Computational Linguistics (ACL'87), 155-162. Stanford, Calif. Firbas, Jan. 1992. Functional Sentence Perspective in Written and Spoken Com munication. Cambridge: Cambridge University Press.
DISCOURSE ASPECTS IN ENGLISH - POLISH MT
223
Grosz, Barbara J. 1986. "The Representation and Use of Focus in a System for Understanding Dialogs". Readings in Natural Language Processing ed. by Grosz, Barbara, K. Jones & B. Webber, 353-362. Los Altos, Calif.: Morgan Kaufmann Publishers. , Aravind K. Joshi & Scott Weinstein. 1995. "Centering: A Framework for Modelling the Local Coherence of Discourse". Computational Linguistics 21:2.203-225. Gundel, Jeanette K. 1993. "Centering and the Givenness Hierarchy: A Pro posed Synthesis". Workshop on Centering Theory in Naturally Occurring Discourses. Philadelphia: University of Pennsylvania. Kameyama, Megumi. 1986. "A Property Sharing Constraint in Centering". Pro ceedings of the 24th Annual Conference of the Association for Computational Linguistics (ACL'86), 200-206. Columbia, N.Y. Mitkov, Ruslan. 1994. "A New Approach for Tracking Center". Proceedings of the International Conference "New Methods in Language Processing"', 150154. Manchester: UMIST. Siewierska, Anna. 1987. "Postverbal Subject Pronouns in Polish in the Light of Topic Continuity and the Topic/Focus Distinction". Getting One's Words into Line ed. by J. Nuyts and G. de Schutter, 147-161. Dordrecht: Foris. 1993a. "Subject and Object Order in Written Polish: Some Statistical Data". Folia Linguistica 27:1/2.147-169. 1993b. "Syntactic Weight vs. Information Structure and Word Order Variation in Polish". Journal of Linguistics 29:233-265. Szwedek, Aleksander J. 1976. Word Order, Sentence Stress and Reference in English and Polish. Edmonton: Linguistic Research, Inc. Walker, Marilyn Α., Masayo Ida & S. Cote. 1994. "Japanese Discourse and the Process of Centering". Computational Linguistics 20:2.193-227.
Two Engines Are Better Than One: Generating More Power and Confidence in the Search for the Antecedent RUSLAN M I T K O V
University of Wolverhampton Abstract The paper presents a new combined strategy for anaphor resolution based on the interactivity of two engines which, separately, have been successful in anaphor resolution. The first engine incorporates the constraints and preferences of an integrated approach for anaphor resolution reported in (Mitkov 1994), while the second engine follows the principles of the uncertainty reasoning approach described in (Mitkov 1995). The combination of a traditional and an alternative approach aims at providing maximal efficiency in tackling the tough problem of anaphor resolution. Preliminary results already show improved performance when both approaches are united into a more powerful and confident searcher for the antecedent. 1
Introduction
Approaches to anaphor resolution have so far been mostly linguistic (Carbonel & Brown 1988; Hayes 1981; Hobbs 1978; Ingria & Stallard 1989; Lapin & McCord 1990; Nasukawa 1994; Pérez 1994; Preuß et al. 1994; Rich & LuperFoy 1988; Rolbert 1989) with the exception of a few pro jects where statistical (Dagan & Itai 1990) or machine learning (Cononoly, Burger & Day 1994) methods have been developed. Given the complexity of the problem and its central importance in Natural Language Processing, it would be wise to consider a combination of various approaches to comple ment the traditional methods and increase chances of success by combining the advantages of each method used. We have already reported on an integrated approach for anaphor resolu tion based on linguistic constraints and preferences and a statistical method for center tracking (Mitkov 1994). As an alternative, we have successfully developed an uncertainty reasoning approach (Mitkov 1995). To improve performance, we have recently developed a combined strategy based on two engines: the first engine searches for the antecedent using the integrated ap proach, whereas the second engine performs uncertainty reasoning to rate
226
RUSLAN MITKOV
the candidates for antecedents. The preliminary tests show encouraging results. 2
A n i n t e g r a t e d a n a p h o r resolution approach
Our anaphor resolution model described in (Mitkov 1994) incorporates mod ules containing different types of knowledge — syntactic, semantic, domain, discourse and heuristic (Figure 1).
Fig. 1: An integrated anaphor resolution architecture The syntactic module, for example, knows that the anaphor and antecedent must agree in number, gender and person. It checks if the c-command constraints hold and establishes disjoint reference. In cases of syntactic parallelism, it prefers the noun phrase with the same syntactic role as the anaphor as the most probable antecedent. It knows when cataphora is
TWO-ENGINE APPROACH TO ANAPHOR RESOLUTION
227
possible and can indicate syntactically topicalised noun phrases, which are more likely to be antecedents than non-topicalised ones. The semantic module checks for semantic consistency between the anaphor and the possible antecedent. It filters out semantically incompatible candidates following verb semantics or animacy of the candidate. In cases of semantic parallelism, it prefers the noun phrase, which has the same semantic role as the anaphor, as the most likely antecedent. Finally, it generates a set of possible antecedents whenever necessary. The syntactic and semantic modules have been enhanced by a discourse module which plays a very important role because it keeps a track of the centers of each discourse segment (it is the center which is, in most cases, the most probable candidate for an antecedent). Based on empirical studies from the sublanguage of computer science, we have developed a statistical approach to determine the probability of a noun (verb) phrase to be the center of a sentence. Unlike the known approaches so far, our method is able to propose the center with high probability in every discourse sentence, including the first one. The approach uses an inference engine based on Bayes' formula which draws an inference in the light of some new piece of evidence. This formula calculates the new probability, given the old probability plus some new piece of evidence (Mitkov 1994). The domain knowledge module is practically a knowledge base of the concepts of the domain considered and the discourse knowledge module knows how to track the center of the current discourse segment. The heuristic knowledge module can sometimes be helpful in locating the antecedent. It has a set of useful rules (e.g., the antecedent is preferably to be located in the current sentence or in the previous one) and can forestall certain impractical search procedures. The referential expression filter plays an important role in filtering out the impersonal 'it'-expression (e.g., "it is important", "it is necessary", "it should be pointed out" etc.), where 'it' is not anaphoric. The syntactic and semantic modules usually filter the possible candid ates and do not propose an antecedent (with the exception of syntactic and semantic parallelism). Generally, the proposal for an antecedent comes from the domain, heuristic, and discourse modules. 3
A n uncertainty reasoning approach
We have developed a new uncertainty reasoning approach for anaphor res olution (Mitkov 1995). The strategy for determining the antecedent of a
228
RUSLAN MITKOV
pronoun uses AI uncertainty reasoning techniques. Uncertainty reasoning was selected as an alternative because: 1. in Natural Language Understanding, the program is likely to estimate the antecedent of an anaphor on the basis of incomplete information: even if information about constraints and preferences is available, it is natural to assume that a Natural Language Understanding program is not able to understand the input completely; 2. the necessary initial constraint and preference scores are determined by human beings; therefore the scores are originally subjective and should be regarded as uncertain facts. The uncertainty reasoning approach makes use of various 'anaphor resol ution symptoms' which have already been studied in detail. Apart from the widely used syntactic and semantic constraints and preferences such as agreement, c-command constraints, parallellity, topicalisation, verb-case role, the approach makes use of further symptoms based on empirical evid ence like subject preference, domain concept preference, object preference, section head preference, reiteration preference, definiteness preference, main clause preference etc. The availability/non-availability of a certain symptom will correspond to an appropriate score or certainty factor (CF) which is attached to it. For instance, the availability of a certain symptom s assigns CFs a w (0 < CFsav < 1), whereas the non-availability corresponds to CFsnon-av ( - 1 < CFs -av CFthreshoid for affirmation, or CFhyp < CFmin for rejection of the hypothesis. The evaluation process is clearly divided into two steps: 1. proposal of a hypothesis on the basis of preliminary (usually 3-5) tests on the most 'significant' symptoms; and 2. hypothesis verification.
TWO-ENGINE APPROACH TO ANAPHOR RESOLUTION
229
We use a hypothesis verification formula for recalculation of the hypothesis on the basis of availability (in our case also of non-availability) of certain symptoms. The present version of the formula is a modified version of the formula in (Pavlov, Mitkov & Filev 1989) which we have already successfully used in adaptive testing. 4
T h e two-engine s t r a t e g y
Two engines are better than one: on the basis of the above two developed and tested approaches, we have studied and proposed a combined strategy which incorporates the advantages of each of these approaches, generating more power and confidence in the search for the antecedent. The two-engine strategy evaluates each candidate for anaphor from the point of view of both the integrated approach and the uncertainty reason ing approach. If opinions coincide, the evaluating process is stopped earlier than would be the case if only one engine were acting. This also makes the searching process shorter: our preliminary tests show that the integrated approach engine needs about 90% of the search it would make when oper ating on its own; similarly, the uncertainty reasoning engine does only 67% of the search it would do when operating as a separate system. In addition, the results of using both approaches are more accurate (see table below). This combined strategy enables the system to consider all the symptoms in a consistent way; it does not regard any symptom as absolute or uncon ditional. This 'behaviour' is very suitable for symptoms like 'gender' (which could be regarded as absolute in languages like English but 'conditional' in languages like German) or 'number' 1 . The rationale for selecting a two-engine approach is the following: 1. two independent estimations, if confirmed, bring more confidence in proposing the antecedent; 2. the use of two approaches could be usefully complementary: e.g., the conditionality of gender is better captured by uncertainty reasoning; 3. in sentences with more than one pronoun, center tracking alone (and therefore the integrated approach) is not very helpful for determining the corresponding antecedents; 4. though the uncertainty reasoning approach may be considered more stable in such situations, it is comparatively slow but could adopt a 1
In German for instance, "Mädchen" (girl) is neuter, but one can refer to "Mädchen" by a female pronoun (sie). In other languages, singular pronouns (e.g., some singular pronouns denoting a collective notion) may be referred to by a plural pronoun.
230
RUSLAN MITKOV
lower CF, if intermediate results obtained by both engines are reported to be close (lower CF will result in making its process faster); and 5. the two-engine strategy does not depend exclusively on the notion of 'center' which is often considered as 'intuitive' in nature. We see that in certain situations one of the engines could be expected to operate more successfully than the other and complement it but it is also the parallel confirmation of the results obtained that generates more confidence in the search for the antecedent. We have implemented the integrated model as a program which runs on Macintosh computers and the following table shows its success rate. Four text excerpts served as inputs, each text taken from a computer science book. Excerpts ranging from 500 to 1000 words, estimated to contain a comparatively high number of pronouns were selected (it was not always easy to find paragraphs abundant in pronominal anaphors). These docu ments were different from the corpus initially used for the development of various 'symptom rules' and were hand-annotated (syntactic and semantic roles). Other versions of these excerpts, which contained anaphoric refer ences marked by a human expert, were used as an evaluation corpus. We tested on these inputs the three programs (i) the integrated ap proach, (ii) the uncertainty reasoning approach, and (iii) the two-engine approach. The results (Table 1) show an improvement in resolving anaphors when the integrated approach and the uncertainty reasoning approach are combined into a two-engine strategy. Table 1: Anaphor resolution success rate using different approaches
Text Text Text Text
5
1 2 3 4
INTEGRATED
UNCERTAINTY
TWO-ENGINE
APPROACH
REASONING
STRATEGY
89.1 92.6 91.7 88.6
% % % %
87.3 93.6 90.4 89.2
% % % %
91.7 95.1 93.8 93.7
% % % %
Illustration
As an illustration of how the new approach works, consider the following sample text:
TWO-ENGINE APPROACH TO ANAPHOR RESOLUTION
231
SYSTEM PROGRAMS
System programs, such as the supervisor and the language trans lator should not have to be translated every time theyi are used, otherwise this would result in a serious increase in the time spent in processing user's programs. System programsi are usually written in the assembly version of machine languages and are translated once into the machine code itself. From then on theyi can be loaded into memory in machine code without the need for any intermediate translation phase. Step 1. Integrated approach engine: A N A P H O R = they CANDIDATES: {system programs, machine languages, assembly version, machine code, user's programs} A G R E E M E N T CONSTRAINTS: {system programs, machine languages, user's programs} SEMANTIC CONSTRAINTS: {system programs, machine languages, user's programs} (no discrimination) C E N T E R TRACKING: {system programs} (proposed with higher probabil ity) Step 2. Uncertainty reasoning approach engine: A N A P H O R = they CANDIDATES: {system programs, machine languages, assembly version, machine code, user's programs} Candidate 1: system programs symptom 1: number — CFnumber = 0.3; symptom 2: person - CFperson = 0.3; CFhyp = 0.3 + 0.3 - 0.3 * 0.3 = 0.51 symptom 3: gender — CFgenader = 0.3; CFhyp = 0.657 symptom 4: verb case role — Fverbcaseroles = 0.3; CFhyp = 0.7029 symptom 5: syntactic parallelism - CFsyntactic parallelism = -0.2; CFhyp = (0.7029 - 0.2)/[l - min(|0.7029|, | - 0.2|)] = 0.6286 symptom 6: semantic parallelism - CFsemantic parallelism = 0.5; CFhyp = 0.8143 symptom 7: topicalisation - Ftocalisation = 0.0; CFhyp = 0.8143 symptom 8: subject - CFsubject - 0.25; CFhyp — 0.8607 symptom 9: repetition — CFrepetition — 0.6; CFhyp = 0.8906 symptom 10: head - CFhead ~ 0.35; CFhyp = 0.8923 symptom 11: previous — CFprevious — 0.15; CFhyp = 0.9085
232
RUSLAN MITKOV
Candidate 1 accepted Candidate 2: machine languages symptom 1: number — CFnumber = 0.3; symptom 2: person - CFperson — 0.3; CFhyp — 0.3 + 0.3 - 0.3 * 0.3 = 0.51 symptom 3: gender — CFgender — —0.6; CFhyp = (0.51 - 0.6)/[l - mm(|0.51|, | - 0.6|)] = -0.1836 symptom 4: = verb case role — CFverbcaseroles = 0.1; CFhyp = (-0.1836 + 0.1)/[1-min|0.1|,|-0.1836|) = -0.0929 symptom 5: syntactic parallelism - CFsyntacticparaellisrn = -0.2; CFhyp - -0.1672 symptom 6: semantic parallelism - Fsemanticparalllelsm = -0.2; CFhyp = -0.2938 Candidate 2 r e j e c t e d Due to results similar to those for candidate 2, candidates 3, 4 and 5 are also rejected. Step 3. Results in 1 and 2 confirm the selection of "system programs" as the antecedent. Using the uncertainty reasoning approach within the two-engine ap proach would mean that the CF could be put lower: a CF of 0.89 would be in this case satisfactory, shortening the procedure by two steps. On the other hand, the uncertainty approach confirms the proposal of the integ rated approach.
6
Conclusion
We have presented a two-engine strategy for pronoun resolution which com bines the engines and advantages of an integrated architecture for anaphor resolution (Mitkov 1994) and of an uncertainty-based anaphor resolution model (Mitkov 1995). Preliminary evaluations show improvement of per formance and though further investigations and comparisons have to be carried out, we believe that the first results can be regarded as promising.
TWO-ENGINE APPROACH TO ANAPHOR RESOLUTION
233
REFERENCES Barros, Flávia A. & Anne Deroeck. 1994. "Resolving anaphora in a portable natural language front end to databases". Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP'94), 119-124. Stuttgart, Germany. Carbonell, James G. & Ralf D. Brown. 1988. "Anaphora Resolution: A MultiStrategy Approach". Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), vol.1, 96-101. Budapest, Hungary. Connoly, Dennis, John D. Burger & David S. Day. 1994. "A Machine Learning Approach to Anaphoric Reference". Proceedings of the International Confer ence "New Methods in Language Processing", 255-261. Manchester: UMIST. Dagan, Ido & Alon Itai. 1990. "Automatic Processing of Large Corpora for the Resolution of Anaphora References". Proceedings of the 13th Interna tional Conference on Computational Linguistics (COLING-90), vol.III, 1-3. Helsinki, Finland. Hayes, Philip J. 1981. "Anaphora for Limited Domain Systems". Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI'81), 416-422. Vancouver, Canada. Hirst, Graeme. 1981. Anaphora in Natural Language Understanding. Springer Verlag.
Berlin:
Hobbs, Jerry R. 1978. "Resolving Pronoun References". Lingua 44.339-352. Ingria, Robert J.P. & David Stallard. 1989. "A computational mechanism for pronominal reference". Proceedings of the 27th Annual Meeting of the ACL, 262-271. Vancouver, British Columbia. Lappin, Shalom & Michael McCord. 1990. "Anaphora Resolution in Slot Gram mar". Computational Linguistics 16:4.197-212. Mitkov, Ruslan. 1994a. "An integrated model for anaphora resolution". Pro ceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1170-1176. Kyoto, Japan. 1994b. "A New Approach for Tracking Center". Proceedings of the International Conference "New Methods in Language Processing", 50-54. Manchester: UMIST. 1995. "An Uncertainty Reasoning Approach for Anaphora Resolution". Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'95), 149-154. Seoul, Korea. Nasukawa, Tetsuya. 1994. "Robust Method of Pronoun Resolution Using FullText Information". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94)·, 1157-1163. Kyoto, Japan.
234
RUSLAN MITKOV
Pavlov, Radoslav, Ruslan Mitkov & Philip Filev. 1989. "An Adaptive Uncer tainty Reasoning-Based Model for Computerised Testing". Proceedings of the 3rd International Conference "Children in the information age", 92-98. Sofia, Bulgaria. Rico Pérez, Celia. 1994. Statistical-algebraic approximation to discourse ana phora. Ph.D. dissertation, Department of English Philology, University of Alicante. Alicante, Spain. [In Spanish.] Preuß, Susanne, Birte Schmitz, Christa Hauenschild & Carla Umbach. 1994. "Anaphora Resolution in Machine Translation". Studies in Machine Trans lation and Natural Language Processing, vol.VI (Text and content in Machine Translation: Aspects of discourse representation and discourse processing) ed. by Wiebke Ramm, 29-52. Luxembourg: Office for Officiai Publications of the European Community. Rich, Elaine & Susann LuperFoy. 1988. "An Architecture for Anaphora Res olution". Proceedings of the 2nd Conference on Applied Natural Language Processing, 18-24. Austin, Texas. Rolbert, Monique. 1989. Résolution de formes pronominales dans Vinterface d'interogation d'une base de données [Resolution of Pronouns in Natural Lan guage Front Ends]. Ph.D. dissertation, Faculty of Science, Luminy, France. Sidner, Candace L. 1986. "Focusing in the Comprehension of Definite Anaphora". Readings in Natural Language Processing ed. by Barbara J. Grosz et al., 363394. Los Altos, Calif.: Morgan Kaufmann.
Effects of Grammatical Annotation on a Topic Identification Task TADASHI N O M O T O
Advanced Research Laboratory, Hitachi Ltd. Abstract The paper describes a new method for discovering topical words in discourse. It shows that text categorisation techniques can be turned into an effective tool for dealing with the topic discovery problem. Experiments were done on a large Japanese newspaper corpus. It was found that training the model on annotated corpora does lead to an improvement on the topic recognition task. 1
Introduction
The problem of identifying a topic or subject matter of discourse has long attracted attention from diverse research paradigms. In computational lin guistics, the problem more or less takes the form of resolving the anaphora (Hobbs 1978; Grosz & Sidner 1986; Lappin & Leass 1994) or locating the discourse center (Joshi & Weinstein 1981; Walker et al. 1994). In inform ation retrieval (IR), the problem came to be known as text categorisation (TC), which concerns classifying documents under a set of pre-defined cat egories (Lewis 1992; Finch 1994) While incorporating some of the insights from computational linguistics, the present work extends the text categorisation paradigm to solve the topic identification issue. The paper recasts the topic identification as a task of finding in a text representative words under which that text is most likely to classify. Thus to identify a topic in the text requires us working with an unbounded set of categories rather than with a bounded, possibly small number of pre-defined categories. Although the idea of using a complex representation of text has yet to prove its value in text classification (Lewis 1992), recent years have seen some progress in the area of the corpus-based NLP toward exploiting lin guistic representation more sophisticated than the simple word form (Hindle 1990). As our part of contribution to research in this direction, we are going to report some promising results from experiments with Japanese corpora. In particular, we will show that a representation based on postpositional
236
TADASHI NOMOTO
phrases (PP), i.e., phrases composed of a noun and following case particle, is more effective for the topic identification task than a simple word-based representation. (Futsu) (gin) -ga (Kiev)-ni (chuuzai) (in) (jimusho). French bank SBJ Kiev at resident staff office
[(Futsu) (gin) (Oote)
]-no
[(Société) (Général)] -wa
French bank big-name which is Société
General
(U*) U-
(Kiev)] Kiev
(kura*) kra-
(ina*)] ina
-no whose
[(shuto) capital
[(15-nichi**),
as for on 15th -ni at
[(chuzai) resident
(in) staff
(jimusho)]-wo [(kaisetsu)]-ø suruto [(happyo)]-ø shita. Sude-ni [(Kiev) (shi) office OBJ open plan disclose did Already Kiev city (tookyoku)]-no
authority
[(kyoka)
]-mo
eta
to-iwu.
whose permission as well obtained sources say
M A J O R F R E N C H BANK OPENS OFFICE IN K I E V
Société Général, a major French bank, disclosed on the 15th a plan to open a resident office in Kiev, capital of Ukraine. The bank has already obtained a permission from the city authority, sources say. Fig. 1: Annotating a news story
2
Topic recognition model
This section describes an approach to the topic identification problem. What we are going to do is formulate the problem as a text categorisation task, i.e., one of classifying documents with respect to a set of pre-defined categories. The formulation is fairly straightforward; we define the problem of finding a topic in text as that of categorising a text with respect to a set of nouns derived from that text. A most probable topic is, then, one with which the text in question is most likely to classify. The job of text categorisation is to estimate C(c | d), the likelihood that a document d is assigned to a category Let us call a word with which to classify the text a potential topic of the text. Given a set W(d) of words comprising a text d and a set S(d) of potential topics for d, the job of topic identification is to find an estimate C(c | d), for € 5(d), where S(d) W(d).
EFFECTS OF GRAMMATICAL ANNOTATION Now let us consider a likelihood function, (c | d) =
237
defined by:
P(c | t)P{t | d)
which is meant to be a relativisation of the relationship between and d to some index t (Fuhr 1989); the index t could be a simple word or linguistically more sophisticated representation. The set of such indices is said to represent a text. Assume that every index t will be assigned to some category. Then by Bayes' theorem, we have an equation1:
Given a set R(d) of indices for a text d, we define the likelihood function C(c\d) by: forcЄ S(d),
We refer to the formula above as 'TRM' hereafter. 'T = w ' is meant to denote an event that a randomly selected word from a document coincides with w. P(c) represents the probability that a randomly selected document is assigned to category c; P(T = w \ c) is the probability that a word randomly selected from a document coincides with w, given that category is assigned to that document; P(T = w) denotes the probability that a word w is selected from a randomly chosen document; P(T = w \ d) is the probability that word w is randomly selected from document d. We estimate the component probabilities by: P(c) P(T
= W\C)
=
Dc/D
= FCW/FC*
P(T = w\d) = Fwd/F*d P(T = w) = FwdF*D 1
There are some choices we can make as to the nature of t. In the binary term index ing (Lewis 1992), a document is represented as a binary vector {0, l } n , which records presence/absence of terms for the document and t ranges over a set of possible doc uments. On the other hand, in the weighted term indexing (Iwayama &; Tokunaga 1994), a document is represented as a set of term frequencies and t ranges over a set of possible terms {wi, ...,wn}. A most important difference between the two indexing policies is that the former is concerned with document frequency, i.e., the number of documents in which a term occurs, while the latter is concerned with term frequency, the frequency of a term within a document. We decided to go along with the weighted term indexing policy, because the binary policy, as it stands, is known to fail where training data is not sufficiently available (Iwayama & Tokunaga 1994).
238
TADASHI NOMOTO
where D is the number of texts found in the training corpus, Dc is the number of texts whose title contains a term c, Fc is the number of indices in DC' Fwc is the frequency of w in Dc, Fd is the count of token indices in d, Fwd is the count of w in D, and FD is the total number of token indices in D. TRM is based on the simple assumption that if an index is more typical or characteristic of a text, it is more likely to associate with a topic of that text. For instance, turmeric is a popular coloring spice used in many of the recipes for Indian food. Thus the word 'turmeric' appears very often in an Indian cookbook and therefore, does not serve to indicate a particular dish. (For that matter, peas or beans may better indicate what a partic ular recipe is for.) How much typical an index is of a particular text, is determined statistically by measuring the degree of its maldistribution or skewness (Umino 1988). TRM uses the following measure for evaluating the typicalness of an index (call it Icd(w)): Idc d
_ P(T = w\c)P(T = (w) P(T = w)
w\d)
Let x and be an index for a text d. Suppose that Fxe = Fyc, and FxD = FyD. If Fxd > Fyd then χ has a more skewed distribution and contributes more to the value than /, i.e., Icd(x) > Icd{) The same result follows if Fxc = Fyc, Fxd= Fyd,and Fxd < FyD. In either case, χ is said to be more typical of d than y. 3
Text r e p r e s e n t a t i o n
A major concern of the paper is with finding out whether annotating corpora with some grammatical information affects the model's performance on the topic recognition task. In text categorisation, a text is represented in terms of an indexing language, a set of indices constructed from the vocabulary that makes up that text. We make use of two languages for indexing a text: one is formed from nouns that occur in the text and another is formed from nouns tagged with a postposition of a phrase in which they occur. For a text di let R+(d) be a indexing language with taggings and R~(d) be one without. Annotating a text goes through two processes, the tokenising of a text into a array of words and tagging words in a postpositional phrase with its postposition, or case particle. We start by dividing a text, which is nothing but a stream of characters, into words. The procedure is carried out with
EFFECTS OF GRAMMATICAL ANNOTATION
239
the use of a program called JUMAN, a popular public-domain software for the morphological analysis of Japanese (Matsumoto et al. 1993). Since there was no Japanese parser robust enough to deal with free texts such as one used here, postpositional phrases were identified by using a very simple strategy of breaking an array of word tokens into groups at punctuation marks ('.,') as well as at case particles. After examining the results with 10 to 20 texts, we decided that the strategy was good enough. Each token in a group was tagged with a case particle which is a postposition of the group. Figure 1 lists a sample news article from the test data used in our exper iments. The part above the horizontal line corresponds to a headline; the part below the line corresponds to the body of article. We indicate nouns by a parenthesis '( )' and case particles by a preposed dash '—'. In addition, we use a square bracket '[ ]' to indicate a phrase for which a case particle is a postposition. A tokenisation error is marked with a single star ('*'), a parsing error is doubly starred('**'). 'φ' indicates that a noun it attaches to is part of the verbal morphology and thus does not take a regular case particle. For the sake of readability, we adopt the convention of repres R-(d)
=
{ French, bank, big-name, Societé, General, on 15th, U-, kra-, ine, capital, Kiev, resident, staff, office, open, disclose, city, authority,permission }
R+(d) =
{ Frenchα", bankα", big-nameα", Societéβ, General β , on 15th α , U- α , kra- α , ineα", capital γ , Kiev γ , residentδ, staffδ, officeδ, open ø , discloseø, Kiev α , city α , authority α , permissionε } Fig. 2: Indexing languages
enting Japanese index words by their English equivalents. A plain index language is made up of nouns found in the sample article; an annotated index language is like a plain one except that nouns are tagged with case particles (denoted by superscripts). The list of the particles is given along with explanations in Table 1. Shown in Figure 2 are two kinds of indexing vocabulary derived from the news article example in Figure 1. Superscripts on words, α, β, γ, δ and e correspond to particles no, wa, ni, wo and mo, respectively; thus 'Socitéβ', for instance, is meant to represent a Japanese wa-annotated term 'Societé wa' and similarly for others. Notice that un like the plain index language, the language with annotation contains two
240
TADASHI NOMOTO ga no
SUBJECT
WO
OBJECT
wa ni to de e mo ka
AS FOR, AS REGARDS T O
kara yori
FROM
O F , WHOSE
FOR, TO AND AT, IN T O , IN T H E DIRECTION OF AS WELL OR FROM
Table 1: Case particles based on (Sahuma 1983) instances of 'Kiev', i.e., 'Kiev γ ' and 'Kiev α ', reflecting the fact that there are two particles in the news piece (no, ni) which are found to occur with the word 'Kiev'. 4
Experiments
In this section, we will report performances of the topic recognition model on indexing languages, with and without grammatical annotation. Recall that an indexing language is something that represents text corpora and usually consists of a set of terms derived in one way or another from the corpora. Our experiments used the total of 44,001 full-text news stories from Ni kon Keizai Shimbun, a Japanese economics newspaper. All of the stories appeared in the first half of the year 1992. Of these, 40,401 stories, which appeared on May 31, 1992 and earlier, were used for training and the re maining 3,600 articles, which appeared on June 1, 1992 or later, were used for testing. 4.1
Test setting
We divided the test set into nine subsets of stories according to the length of the story. The subsets each contained 400 stories. The test set 1, for instance, contains stories less than 100 (Japanese) characters in length, the test set 2 consists of stories between 100 and 200 characters in length, and the test set 3 contains stories whose length ranges from 200 to 300 characters (Table 2).
EFFECTS OF GRAMMATICAL ANNOTATION test set
Ί 2 3 4 5 6 7 8 9
length (in char.) < 100 100-200 200-300 300-400 400-500 500-600 600-700 700-800 800-900
241
num. of doc. 400 400 400 400 400 400 400 400 400
Table 2: Test sets
The topic identification is a two-step process: (1) it estimates, for each potential topic, the degree of its relationship with the text, i.e., L(c | d), and (2) then identifies a potential topic which is likely to be an actual topic of the text 2 . This involves using decision strategies like k-per doc, proportional assignment and probabilistic thresholding. The estimating part will use TRM as a measure of relationship between a potential topic and a text d, for c G S(d) and d Є D. TRM takes as inputs a text d from the test corpus and a potential topic c, and determines how actual is with respect to d. Here are some details on how to estimate probabilities. The training set of 40,401 stories were used to determine prior probabilities for P(c), P(T = w), and P(T — w | c). P(c) is the probability that a story chosen randomly from the training set is assigned to a title term c. As mentioned in Section 2, we estimated the probability as Dc/D, where Dc is the number of texts whose title has an occurrence of c, and D is the total number of texts, i.e., D = 40,401. The estimation of P(T = w) and P(T = w \ c) ignored the frequency of w in a title. P(T = w) was estimated as FwD/F*D, with F ¿ = 3,213,617, the number of noun tokens found in the training corpus. We estimated P(T = w \ c) by FwcF*c where Fwc = ΣdeDcFwd and F*c = ΣdeDc R{d). Again in estimating P(T = w | c), we have counted out any of w's occurrences in a headline. P(T = w \ d) was estimated as Fwd/F*d for an input text d. We would have F*D = 19 for a text in Figure 1, which contains 19 noun tokens. Now for the deciding part. Based on the probability estimates of C(c \ d), we need to figure out which topic(s) should be assigned to the text. The text categorisation literature makes available several strategies for doing this (Lewis 1992). In the probabilistic thresholding scheme, a category (= 2
A potential topic is said to be actual if it occurs in the text's headline.
242
TADASHI NOMOTO
potential topic) is assigned to a document d just in case L(c \ d) > s, for some threshold constant s.3 In a k-per doc strategy, a category is assigned to documents with the top scores on that category. Another commonly used strategy is called proportional assignment A category is assigned to its top scoring documents in proportion to the number of times the category is assigned in the training corpus. In the experiments, we adopted a probabilistic thresholding scheme4. Al though it is perfectly all right to use the k-per doc here, the empirical truth is that the text categorisation fares better on the probabilistic thresholding than on the k-per doc. 4.2
Result and analysis
In what follows, we will discuss some of the results of the performance of the topic recognition model. The model was tested on 9 test sets in Table 2. For each of the test sets, we experimented with two indexing languages, one with annotation and one without, to observe any effects annotation might have on the recognition task. The goal was to determine terms most likely to indicate a topic of the article on the basis of estimates of C(c | d) for each indexing term in the article. Following (Gale et al. 1992), we compare our model against a baseline model, which establishes lower bounds on the recognition task. We estimate the lower bound as the probability that a title term is chosen randomly from the document, i.e., P(c | d). The baseline represents a simple, straw man approach to the task, which should be outperformed by any reasonable model. The baseline model P(c | d) represents a simple idea that a word with more frequency would be a more likely candidate for topichood. Figure 3 shows the performance of the recognition model on plain and annotated indexing languages for a test corpus with stories less than 100 character long (test set 1). The baseline performance is also shown as a comparison. As it turns out, at the break-even point 5 , the model's perform3
4
5
One of the important assumptions it makes is that the probability estimates are com parable across categories as well as across documents: that is, it assumes that it is possible to have an ordering, L(c1 | d1) > L(c1 | d2) > · · · > L(cn | dm), among the possible category/document pairs in the test corpus. There is an obvious reason for not using the proportional assignment policy in our experiments. Since the set of categories (title terms) in the training corpus is openended and thus not part of the fixed vocabulary, it is difficult to imagine how the assignment ratio of a category in the training corpus is reflected on the test set. A break even point is defined to be the highest point at which precision and recall are
EFFECTS OF GRAMMATICAL ANNOTATION
243
ance is higher by 5% on the annotated language (54%) than on the plain language (49%). Either score is much higher than the baseline (19%). Table 3 summarises results for all of the test sets. We see from the table that grammatical annotation does enhance the model's performance6. Note, however, that as the length of a story increases, the model's performance rapidly degrades, falling below the baseline at test set 5. This happens regardless of whether the model is equipped with extra information. The reason appears to be that benefits from annotating the text are cancelled out by more of the irrelevancies or noise contained in a larger text. The increase in text length affects factors like S(d) and R(d), which we assumed to be equal. Recall that the former denotes a set of potential topics and the latter a set of indices or nouns extracted from the text. Thus the increase in text length causes both R(d) and S(d) to grow accordingly. Since the title length stays rather constant over the test corpus, the possibility that an actual topic is identified by chance would be higher for short texts 6
equal. It is intended to be a summary figure for a recall precision curve. Figures in the table are micro-averaged, i.e., expected probabilities of recall/precision per categorisation decision (Lewis 1992).
244
TADASHI NOMOTO test set
Ï 2 3 4 5 6 7 8 9
length (in char.) < 1oo 100 - 200 200 - 300 300 -400 400 - 500 500 - 600 600 - 700 700 - 800 800 - 900
R-(d) 49% 42% 35% 31% 31% 30% 28% 25% 26%
R+{d) 54% 44% 37% 32% 33% 31% 29% 26% 26%
baseline
19% 33% 30% 32% 35% 35% 37% 34% 35%
Table 3: Summary statistics than for lengthy ones. Indeed we found that 13% of index terms were actual for the test set 1, while the rate went down to 3% for the test set 9. One way to increase its resistance to the noise would be to turn to the idea of mutual information (Hindle 1990) or use only those terms which strongly predict a title term (Finch 1994). Or one may try a less sophistic ated approach of reducing the number of category assignments to, say, the average length of the title. 5
Conclusion
In this paper, we have proposed a method for identifying topical words in Japanese text, based on probabilistic models of text categorisation (Fuhr 1989; Iwayama & Tokunaga 1994). The novelty of the present approach lies in the idea that the problem of identifying discourse topic could be recast as that of classifying a text with terms occurring in that text. The results of experiments with the Japanese corpus, showed that the model's performance is well above the baseline for texts less than 100 char acters in length, though it degrades as the text length increases. Also shown in the paper was that annotating the corpus with extra information is worth the trouble, at least for short texts. Furthermore, the model applies to other less inflectional languages, in so far as it works on a word-based represent ation. The next step to take would be to supply the ranking model with inform ation on the structure of discourse and develop it into a model of anaphora resolution (Hearst 1994; Nomoto & Nitta 1994; Fox 1987).
EFFECTS OF GRAMMATICAL ANNOTATION
245
Acknowledgements. The author is indebted to Makoto Iwayama and Yoshiki Niwa for discussions and suggestions about the work. REFERENCES Finch, Steven. 1994. "Exploiting Sophisticated Representations for Document Retrieval". Proceedings of the 4th Conference on Applied Natural Language Processing, 65-71, Stuttgart, Germany: Institute for Computational Lin guistics, University of Stuttgart. Fox, Barbara A. 1987. Discourse Structure and Anaphora. (= Cambridge Studies in Linguistics, 48). Cambridge: Cambridge University Press. Fuhr, Norbert. 1989. "Models for Retrieval with Probabilistic Indexing". formation Processing & Management 25:1.55-72.
In
Gale, William, Kenneth W. Church, & D. Yarowsky. 1990. "Estimating Up per and Lower Bounds on the Performance of Word-Sense Disambiguation Programs". Proceedings of the 22nd Annual Meeting of the Association for Computational Linguistics (ACL'90), 249-256. Grosz, Barbara & Candance Sidner. 1986. "Attention, Intentions and the Struc ture of Discourse". Computational Linguistics 12:3.175-204. Hearst, Marti A. 1994. "Multi-Paragraph Segmentation of Expository Text". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (CL'94, 9-16. Hindle, Donald. 1990. "Noun Classification from Predicate-Argument Struc tures". Proceedings of the 22nd Annual Meeting of the Association ¡or Com putational Linguistics, 268-275. Hobbs, Jerry. 1978. "Resolving Pronoun References". Lingua 44.311-338. Iwayama, Makoto & Takenobu Tokunaga. 1994. "A Probabilistic Model for Text Categorisation: Based on a Single Random Variable with Multiple Values". Proceedings of the 4the Conference on Applied Natural Language Processing, 162-167. Joshi, Aravind K. & Scott Weinstein. 1981. "Control of Inference: Role of Some Aspects of Discourse Structure — Centering". Proceedings of the Interna tional Joint Conference on Artificial Intelligence, 385-387. Lappin, Shalom & Herbert J. Leass. 1994. "An Algorithm for Pronominal Ana phora Resolution". Computational Linguistics 20:4.235-561. Lewis, David D. 1992. "An Evaluation of Phrasal and Clustered Representations on a Text Categorisation Task". Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Re trieval, 37-50.
246
TADASHI NOMOTO
Matsumoto, Yuji, Sadao Kurohashi, Takehito Utsuro, Yutaka Taeki & Makoto Nagao. 1993. Japanese Morphological Analysis System JUMAN Manual Kyoto, Japan: Kyoto University. [In Japanese.] Nomoto, Tadashi & Yoshihiko Nitta. 1994. "A Grammatico-Statistical Approach to Discourse Partitioning". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94)·, 1145-1149, Kyoto, Japan. Sakuma, Kanae. 1983. Gendai Nihongohō-no Kenkyu [A Study on Grammar of Modern Japanese]. Tokyo, Japan: Kuroshio-Shuppan. Umino, Bin. 1988. "Shutsugen-Hindo-Jyouhou ni-motozuku Tango-Omomizukeno-Genri [Some Principles of Weighting Methods Based on Word Frequencies for Automatic Indexing]". Library and Information Science 26:67-88. Walker, Marilyn, Masayo Iida &· Sharon Cote. 1994. "Japanese Discourse and the Process of Centering". Computational Linguistics 20.2:193-232.
Discourse Constraints on Theme Selection W I E B K E RAMM
University of the Saarland Abstract In this paper we deal with the area of thematisation as a grammat ical as well as a discourse phenomenon. We investigate how discourse parameters such as text type and subject matter can effect sentencelevel theme selection as one of the grounding devices of language. Aspects of local and global thematisation are described in terms of a system-functionally oriented framework and it is argued that cor relations between text-level and sentence-level discourse features can be modelled as inter-stratal constraints in a stratificational text gen eration architecture. 1
Introduction
Our starting point is the observation that language is quite flexible regard ing how a piece of information can be communicated; the same state of affairs often can be expressed by very different linguistic means such as word order alternatives, by using different lexical material or by applying different grammatical constructions. In most cases these options are not arbitrarily interchangeable, however, since in addition to the transmission of propositional meaning, a linguistic utterance also aims to achieve cer tain pragmatic effects which can only be reached when the information is presented in an appropriate manner. To this end, language is provided with special grammatical and semantic devices guiding the foregrounding and backgrounding of particular parts of information in a sentence (cf. Ramm et al. (1995:34f.)): - Focusing1 is a textual means responsible for the information distribution in a clause. The focus, which is usually intonationally marked, is the locus of principal inferential effort within each message (cf. Lavid 1994a:24) and has a typical correlation with what is new (in contrast to what is given) in a sentence. 1
The notions of focus as well as theme have found diverging interpretations in different linguistic and computational-linguistic schools (for a comparison cf. Lavid 1994a). The definitions we are working with here are mainly inspired by the theory of systemicfunctional linguistics (SFL). We will outline some central concepts of this approach below.
248
WIEBKE RAMM
- Thematisation (in its sentence-grammatical notion) is guiding the local contextualisation of a sentence by assigning particular thematic prominence to a part of the message, the theme . "The theme is the element which serves as the point of departure of the message; it is that with which the clause is concerned. The remainder of the messages, the part in which the theme is developed, is called ... the rheme" (Halliday21994:37). - Ranking relates to how an element of a situation (e.g., an event or en tity) is encoded grammatically, for instance, whether it is realised as a verbal construction, a nominalisation, a complement or a circumstance. The gram matical mechanisms of ranking closely interact with the textual means of focusing and thematisation. - Taxis — with its basic options hypotaxis and parataxis — provides an other type of grounding distinction rooted in grammar, this time in terms of a dependency structure holding between clauses. How these linguistic devices are actually deployed in the realisation of a message in order to achieve a particular communicative goal depends on factors such as the (local) textual context in which it appears, but also on global parameters, such as the text type to which the whole discourse belongs, of which the message forms a part, and the subject matter it is about. In this paper we will focus on the area of thematisation in German. In particular, we will investigate in which way aspects of global discourse organisation, namely text type and subject matter, may influence the se lection of grammatical theme on sentence level. The types of correlations we are looking for can be relevant for different NLP applications where the local, sentence-level, as well as the global, text-level, organisation of dis course has to be accounted for. Our application domain is text generation where one of the notorious problems is the gap between global-level text planning (strategic generation) and lexico-grammatical expression (tactical generation), which has been termed the generation gap (cf. Meteer 1992). The output quality of many full-scale generators is compromised because the text planner cannot exercise sufficient control on the fine-grained dis tinctions available in the grammar. We argue that some of the problems can be accounted for by recognising the variety of linguistic resources involved as distinct modules or strata in a multi-stratal language architecture and by representing characteristic correlations between selections on different strata as inter-stratal constraints.
DISCOURSE CONSTRAINTS ON THEME SELECTION 2
249
Text type, subject matter and theme selection
Before having a look at the realisation of theme in some concrete text ex amples, we will start with a few more words on the conception of grammat ical theme we are proceeding from and the options the German language provides according to our model. As mentioned in the beginning, our notion of theme is inspired by the theory of systemic-functional linguistics (SFL) (for an overview of basic ideas of SFL cf. Halliday 21994; Matthiessen & Bateman 1991), according to which theme is a textual resource of the language system which is — together with other cohesive and structural means such as reference, sub stitution, ellipsis, conjunction, lexical cohesion and focus — responsible for the coherence of a text. Theme as "the resource for setting up the local context' Matthiessen (1992:449) in which each clause is to be interpreted, the point of departure in Halliday's definition (see above), provides only one of the textually significant variation possibilities of word order; it closely in teracts with other resources such as focus, transitivity, voice/diathesis, and mood. The theme is a function with particular textual status (thematic prominence) in the clause and becomes the resource for manipulating the contextualisation of the clause. Theme in this systemic-functional meaning has originally been described with respect to English grammar; the account of theme in the German clause, some basic ideas of which we will briefly summarise now, is described in more detail in Steiner & Ramm (1995). For the realisation of theme in German, there is a clear rough correspond ence with what is described as 'Vorfeld' in other approaches (see e.g., Hoberg 1981), i.e., the theme is realised in the position before the finite verb. One of the typical features of a systemic functional account of theme is the observa tion that the theme can be realised by metafunctionally different elements, i.e., it can be ideational, interpersonal or textual. Metafunctional diver sification is a central notion of systemic-functional theory that reflects the view of language as being functionally diversified into three generalised func tions: the ideational which is concerned with propositional-content type of linguistic information; the interpersonal which provides the speaker/writer with the resources for creating and maintaining social relations with the listener/reader; and the textual which provides the resources for contextualising the other two types of information, i.e., presents ideational and interpersonal information as text in context (cf. Matthiessen & Bateman 1991:68). The following examples illustrate the three types of information: contextualisation of a message or proposition employing ideational means
250
WIEBKE RAMM
draws on circumstantial and participant roles of an event, e.g., Ich werde ge hen. (I will go.) In grammatical terms, this is a subject-theme. An example of contextualisation by interpersonal means is thematisation of an interac tion marker, such as a modal circumstantial role, e.g., Vielleicht werde ich gehen. (Possibly I will go.) On the grammatical level, the theme is filled by a modal adjunct. Contextualisation by textual means operates on the resource of logico-semantic relations, expressed grammatically by conjunc tions or conjunctive adjuncts, e.g., Daher werde ich gehen. (Therefore I will go.) Theme variation in German comprises two further dimensions, namely simple vs. multiple, and unmarked vs. marked theme. The former distinguishes themes realised by a single semantic function from those filled by more than one, the latter relates to whether a certain theme choice leads to marked intonation which closely relates to the area of focus. We will now investigate how these options surface in 'real-life' texts of different text types. The two texts we are going to have a look at are taken from a more representative corpus of short texts covering text types ranging from narrative, descriptive and expository to argumentative and instructive texts. The texts which have been selected in correspondence with a parallel corpus of English texts (cf. Lavid 1994b) have been analysed according to a number of parameters such as discourse purpose, subject matter, global chaining strategy, and focus category (cf. Villiger 1995). The first sample text — a section from a travel guide — is of the descriptive type. Text 1: "Sevilla" (from: T. Schröder: Andalusion. M. Müller Verlag, Erlangen, 1993, pp.332-333.) 2 (01) Sevillas Zentrum liegt östlich eines Seitenkanals des Rio Guadalquivir, der die Stadt etwa in Nord-Süd-Richtung durchzieht. (The Centre of Seville is situated east of a side canal of the Rio Guadalquivir which runs through the city roughly from north to south.) (02) Hauptstraße ist die Avenida de la Constitucion; (The main street is the Avenida de la Constitucion;)
(03) in ihrer unmittelbaren Umgebung liegen mit Kathedrale und Giralda
sowie der Alcazaba die bedeutendsten Sehenswürdigkeiten der Stadt. (in its surroundings,
immediate
the most important sights of the city, the cathedral, the Giralda, and the
Alcazaba, are situated.)
(04) Östlich schließt sich das Barrio de Santa Cruz an, Sevil
las lauschiges Vorzeigeviertel. (In the east, the Barrio de Santa Cruz, Seville's
secluded
showpiece quarter, borders on the city.) (05) Die Avenida de la Constitucion beginnt im Süden am Verkehrsknotenpunkt Puerta de Jerez und mündet im Norden in den Dop2
English glosses of the German text passages are given in italics; the sentence theme of each clause is underlined. If English theme is roughly equivalent in type and meaning, we have also underlined the themes in the English version.
DISCOURSE CONSTRAINTS ON THEME SELECTION
251
pelplatz Plaza San Francisco/Plaza Nueva; (The Avenida de la Constitucion begins in the south at the Puerta de Jerez junction and leads into the double square Plaza San Francis co/Plaza Nueva in the north.) (06) Hier liegt auch das Geschäftsviertel um die Haupteinkaufsstraße Calle Sierpes. (Here also the shopping centre around the main shop ping street, Calle Sierpes, is situated.) (07) Südlich des engeren Zentrums erstrecken sich der Park Parque de Maria Luisa und das Weltausstellungsgelände von 1929, die Plaza de Espana. (South of the immediate centre the park Parque de Maria Luisa and the site of the world fair 1929, the Plaza de Espana, are located.) (08) Jenseits des Gualdalquivir sind zwei ehemals selbständige Siedlungen zu abendlichen und nächtlichen Anlaufad ressen avanciert: das volkstümliche Barrio de Triana auf Höhe des Zentrums und, südlich anschließend, das neuzeitlichere Barrio de los Remedios auf Höhe des Parque de Maria Luisa. (Beyond the Guadalquivir two formerly independent settlements have developed into places to go to in the evenings and at night: the traditional Barrio de Triana, which is on a level with the centre and, bordering on this area in the south, the more modern Barrio de los Remedios, which is on a level with the Parque de Maña Luisa.) The sentence themes 3 in this text constantly are ideational elements real ised as subject theme ((01) and (05)), subject complement (02), or circum stantials ((03), (04), (06), (07), and (08)). In terms of semantic categories, these themes are participants ((01), (05) and (02)), or circumstances (time & place) ((03), (04), (06), (07), and (08)). Before analysing the text in more detail, consider the thematic choice in another example. The second text is argumentative, a satirical article published in the commentary part of a German newspaper: Text 2: "Nostalgiekarte Jahrgang 1992" (Nostalgia map of the year 1992) (From: Saarbrücker Zeitung, December 14./15. 1991, p.5) (01) So war die politische Geographie einmal zu fernen Zeiten. (This is how the political geography used to be a long time ago.) (02) Deutschland noch nicht vereint, (Germany not yet united,) (03) der Saar-Lor-Lux-Raum ein weißer Fleck auf der Landkarte. (the Saar— Lor-Lux region a blank area on the map.) (04) Zu fernen Zeiten? (A long time ago?) (05) Mitnichten!!! (Far from it!!!) (06) Die oben abgebildete Deutschlandkarte fin det sich im neuen Taschen-Terminkalender 1992 der Sparkasse Saarbrücken. (The map of Germany shown above is published in the new pocket diary 1992 of the savings bank of Saarbrücken.) (07) Dort hat man noch nicht mitbekommen, (There no-one has yet 3
We have not applied our theme analysis to dependent clauses since in most cases, theme in dependent clauses is more or less grammaticalised (typically. realised by elements such as conjunctions or wh-elements), i.e., there is no real choice regarding what can appear in theme position. For a few further remarks on theme in dependent clauses see Steiner & Ramm (1995:75ff).
252
WIEBKE RAMM
noticed (08) daß Deutschland um die Kleinigkeit von fünf neuen Bundesländern größer geworden ist. (that Germany has grown by the trifling amount of five new
Bundesländer.)
(09) Zudem gibt es jenseits der alten DDR-Grenze noch andere Städte als Leipzig und Berlin, so zum Beispiel Rostock, Dresden, Magdeburg oder Saarbrückens Partnerstadt Cottbus. (Moreover, there are still other cities beyond the former frontier to the GDR apart from Leipzig and Berlin, for example Rostock, Dresden, Magdeburg, or Cottbus, the twin city of Saarbrücken.)
(10) Außerdem scheint den Herren von der Sparkasse entgan
gen zu sein, (Besides, it seems that the gentlemen of the savings bank didn't realise) (11) daß am Ende des Terminello-Jahres 1992 der europäische Binnenmarkt steht. (that at the end of the Terminello-Year
1992 the Single European Market will come into force.)
(12) Nicht zuletzt vermittelt das Kärtchen den Eindruck, (Last but not least the little map suggests,) (13) daß Saarbrücken der Nabel (Alt-)Deutschlands zu sein scheint. (that Saarbrücken was the navel of (the former) Germany.)
(14) Je nach anatomischer Sichtweis,
kann es aber auch ein anderes Körperteil sein, (depending on the anatomical point of view, however, it might also refer to another part of the body.)
Here we have a clear priority of ideational themes in the first part of the text (propositions (02), (03), (06) and (07)), whereas the rest of the text is dominated by textual themes, as in (09), (10) and (12). The question is now, how theme selection in text is motivated and whether the differences between the two texts are typical for the respective text types they belong to. As Fries (1983) shows for English texts, different kinds of theme selection patterns correlate both with different text types or genres and are closely related to the subject matter or domain of the text. In particular, there is a close relation between thematic content, i.e., the semantic content of the themes of a text segment, and the method of development of a text which comprises general organisations such as spatial, temporal, general to specific, object to attribute, object to parts or compare and contrast. As also Danes (1974:113) points out, theme plays a decisive constructional role for building up the structure of a text. Note that the method of development is not the same as Danes' thematic progression: the former relates to the semantic content of the grammatical themes and the relations holding between the themes, whereas the latter refers to possible types of patterns built between themes and rhemes of a text. Turning back to our sample texts, the most characteristic feature of Text 1 is its reflection of the spatial organisation of the underlying domain. This is a typical property of many descriptive texts and in this case leads to the incremental creation of a cognitive map of the domain, 'centre of Seville'. The centrality of the domain structure for the construction of the mean ing of the text is mirrored in its linguistic appearance, also with respect to
DISCOURSE CONSTRAINTS ON THEME SELECTION
253
thematic choice: all sentence themes in this text refer to spatial conceptu alisations which are inherently ideational, with a clear difference regarding linguistic realisation between object concepts (realised semantically as par ticipant themes as in (01), (02) and (05)) and (spatial) relational concepts (realised as circumstance themes (as in (03), (04), (06), (07) and (08)). As a result, the sequence of concepts verbalised as themes allows the reader of the text to navigate through a cognitive map of the domain by keeping to a strict spatial method of development In each of the clauses the rhematic part (which includes the focus) elaborates on the specific spatial concept in troduced as theme, i.e., adds certain attributes in the form of other spatial concepts in order to build up a spatial representation of the domain. What can be observed here is a typical 'division of labour' between theme and rheme, namely that the themes play the decisive constructional role by in troducing new domain concepts, whereas the foci, contained in the rhemes, add new pieces of information. In terms of subject matter, the second text basically deals with spatial information, too, but here the domain is not the main factor responsible for the structuring of the text. In this case, the underlying main discourse purpose is not to inform the reader about some state of affairs as in the descriptive example, but rather to argue in favour of some opinion taken by the author. This is clearly reflected in the linguistic structure of the text: propositions (01) - (06) represent the contra-argumentation, in the sense of providing the facts/arguments against which the author is going to argue. The task of this discourse segment is to present the background information on which the subsequent pro-argumentation ((07) - (14)) in which the author develops her/his opinion is based. The different commu nicative functions of these two stages of the macro structure of the text are also reflected in the means deployed for local contextualisation (i.e., thematisation): ideational elements referring to relevant concepts of the do main predominate in the contra-argumentation, whereas textual elements are chosen to guide the local contextualisation in the pro-argumentation. In this second segment of the text, a sequence of conjunctive themes func tions as the indicator of an (additive) argumentative chain formed by the rhematic parts of the respective sentences: 'zudem' (09) — 'außerdem' (10) — 'nicht zuletzt' (12)). To sum up our text analyses, in the descriptive text we have found a clear, text-type specific correlation between the structure of the domain and the method of development of the text (realised by ideational themes). The argumentative text, in contrast, exhibited two characteristic thematisation
254
VVIEBKE RAMM
strategies, one constructing the state of affairs under discussion and one supporting the chain of argumentation. So, what these sample analyses show is not only that text type and subject matter constrain theme options, but that the theme pattern is also sensitive to the individual stages of the macro structure (or generic structure) of a text. 3
Theme selection as interstratal constraints
How can such types of correlations between discourse features and sentencelevel realisation be accounted for? Correlations between the discourse char acteristics of a text and lexico-grammatical features such as the ones illus trated in the previous section can be straightforwardly employed for gener ation in an architecture that recognises the information types of text type and subject matter as necessary constraints on the well-formedness of a text. One such architecture is implemented in the systernic-functionally ori ented -PENMAN text generation system (cf. Teich & Bateman 1994, Bateman & Teich 1995), a German spin-off of the English PENMAN system (cf. Mann 1983). The system architecture reflects the stratifìcational organisation of the language system presupposed by systemic-functional theory, according to which a linguistic utterance is the result of a complex choice process which recursively selects among options provided by interconnected networks of semantic, grammatical and lexical choice systems associated with different levels of abstraction, strata, such as lexico-grammar, (sentence-)semantics, register and genre (cf. again Matthiessen & Bateman 1991 for an overview). Features of the text type are represented at the most abstract strata of genre and register (encoding the contexts of culture and situation). The typical structural configuration of the texts of a genre, i.e., their typical (global) syntagmatic organisation, is accounted for by representing their so-called generic structure potential, (GSP) (cf. Hasan 1984). A GSP consists of those stages that must occur in the development of a text in order to classify it as belonging to that specific genre. These stages roughly correlate with what is called 'macrostructures' in other approaches (cf. van Dijk 1980). Linguistic resources at all strata are represented as system networks which constitute multiple inheritance hierarchies consisting of various linguistic types. Proceeding from such an architecture, the correlation between text type and theme selection can be conceived of as a set of inter-stratal constraints between the global-level textual resource and the lexico-grammatical re-
DISCOURSE CONSTRAINTS ON THEME SELECTION
255
source which is mediated via a semantic stratum of a local-level textual resource that abstracts from the purely grammatical distinctions provided by the grammar. The representation of such inter-stratal constraints follows the lines presented in Teich & Bateman (1994): At the level of genre, a typo logy of texts is modelled as a system network (based on Martin 1992:560ff.) which covers various descriptive, expository, narrative and argumentative types of texts. Typical GSP structures are associated with individual genres providing the guideline for syntagmatic realisation in the form of global dis course structures. Moreover, depending on the specific communicative func tions pursued, either whole texts or single GSP stages are characterised by three metafunctionally distinct register parameters, namely field (referring to ideational properties, for instance, of the subject matter), tenor (describ ing the interpersonal relations among the participants in the discourse) and mode (the textual dimension, characterising the medium or channel of the language activity). Choices at the stratum of register have characteristic consequences on the lexico-grammatical level, i.e., lead to selections on the lower-level resources of the language system realising the higher ones by appropriate lexical and grammatical means.
Fig. 1: Theme selection as interstratal constraint This architecture also gives room for modelling aspects of discourse con straints on thematisation such as those addressed in this paper: proper ties of the domain or subject matter can be accounted for by the choice
256
WIEBKE RAMM
of appropriate field options at register level which are reflected at the ideational-semantic stratum as specific conceptual configurations (the do main model) with clear mappings defined for lexical and grammatical in stantiation (covered by the ideational-semantic resource, the 'upper model', cf. Bateman et al. 1990). Global thematisation strategies have to be ad dressed at register level as well and are paradigmatically reflected on the individual GSP stages for which a certain method of development holds. The choice of a certain method of development for (a stage of) a text con strains the options at the textual-semantic 4 and textual-grammatical level. For a (simplified) illustration of how this might work, for instance, with respect to a descriptive texts with spatial method of development — say, a travel guide (or a section from it) — see Figure 1. Two kinds of operations support the control of thematisation: The real isation operation of preselection takes as arguments a function inserted at a higher stratum (e.g., a stage inserted in the discourse structure) and a fea ture of a system at a lower stratum (e.g., a feature of the SEMANTIC-THEME system (cf. Ramm et al. 1995)). In the figure, inter-stratal preselection is marked by the arrow between (1) and (2). The chooser/inquiry interface (Mann 1983) is used to interface lexico-grammar and semantics (denoted in Figure 1 by the arrow between (2) and (3)). Each system at the lexicogrammatical stratum is equipped with a number of inquiries that are or ganised in a decision tree (a chooser). The inquiries are implemented to access information from the higher adjacent stratum (here: the local-level textual resource). The inquiries of the chooser of the lexico-grammatical system T H E M E - T Y P E , e.g., must be provided with information about se mantic theme selection in order to decide whether to generate a circum stance (for instance, as a prepositional phrase) or a participant theme (e.g., as a nominal phrase). 4
Conclusions
What we have tried to illustrate in this paper is how discourse paramet ers such as text-type and subject-matter can effect thematisation as one of the grounding devices of language. We have described aspects of local and global thematisations in terms of a system-functionally oriented framework that also underlies an implementation in a text generation system. We have suggested to model correlations between text-level and sentence4
Due to lack of space, we cannot go into details regarding this stratum here. For its motivation and description, see Erich Steiner's contribution in Ramm et al. 1995:36ff.
DISCOURSE CONSTRAINTS ON THEME SELECTION
257
level discourse features as interstratal constraints holding between different levels of the language system. The approach as it is now is certainly still limited, since the mechanisms currently deployed are quite strict and unflexible; they should be enhanced, for instance, by a better micro planning. However, although we could only very roughly sketch our ideas here, we feel t h a t they could provide a step towards closing the generation gap between global and local text planning. Acknowledgements. Most of the research described in this paper was done in the context of the Esprit Basic Research Project 6665 DANDELION. I am grateful to Elke Teich for her extensive feedback and support both with previous versions of this paper and with the implementation. I would also like to thank Claudia Villiger for providing the text corpus and the analyses on which this work is grounded. Last but not least, thanks are due to Erich Steiner for helping with the English — with full responsibility for still existing weaknesses remaining with the author, of course. REFERENCES Bateman, John Α., R. T. Kasper, J. D. Moore & R. A. Whitney. 1990. "A General Organization of Knowledge for Natural Language Processing: the PENMAN Upper Model". Technical Report (ISI/RS-90-192). Marina del Rey, Calif.: Information Science Inst., Univ. of Southern California. & E. Teich. 1995. "Selective Information Presentation in an Integrated Publication System: an Application of Genre-Driven Text Generation". In formation Processing and Management 31:5. 753-767. Danes, Frantisek. 1974. "Functional Sentence Perspective and the Organization of the Text". Papers on Functional Sentence Perspective ed. by F. Danes, 106-128. Prague: Academia. Fries, Peter H. 1983. "On the Status of Theme in English: Arguments from Discourse". Micro and Macro Connexity of Discourse ed. by J. S. Petöfi & E. Sözer (Papiere zur Textlinguistik; Bd. 45). 116-152. Hamburg: Buske. Halliday, Michael A. K. 1994. An Introduction to Functional Grammar. 2nd edition. London: Edward Arnold. Hasan, Ruqaiya. 1984. "The Nursery Tale as a Genre". Nottingham Linguistic Circular 13. 71-192. Hoberg, Ursula. 1981. Die Wortstellung in der geschriebenen deutschen Gegen wartssprache. München: Hueber. Lavid, Julia. 1994a. "Thematic Development in Texts". Deliv. Rl.2.1, ESPRIT Project 6665 DANDELION. Madrid: Universidad Complutense de Madrid.
258
WIEBKE RAMM 1994b. "Theme, Discourse Topic, and Information Structuring". De liverable R1.2.2b, ESPRIT Project 6665 DANDELION. Madrid: Universidad Complutense de Madrid.
Mann, William . 1983. "An Overview of the PENMAN Text Generation System". Proceedings of the National Conference on Artificial Intelligence (83), 261-265. Martin, James R. 1992. English Text: System and Structure. Philadelphia: John Benjamins.
Amsterdam &
Matthiessen, Christian M. I. M. 1988. "Semantics for a Systemic Grammar: the Chooser and Inquiry Framework". Linguistics in a Systemic Perspect ive ed. by J. D. Benson, M. Cummings & W. S. Greaves. Amsterdam & Philadelphia: John Benjamins. & J. Α. Βateman. 1991. Text Generation and Systemic-Functional Linguist ics: Experiences from English and Japanese. London: Frances Pinter. Forthcoming. Lexicogrammatical Cartography: English systems. Technical Report, Dept. of Linguistics. Sydney: University of Sydney. Meteer, Marie W. 1992. Expressibility and the Problem of Efficient Text Planning. London: Pinter. Ramm, Wiebke, A. Rothkegel, E. Steiner & . Villiger. 1995. "Discourse Gram mar for German". Deliverable R2.3.2, ESPRIT Project 6665 DANDELION. Saarbrücken: University of the Saarland. Steiner, Erich & W. Ramm. 1995. "On Theme as a Grammatical Notion for German". Functions of Language 2:1. 57-93. Teich, Elke & J. A. Bateman. 1994. "Towards the Application of Text Genera tion in an Integrated Publication System". Proceedings of the 7th Interna tional Workshop on Natural Language Generation, 153-162. Kennebunkport, Maine. van Dijk, Teun Α. 1980. Macro structures: An Interdisciplinary Study of Global Structures in Discourse, Interaction and Cognition. Hillsdale, New Jersey: Erlbaum. Villiger, Claudia. 1995. "Theme, Discourse Topic, and Information Structuring in German Texts". Deliverable R1.2.2c, ESPRIT Project 6665 DANDELION. Saarbrücken: University of the Saarland.
Discerning Relevant Information in Discourses Using TFA G E E R T - J A N M. K R U I J F F 1 & J A N SCHAAKE
University of Twente Abstract When taking the stance that discourses are intended to convey in formation, it becomes important to recognise the relevant informa tion when processing a discourse. A way to analyse a discourse with regard to the information expressed in it, is to observe the TopicFocus Articulation. In order to distinguish relevant information in particularly a turn in a dialogue, we attempt to establish the way in which the topics and foci of that turn are structured into a story line. In this paper we shall come to specifying the way in which the information structure of a turn can be recognised, and what relevant information means in this context. 1
Introduction
Discourses, whether written or spoken, are intended to convey information. Obviously, it is important to the processing of discourses that one is able to recognise the information that is relevant. The need for a criterion for relev ance of information arises out of the idea of developing a tool assisting in the extraction of definitions from philosophical discourses (PAPER/HCRAESprojects). A way to analyse a discourse with regard to the information ex pressed in it, is to observe the Topic-Focus Articulation. A topic of (part of) a discourse can be conceived of as already available information, to which more information is added by means of one or more foci. Several topics and foci of a discourse are organised in certain structures, characterised by a thematical progression ('story-line'). The theories about TFA and them atic progression have been developed by the Prague School of Linguistics. Particularised to our purposes, in order to discern the relevant information in a discourse, we try to establish the thematic progression(s) in a turn of a dialogue. It will turn out that it is important, not only how topics and foci relate to each other with regard to the thematic progression (sequentially, parallelly, etc.), but also how the topics and foci are related rhetorically (e.g. by negation). In this paper we shall come to specifying the way in which 1
Currently at the Dept. of Mathematics and Physics, Charles University, Prague,
260
GEERT-JAN M. KRUIJFF & JAN SCHAAKE
the information structure of a turn can be recognised, and what relevant information means in this context. In order to develop and to test these definitions we regarded it necessary to choose a domain of small texts where discerning relevant information is also needed. This domain we found in the SCHISMA project. The SCHISMA project is devoted to the development of a theatre information and booking system. One of the problems to be met in analysing dialogues is to discern what exactly is or are the point(s) made in a turn of the client. As we will see below, in one turn a client may make just one relevant remark, the rest being noise or background information that is not relevant to the system. It may also be the case that two or more relevant points are made in just one turn. These points have to be discerned as being both relevant. Throughout the paper examples of the occurrence of relevant information in a turn will be given. In sections 2 and 3, Thematic Progression and Rhetorical Structure Theory will be applied to dialogues taken from the SCHISMA corpus. In section 4, relevant information will be related to what will be called generic tasks; tasks that perform a small function centred around the goal of acquiring a specific piece of information (Chandrasekaran 1986). Conclusions will be drawn in the final section. 2
The communication of information
Surely, it might almost sound like a commonplace that a dialogue conveys, or communicates, information2 But what can we say about the exact features of such communication? If we want to a logical theory of information to be of any use, we should elucidate how we arrive at the information we express in information states (Van der Hoeven et al. 1994). Such elucidation is the issue of the current section. The assumption we make about the dialogues to be considered is that they are coherent. Rather than being a set of utterances bearing no relation to each other, a dialogue -by the assumption- should have a 'story line'. For example, the utterances can therein be related by referring to a common topic, or by elaborating a little further upon a topic that was previously introduced. More formally, we shall consider utterances to be constituted of a Topic and Focus pair. The Topic of an utterance stands for given information, while the Focus of an utterance stands for new information. 2
Supposed that the dialogue is meant be purposeful, of course. Otherwise, they are called 'parasitic' with respect to communicative dialogues (cf. Habermas).
DISCERNING RELEVANT INFORMATION
261
The theory of the articulation of Topic and Focus (TFA) has been developed by members of the Modern Prague School, notably by Hajicová (Hajicová 1993; Hajicová 1994). Consequently, the 'story line' of a dialogue becomes describable in terms of relations between Topics and Foci. The communication of information thus is describable in terms of how given information is used and new information is provided. The relations between Topics and Foci may be conceived of in two ways, basically: thematically, and rhetorically. The thematical way concerns basically the coreferential aspect, while the rhetorical way concerns the functional relationship between portions of a discourse. Let us therefore have a closer look at each of these ways, and how they are related to each other. First, the relations between Topics and Foci can be examined at the level of individual utterances. In that case we shall speak of thematic relations, elucidating the thematic progression. Thematic progression is a term introduced in (Danes 1979) as a means to analyse the thematic build-up of texts. We shall use it here in the analysis of the manner in which given and new information are bound to each other by utterances in a dialogue. According to Danes, there are three possibilities in which Topics and Foci are bindable, which are described as follows:
1. Sequential progression: The Focus of utterance m, Fm, is constitutive for the Topic of a (the) next utterance n, Tn.
2. Parallel progression: The Topic of utterance m, Tm, bears much similarity to the Topic of a (the) next utterance n, Tn.
3. Hypertheme progression: The Topic of utterance m, Tm, as well as the Topic of utterance n, Tn, refer to an overall Topic called the Hypertheme, TH. Utterances m and n are said to be related hyperthematically.
The following sentences are examples of these different kinds of progression:
(1) The brand of GJ's car is Trabant. The Trabant has a two-stroke engine.
(2) Trabis are famous for their funny motor-sound. Trabis are also well-known for the blue clouds they puff.
(3) Being a car for the whole family, the Trabant has several interesting features. One feature is that about every person can repair it. Another feature is that a child's finger-paint can easily enhance the permanent outlook of the car.
It might be tempting to try to determine the kind of thematic progression between utterances by merely looking at the predicates and entities involved, in other words, directly in terms of information states. Especially sentences like (1) and (2) tend to underline such a standpoint. However, consider the following revision of (1), named (1'):
(1') GJ has a Trabant. The motor is a cute two-stroke engine.
Similar to (1) we would like to regard (1') as a sequential progression. Yet, if we would consider only predicates and entities, we would not be able to arrive at that preferred interpretation. It is for that reason that we propose to determine the kind of thematic progression obtaining between two utterances as follows. Instead of discerning whether the predicates and entities of a Topic Tm or a Focus Fm are the same as those of a Topic Tn, we want to establish whether Fm or Tm and Tn are coreferring. We take coreference to mean that two expressions, E1 and E2, (a) are referring to the same concept, or (b) are referring to a conceptual structure, where E1 is referring to a concept CE1 which is the parent of a concept CE2, to which E2 is referring. Hence, the following relations hold3:
1. Fm and Tn are coreferring → sequential progression
2. Tm and Tn are coreferring → parallel progression
3. TH, Tm and Tn are coreferring → hypertheme progression
By identifying a coreference obtaining between a focus or topic and a subsequent topic, we conclude that such a pair has the same intensional content — they are about the same concept. Under the assumption that a concept is only instantiated once in a turn, we could even conclude further here that
3 The presented ideas about thematic progression and coreference result from discussions between Geert-Jan Kruijff and Ivana Korbayová.
the focus or topic and subsequent topic have the same referential content — they refer to the same instantiation of the concept at hand. Clearly, if we would lift the assumption of single instantiation, it would be necessary to establish whether the instantiations of the concept employed in the expressions are identical.
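To make the three binding patterns concrete, here is a minimal sketch in Python (ours, not part of the SCHISMA system; the coreference test, the concept names and the parent table are all simplifying assumptions):

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    topic: str   # concept referred to by the Topic (given information)
    focus: str   # concept referred to by the Focus (new information)

def corefer(e1, e2, parent_of):
    """Two expressions corefer if they name the same concept, or if one names
    the parent concept of the concept named by the other (condition (b) above,
    applied symmetrically for simplicity)."""
    return e1 == e2 or parent_of.get(e2) == e1 or parent_of.get(e1) == e2

def progression(m: Utterance, n: Utterance, hypertheme, parent_of):
    """Classify the thematic link between utterance m and the next utterance n."""
    if corefer(m.focus, n.topic, parent_of):
        return "sequential"     # F_m is constitutive for T_n
    if corefer(m.topic, n.topic, parent_of):
        return "parallel"       # T_m and T_n corefer
    if hypertheme and corefer(hypertheme, m.topic, parent_of) \
            and corefer(hypertheme, n.topic, parent_of):
        return "hypertheme"     # both topics refer to the Hypertheme T_H
    return "none"

# Example (1): "The brand of GJ's car is Trabant. The Trabant has a two-stroke engine."
parents = {"trabant": "car"}
u1 = Utterance(topic="car-of-GJ", focus="trabant")
u2 = Utterance(topic="trabant", focus="two-stroke-engine")
print(progression(u1, u2, None, parents))   # -> sequential
```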
3 Rhetorical structure of turns
For our purposes we establish the thematic progression between a number of utterances making up a single turn in a dialogue. As we already noted, utterances can also be related rhetorically besides thematically. Whereas the thematic progression shows us how information is being communicated by individual utterances, the rhetorical structure elucidates how parts of the communicated information function in relation to other parts of information communicated within the same turn. In other words, the rhetorical structure considers the function of the information communicated by clusters of one or more utterances of a single turn. Such clusters will be called segments hereafter. When performing an analysis in order to explicate the rhetorical structure, we make use of Mann and Thompson's Rhetorical Structure Theory (RST) as laid down in Mann & Thompson (1987). Basically, RST enables us to structure a turn into separate segments that are functionally related to each other by means of so-called rhetorical relations. It is important that rhetorical relations hold between segments, and that each segment in a rhetorical relation has an import relative to the other segment(s). Basically, two kinds can therein be distinguished: a nucleus N, and a satellite S. The distinction between them can be pointed out as follows. A nucleus is defined as a segment that serves as the focus of attention. A satellite is a segment that gains its significance through a nucleus. The concept of nuclearity is important to us: we would still have a coherent dialogue if we would consider the nuclei only. In our understanding, nuclearity is thus an expressive source that directs the response to a turn of a dialogue. Examples of such rhetorical relations are:
(4) Segment S is evidence for segment N:
(N) The engine of my car works really well nowadays. (S) It started yesterday within one minute.
(5) Segment S provides background for segment N:
(S) I spend a significant part of the year in Prague. (N) Nowadays, I am the proud owner of a Trabant.
(6) Segment S is a justification for segment N:
(S) When parking a little carelessly, I broke one of the rear lights. (N) I should buy a new rear light.
A study of a corpus of dialogues we have gathered reveals that within our domain the following rhetorical relations are of importance:
1. Solutionhood: S provides the solution for N; "Yes, but grandma is a little cripple, so, well, then we'll go with the two of us."
2. Background: S provides background for N; "I would like to go to an opera. Is there one on Saturday?"
3. Conditional: S is a condition for N; "If the first row is right opposite to the stage, then the first row, please."
4. Elaboration: S elaborates on N; "I would like to go to Wittgenstein, because he was really entertaining last time."
5. Restatement: S restates or summarises N; "So I have made a reservation for ..."
6. Contrast: Several N's are contrasted; "I would like to, but my friend does not. So, then we'd better not go to an opera; can we go to another performance?"
7. Joint: Several N's are joined; "How expensive would that be, and are there still vacant seats?"
In the case of rhetorical relations 1 through 3 the S is uttered after N, while in the case of relations 4 and 5 the S is uttered before N. Relations 6 and 7 are constituted by multiple nuclei. These orders are called canonical orders. Revisiting the thematic and rhetorical structure of a turn in a dialogue, we observe the following. The established thematic progression elucidates the actual flow of communicated information. Therein, we can observe which utterances convey what information. The rhetorical structure clarifies how information expressed by nuclei and satellites is functionally related to other such information. Clearly, the question that might be raised subsequently is: how does the segmentation of a turn into nuclei and satellites arise from the thematic progression? To answer the question, we should realise that we are actually dealing with three smaller problems. First, how does a thematic progression segment a turn? A thematic progression divides a turn into discernible segments according to the flow of information. Intuitively, one might say that every time a new flow of information is commenced, a new segment
is introduced. As we shall see in the example provided below, this means in general that when a parallel progression or hypertheme progression is invoked, a new segment starts. Second, how do we recognise the rhetorical relations involved? Mann and Thompson describe how rhetorical relations can be recognised by means of conditions (or constraints) that should hold for the textual structure. We conjecture that, in terms of our approach, rhetorical relations can be recognised by taking the thematic progression and the formed conceptual structure into account. Rephrased, rhetorical relations are conditioned by the thematic progression and the conceptual structure involved. Once the rhetorical relation has been recognised, the third problem of recognising nuclei and satellites is also solved (as Mann and Thompson state), for their characterisations follow inter alia from the canonical order of each rhetorical relation.
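Before turning to the example, here is a small sketch (ours; the relation names follow the inventory above, the segments and all other names are invented for illustration) of how segments, rhetorical relations and the nuclei that direct the response might be represented:

```python
from dataclasses import dataclass

# Canonical orders as described above (relation name -> where the satellite goes).
SATELLITE_AFTER_NUCLEUS = {"solutionhood", "background", "conditional"}
SATELLITE_BEFORE_NUCLEUS = {"elaboration", "restatement"}
MULTI_NUCLEAR = {"contrast", "joint"}

@dataclass
class Segment:
    name: str
    text: str

@dataclass
class Relation:
    kind: str
    nuclei: list
    satellites: list

def nuclei_of(relations):
    """Collect the nuclei of a turn; keeping only these should still yield a
    coherent (if terse) contribution."""
    seen, result = set(), []
    for rel in relations:
        for seg in rel.nuclei:
            if seg.name not in seen:
                seen.add(seg.name)
                result.append(seg)
    return result

# Example (5): a background relation.
n = Segment("N", "Nowadays, I am the proud owner of a Trabant.")
s = Segment("S", "I spend a significant part of the year in Prague.")
print([seg.name for seg in nuclei_of([Relation("background", [n], [s])])])   # -> ['N']
```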
4 An example
Here, we provide an example analysis of a turn into thematic progression and ensuing rhetorical structure. As will become obvious from the example, recognising the thematic progression as well as the rhetorical structure enables us to observe which parts of a turn are to be considered as relevant. The issue of discerning relevance will be elaborated upon in the next section.
(7)
For Wittgenstein tonight it is, yes. For four persons is fine. But the other one doesn't know. And because it is his birthday we would like to have our picture taken. Can you ask that too? Oh yes, and my husband would like to join us for dinner if that would be possible. No foreign stuff. So that is for three. Are you also in charge of the food?
Assuming that we have decent means to analyse the dialogue linguistically, let us commence with discerning the thematic progression. The schema displays sequential progressions (\seq) and parallel progressions (||par) — see Figure 1. T3 and T3 refer hyperthematically to F3, being "(members of) the group that is going to the performance", but we shall not consider this in the case at hand. More interesting to observe is that the thematic progression quite naturally segments the turn of the dialogue, as we conjectured. Let us call the three segments ST1, ST3 and ST6, the subscript denoting the Topic that initiates the segment. Subsequently, the segments can be said — quite uncontroversially, hopefully — to be rhetorically related as shown in Figure 2.
[Figure 1 (diagram, partly lost in extraction) shows the thematic progression in (7): Topic/Focus pairs such as T1 [It] – F1 [Wittgenstein tonight], T2 (ellipsis) – [four persons], [the other one], T4 [his birthday] – F4 [picture], T5 Question, T6 [husband] – F6 [to join for dinner], T7 (ellipsis) – F7 [foreign stuff], T8 (dinner) – F8 [three (persons)], T9 Question, linked by sequential (\seq) and parallel (||par) progressions.]
Fig. 1: Thematic progression in (7)

ST1 ← [elaboration] → ST3
ST6 ← [elaboration] → ST3
Fig. 2: Segments rhetorically related (a)
Using the canonical order noted earlier, we can consequently determine the nuclei and satellites and construct the following hierarchical organisation (see Figure 3). Apparently, it suffices to maintain only the nucleus ST1 and still have a coherent and justly purposeful dialogue. As we stated already, the concept of nuclearity is important to us. It directs the response to the turn of the dialogue, which in this case could for example be that there is no performance by Wittgenstein tonight at all.
5 Discerning relevant information
The current section will explain the fashion in which we discern relevant information in a dialogue, thereby building on the previous section. First and foremost we should then clarify what we understand by relevance. When we state that a particular piece of information is relevant, we mean that it is relevant from a certain point of view.
ST1 = Nucleus
  | [elaboration]
ST3 = Satellite wrt ST1 / Nucleus wrt ST6
  | [elaboration]
ST6 = Satellite wrt ST3
Fig. 3: Segments rhetorically related (b)
We do not want to take all the information that is provided into consideration. Rather, we are looking for information that fits our purposes. And what are these purposes? Recall the discussion above, where the concept of generic tasks was introduced. Generic tasks were presented as units to carry out simple tasks, units which could be combined into an overall structure that would remain flexible due to the functional individuality of the simple tasks. These generic tasks are our 'purposes'. More specifically, when carrying out a generic task, we look among the nuclei found in the rhetorical structure for one that presents us with the information that we need for performing the task at hand. In other words, such a nucleus presents us with relevant information. For example, when carrying out the task IDENTIFY_PERFORMANCE, the following information is of importance to uniquely identify a performance: a) the name of the entertainer, the performing group, or the performance itself; b) the day (and if more performances on one day, also the time). Obviously, the nucleus ST1 is highly relevant to this task, for it provides us with both ENTERTAINER_NAME as well as PERFORMANCE_DAY. Interesting to note is that once we have such information, a proper response can be generated by the dialogue manager. For example, the system could respond that there is no performance by the entertainer on the mentioned day, or ask (in case of several performances on the same day) whether one would like to go in the afternoon or in the evening. Furthermore, things also work the other way around. As we noted earlier, a nucleus directs response. Therefore, a nucleus should also be regarded as a possibility to initiate the execution of a particular generic task.
This requires the following assumptions, though. First of all, a linguistic analysis should provide us with the concepts that are related to words or word-groups. Observe that this assumption has already been made above. Second, from each generic task it should be known which concepts are involved in the performance of that task; thus, what kinds of information it gathers. It basically boils down to the following, then: if we know the concepts involved, we should be able to identify the generic task that should be initiated to respond properly to the user. It is realistic to assume that, based on all the information the user provides, several generic tasks might be invoked. Such tasks should then be placed in an order that would appear natural to the user. We must note, though, that it will not be the case that different generic tasks will be invoked based on identical information. Each generic task is functionally independent and has a simple goal, and as such works with information that is not relevant to other generic tasks. Recapitulating, we perceive of relevance in terms of information that is needed for the performance of tasks that are functionally independent and have simple goals: the so-called generic tasks. Based on the thematic progression and the rhetorical structure, we look for information in the nuclei that we have identified. If the information found is needed for a task that is currently being carried out, or if it can be used to initiate a new task, then we consider the information to be relevant information. Clearly, our system thereby no longer organises its responses strictly according to prefixed scripts nor strictly according to a recognition of the user's intentions. Due to our use of generic tasks, integrated with our understanding of relevant information, our system carries out its tasks corresponding to the way the user provides it with information. Thus, the system is able to respond more flexibly as well as more naturally to the user.
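As an illustration of how a nucleus might trigger and feed a generic task, here is a small sketch (ours, not the SCHISMA implementation; the slot names ENTERTAINER_NAME and PERFORMANCE_DAY follow the discussion above, everything else is invented):

```python
from dataclasses import dataclass, field

@dataclass
class GenericTask:
    """A functionally independent unit with a simple information-gathering goal."""
    name: str
    needed: set                          # concept types this task works with
    slots: dict = field(default_factory=dict)

    def relevant(self, concepts):
        """The part of a nucleus' content this task can actually use."""
        return {c: v for c, v in concepts.items() if c in self.needed}

    def consume(self, concepts):
        self.slots.update(self.relevant(concepts))

def tasks_triggered(nucleus_concepts, tasks):
    """A nucleus initiates every generic task that can use some of its concepts."""
    return [t for t in tasks if t.relevant(nucleus_concepts)]

identify = GenericTask("IDENTIFY_PERFORMANCE",
                       needed={"ENTERTAINER_NAME", "PERFORMANCE_DAY", "PERFORMANCE_TIME"})
reserve = GenericTask("MAKE_RESERVATION", needed={"NUMBER_OF_SEATS"})

# Concepts extracted from nucleus ST1: "For Wittgenstein tonight it is, yes."
nucleus = {"ENTERTAINER_NAME": "Wittgenstein", "PERFORMANCE_DAY": "tonight"}
for task in tasks_triggered(nucleus, [identify, reserve]):
    task.consume(nucleus)
    print(task.name, task.slots)   # only IDENTIFY_PERFORMANCE is triggered
```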
6 Conclusions
In this paper we stated that the information we are basically interested in is relevant information, and we provided the means by which one can arrive at relevant information. For that purpose, we discussed the Prague concepts of Topic and Focus Articulation (TFA) and thematic progression, the structure in which Topics and Foci get organised. Subsequently, we examined rhetorical structures in the light of Rhetorical Structure Theory, and showed how the rhetorical structure of a turn builds on the turn's thematic progression. We identified genuine nuclei in a rhetorical structure to be
potential providers of relevant information. That is, information that a currently running generic task would need or that could initiate a generic task. We closed our discussion by noting how this leads to a system that is capable of responding to a user in a flexible and natural way.
A couple of concluding remarks can be made. First of all, in the discussion we do not treat thematic progressions spanning more than one turn. Currently, thematic progressions and thus rhetorical structures are bound to single turns of a dialogue. We intend to lift this restriction after examining how we can completely integrate our logical theory of information with the views presented here. Second, we would like to elaborate on how the mechanisms described here would fit into a dialogue manager that parses dialogues on the level of generic tasks. Regarding the segmentation of discourses and its relation to the dynamics of the communication of information, a topic for further research could be to compare our point of view to that of Firbas' Communicative Dynamism as described in (Firbas 1992).
REFERENCES
Chandrasekaran, B. 1986. "Generic Tasks in Knowledge-Based Reasoning: High-Level Building Blocks for Expert System Design". IEEE Expert.
Danes, Frantisek. 1979. "Functional Sentence Perspective and the Organisation of Text". Papers on Functional Sentence Perspective ed. by F. Danes. Prague: Academia.
Firbas, Jan. 1992. Functional Sentence Perspective in Written and Spoken Communication. Cambridge: Cambridge University Press.
Hajicová, Eva. 1993. From the Topic/Focus Articulation of the Sentence to Discourse Patterns. Vilém Mathesius Courses in Linguistics and Semiotics, Prague.
Hajicová, Eva. 1993. Issues of Sentence Structure and Discourse Patterns. Prague: Charles University.
Hajicová, Eva. 1994. Topic/Focus Articulation and Its Semantic Relevance. Vilém Mathesius Courses in Linguistics and Semiotics, Prague.
Hajicová, Eva. 1994. "Topic/Focus and Related Research". Prague School of Structural and Functional Linguistics ed. by Philip A. Luelsdorff, 245-275. Amsterdam & Philadelphia: John Benjamins.
Van der Hoeven, G.F., T.A. Andernach, S.P. van de Burgt, G.J.M. Kruijff, A. Nijholt, J. Schaake & F.M.G. de Jong. 1994. "SCHISMA: A Natural Language Accessible Theatre Information and Booking System". TWLT 8:
Speech and Language Engineering ed. by L. Boves, A. Nijholt, 137-149. Enschede: Twente University.
Mann, William C. & Thompson, Sandra A. 1987. Rhetorical Structure Theory: A Theory of Text Organisation. Reprint, Marina del Rey, Calif.: Information Sciences Institute.
IV GENERATION
Approximate Chart Generation from Non-Hierarchical Representations
NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE
Dept. of Artificial Intelligence, University of Edinburgh
Abstract
This paper presents a technique for sentence generation. We argue that the input to generators should have a non-hierarchical nature. This allows us to investigate a more general version of the sentence generation problem where one is not pre-committed to a choice of the syntactically prominent elements in the initial semantics. We also consider that a generator can happen to convey more (or less) information than is originally specified in its semantic input. In order to constrain this approximate matching of the input we impose additional restrictions on the semantics of the generated sentence. Our technique provides flexibility to address cases where the entire input cannot be precisely expressed in a single sentence. Thus the generator does not rely on the strategic component having linguistic knowledge. We show clearly how the semantic structure is declaratively related to linguistically motivated syntactic representation. We also discuss a semantic-indexed memoing technique for non-deterministic, backtracking generators.
1 Introduction
Natural language generation is the process of realising communicative intentions as text (or speech). The generation task is standardly broken down into the following processes: content determination (what is the meaning to be conveyed), sentence planning1 (chunking the meaning into sentence-sized units, choosing words), surface realisation (determining the syntactic structure), morphology (inflection of words), and synthesising speech or formatting the text output. In this paper we address aspects of sentence planning (how content words are chosen but not how the semantics is chunked in units realisable as sentences) and surface realisation (how syntactic structures are computed). We thus discuss what in the literature is sometimes referred to as tactical generation, that is "how to say it" — as opposed to strategic generation
1 Note that this does not involve planning mechanisms!
— "what to say". We look at ways of realising a non-hierarchical semantic representation as a sentence, and explore the interactions between syntax and semantics. Before giving a more detailed description of our proposals first we mo tivate the non-hierarchical nature of the input for sentence generators and review some approaches to generation from non-hierarchical representations — semantic networks (Section 2). We proceed with some background about the grammatical framework we will employ — D-Tree Grammars (Section 3) and after describing the knowledge sources available to the generator (Sec tion 4) we present the generation algorithm (Section 5). This is followed by a step by step illustration of the generation of one sentence (Section 6). We then discuss further semantic aspects of the generation (Section 7), the memoing technique used by the generator (Section 8) and the implement ation (Section 9). We conclude with a discussion of some issues related to the proposed technique (Section 10). 2
2 Generation from non-hierarchical representations
The input for generation systems varies radically from system to system. Many generators expect their input to be cast in a tree-like notation which enables the actual systems to assume that nodes higher in the semantic structure are more prominent than lower nodes. The semantic representations used are variations of a predicate with its arguments. The predicate is realised as the main verb of the sentence and the arguments are realised as complements of the main verb — thus the control information is to a large extent encoded in the tree-like semantic structure. Unfortunately, such dominance relationships between nodes in the semantics often stem from language considerations and are not always preserved across languages. Moreover, if the semantic input comes from other applications, it is hard for these applications to determine the most prominent concepts because linguistic knowledge is crucial for this task. The tree-like semantics assumption leads to simplifications which reduce the paraphrasing power of the generator (especially in the context of multilingual generation).2 In contrast, the use of a non-hierarchical representation for the underlying semantics allows the input to contain as few language commitments as possible and makes it possible to address the generation strategy from an unbiased position. We have chosen a particular type of a non-hierarchical knowledge representation formalism, conceptual graphs (Sowa 1992), to represent the input to
2 The tree-like semantics imposes some restrictions which the language may not support.
our generator. This has the added advantage that the representation has well-defined deductive mechanisms. A graph is a set of concepts connected with relations. The types of the concepts and the relations form generalisation lattices which also help define a subsumption relation between graphs. Graphs can also be embedded within one another. The counterpart of the unification operation for conceptual graphs is maximal join (which is non-deterministic). Figure 1 shows a simple conceptual graph which does not have cycles. The arrows of the conceptual relations indicate the domain and range of the relation and do not impose a dominance relationship.
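The following sketch (ours; the type lattice, relation names and graph are invented for illustration, and the matcher is only a crude stand-in for maximal join) conveys the flavour of this input representation:

```python
from dataclasses import dataclass, field

# A fragment of a concept type lattice: child -> parent.
SUPERTYPE = {"LIMP": "MOVE", "MOVE": "ACT", "PERSON": "ANIMATE", "QUICK": "MANNER"}

def subsumes(general, specific):
    """True if `general` equals or is a supertype of `specific` in the lattice."""
    t = specific
    while t is not None:
        if t == general:
            return True
        t = SUPERTYPE.get(t)
    return False

@dataclass
class Graph:
    concepts: dict                                  # node id -> concept type
    relations: set = field(default_factory=set)     # (relation, from, to) triples

def matches(pattern, graph, binding):
    """Rough pattern match: every pattern concept must subsume the bound graph
    concept, and every pattern relation must be present between the bound nodes."""
    for pid, gid in binding.items():
        if not subsumes(pattern.concepts[pid], graph.concepts[gid]):
            return False
    return all((rel, binding[a], binding[b]) in graph.relations
               for rel, a, b in pattern.relations)

# An invented graph of roughly the kind used as input: an agent limping, quickly.
g = Graph({"c1": "LIMP", "c2": "PERSON", "c3": "QUICK"},
          {("AGNT", "c1", "c2"), ("MANR", "c1", "c3")})
p = Graph({"x": "ACT", "y": "ANIMATE"}, {("AGNT", "x", "y")})
print(matches(p, g, {"x": "c1", "y": "c2"}))   # -> True
```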
Fig. 1: A simple conceptual graph
The use of semantic networks in generation is not new (Simmons & Slocum 1972; Shapiro 1982). Two main approaches have been employed for generation from semantic networks: utterance path traversal and incremental consumption.3 An utterance path is the sequence of nodes and arcs that are traversed in the process of mapping a graph to a sentence. Generation is performed by finding a cyclic path in the graph which visits each node at least once. If a node is visited more than once, grammar rules determine when and how much of its content will be uttered (Sowa 1984). It is not surprising that the early approaches to generation from semantic networks employed the notion of an utterance path — the then popular grammatical framework (Augmented Transition Networks) also involved a notion of path traversal. The utterance path approach imposes unnecessary restrictions on the resources (i.e., that the generator can look at a limited portion of the input — usually the concepts of a single relation); this imposes a local view of the generation process. In addition a directionality of processing is introduced which is difficult to motivate; sometimes linguistic knowledge is used to traverse the network (adverbs of manner are to be visited before adverbs of time); finally, stating the relation between syntax and semantics involves the notion of how many times a concept has been visited.
3 Here the incremental consumption approach does not refer to incremental generation!
Under the second approach, that of incremental consumption, generation is done by gradually relating (consuming) pieces of the input semantics to linguistic structure (Boyer & Lapalme 1985; Nogier 1991). Such covering of the semantic structure avoids some of the limitations of the utterance path approach and is also the general mechanism we have adopted (we do not rely on the directionality of the conceptual relations per se — the primitive operation that we use when consuming pieces of the input semantics is maximal join which is akin to pattern matching). The borderline between the two paradigms is not clear-cut. Some researchers (Smith et al. 1994) are looking at finding an appropriate sequence of expansions of concepts and reductions of subparts of the semantic network until all concepts have realisations in the language. Others assume all concepts are expressible and try to substitute syntactic relations for conceptual relations (Antonacci 1992). Other work addressing surface realisation from semantic networks includes: generation using Meaning-Text Theory (Iordanskaja 1991), generation using the SNEPS representation formalism (Shapiro 1989), generation from conceptual dependency graphs (van Rijn 1991). Among those that have looked at generation with conceptual graphs are: generation using Lexical Conceptual Grammar (Oh et al. 1992), and generating from CGs using categorial grammar in the domain of technical documentation (Svenberg 1994). This work improves on existing generation approaches in the following respects: (i) Unlike the majority of generators this one takes a non-hierarchical (logically well-defined) semantic representation as its input. This allows us to look at a more general version of the realisation problem which in turn has direct ramifications for the increased paraphrasing power and usability of the generator; (ii) Following Nogier & Zock (1992), we take the view that lexical choice is essentially (pattern) matching, but unlike them we assume that the meaning representation may not be entirely consumed at the end of the generation process. Our generator uses a notion of approximate matching and can happen to convey more (or less) information than is originally specified in its semantic input. We have a principled way to constrain this. We build the corresponding semantics of the generated sentence and aim for it to be as close as possible to the input semantics. (i) and (ii) thus allow for the input to come from a module that need not have linguistic knowledge; (iii) We show how the semantics is systematically related to syntactic structures in a declarative framework. Alternative processing strategies using the same knowledge sources can therefore be envisaged.
3 D-Tree Grammars
Our generator uses a particular syntactic theory — D-Tree Grammar (DTG), which we briefly introduce because the generation strategy is influenced by the linguistic structures and the operations on them. D-Tree Grammar (DTG) (Rambow, Vijay-Shanker & Weir 1995) is a new grammar formalism (also in the mathematical sense), which arises from work on Tree-Adjoining Grammars (TAG) (Joshi 1987).4 In the context of generation, TAGs have been used in a number of systems: MUMBLE (McDonald & Pustejovsky 1985), SPOKESMAN (Meteer 1990), WIP (Wahlster et al. 1991), the system reported by McCoy (1992), the first version of PROTECTOR5 (Nicolov, Mellish & Ritchie 1995), and recently SPUD (by Stone & Doran). TAGs have been given a prominent place in the VERBMOBIL project — they have been chosen to be the framework for the generation module (Caspari & Schmid 1994; Harbusch et al. 1994). In the area of grammar development TAG has been the basis of one of the largest grammars developed for English (Doran 1994). Unlike TAGs, DTGs provide a uniform treatment of complementation and modification at the syntactic level. DTGs are seen as attractive for generation because a close match between semantic and syntactic operations leads to simplifications in the overall generation architecture. DTGs try to overcome the problems associated with TAGs while remaining faithful to what is seen as the key advantages of TAGs (Joshi 1987):
1. the extended domain of locality over which syntactic dependencies are stated; and
2. function argument structure is captured within a single initial construction in the grammar.
DTG assumes the existence of elementary structures and uses two operations to form larger structures from smaller ones. The elementary structures are tree descriptions6 which are trees in which nodes are linked with two types of links: domination links (d-links) and immediate domination links (i-links) expressing (reflexive) domination and immediate domination relations between nodes. Graphically we will use a dashed line to indicate a d-link (see Figure 2). D-trees allow us to view the operations for composing trees as monotonic. The two combination operations that DTG uses are subsertion and sister-adjunction.
4 DTG and TAG are very similar, yet they are not equivalent (Weir p.c.).
5 PROTECTOR is the generation system described in this paper.
6 They are called d-trees, hence the name of the formalism.
Fig. 2: Subsertion
Subsertion. When a d-tree α is subserted into another d-tree β, a component7 of α is substituted at a frontier nonterminal node (a substitution node) of β and all components of α that are above the substituted component are inserted into d-links above the substituted node or placed above the root node of β (see Figure 2). It is possible for components above the substituted node to drift arbitrarily far up the d-tree and distribute themselves within domination links, or above the root, in any way that is compatible with the domination relationships present in the substituted d-tree. In order to constrain the way in which the non-substituted components can be interspersed, DTG uses subsertion-insertion constraints which explicitly specify what components from what trees can appear within certain d-links. Subsertion as it is defined is a non-deterministic operation. Subsertion can model both adjunction and substitution in TAG.
Fig. 3: Sister-adjunction
Sister-adjunction. When a d-tree α is sister-adjoined at a node η in a d-tree β the composed d-tree γ results from the addition to β of α as a new leftmost or rightmost sub-d-tree below η. Sister-adjunction involves the addition of exactly one new immediate domination link. In addition several
7 A tree component is a subtree which contains only immediate dominance links.
sister-adjunctions can occur at the same node. Sister-adjoining constraints associated with nodes in the d-trees specify which other d-trees can be sister-adjoined at this node and whether they will be right- or left-sister-adjoined. For more details on DTGs see (Rambow, Vijay-Shanker & Weir 1995a) and (Rambow, Vijay-Shanker & Weir 1995b).
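A very rough sketch (ours) of the two operations on toy d-trees; it deliberately ignores the redistribution of upper components, subsertion-insertion constraints and sister-adjoining constraints, so it is only meant to fix intuitions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)   # immediate domination (i-links)
    d_child: Optional["Node"] = None                # domination link (d-link)

def find(node, label):
    """Depth-first search through both i-links and d-links."""
    if node.label == label:
        return node
    for child in node.children + ([node.d_child] if node.d_child else []):
        hit = find(child, label)
        if hit is not None:
            return hit
    return None

def subsert(host, at_label, new):
    """Simplified subsertion: graft the material of `new` in at a substitution
    node of `host` (components above the substituted one are not modelled)."""
    site = find(host, at_label)
    site.children, site.d_child = new.children, new.d_child

def sister_adjoin(host, at_label, new, leftmost=False):
    """Add `new` as a new leftmost or rightmost daughter below the named node,
    i.e. exactly one new immediate domination link."""
    site = find(host, at_label)
    if leftmost:
        site.children.insert(0, new)
    else:
        site.children.append(new)

# A toy transitive skeleton and the two operations applied to it.
s = Node("S", [Node("NP0"), Node("VP", [Node("V0"), Node("NP1")])])
subsert(s, "V0", Node("V0", [Node("limp")]))            # plug in the verb
sister_adjoin(s, "VP", Node("ADV", [Node("quickly")]))  # attach a modifier
```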
4 Knowledge sources
The generator assumes it is given as input an input semantics (InputSem) and 'boundary' constraints for the semantics of the generated sentence (BuiltSem, which in general is different from InputSem8). The boundary constraints are two graphs (UpperSem and LowerSem) which convey the notion of the least and the most that should be expressed. So we want BuiltSem to satisfy: LowerSem < BuiltSem < UpperSem.9 If the generator happens to introduce more semantic information by choosing a particular expression, LowerSem is the place where such additions can be checked for consistency. Such constraints on BuiltSem are useful because in general InputSem and BuiltSem can happen to be incomparable (neither one subsumes the other). In a practical scenario LowerSem can be the knowledge base to which the generator has access minus any contentious bits. UpperSem can be the minimum information that necessarily has to be conveyed in order for the generator to achieve the initial communicative intentions. The goal of the generator is to produce a sentence whose corresponding semantics is as close as possible to the input semantics, i.e., the realisation adds as little as possible extra material and misses as little as possible of the original input. In generation similar constraints have been used in the generation of referring expressions where the expressions should not be too general so that discriminatory power is not lost and not too specific so that the referring expression is in a sense minimal. Our model is a generalisation of the paradigm presented in (Reiter 1991) where issues of mismatch in lexical choice are discussed. We return to how UpperSem and LowerSem are actually used in Section 7.
8 This can come about from a mismatch between the input and the semantic structures expressible by the generator.
9 The notation G1 < G2 means that G1 is subsumed by G2. We consider UpperSem to be a generalisation of BuiltSem and LowerSem a specialisation of BuiltSem (in terms of the conceptual graphs that represent them).
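A tiny sketch (ours) of the boundary check, treating subsumption over conceptual graphs abstractly as a supplied predicate:

```python
def within_bounds(built_sem, upper_sem, lower_sem, subsumed_by):
    """LowerSem < BuiltSem < UpperSem, where subsumed_by(a, b) holds when
    graph a is subsumed by (is at least as specific as) graph b."""
    return subsumed_by(lower_sem, built_sem) and subsumed_by(built_sem, upper_sem)

def addition_licensed(added_sem, lower_sem, subsumed_by):
    """Extra material introduced by a mapping rule must be covered by the
    lower semantic bound (e.g. the knowledge base minus contentious bits)."""
    return subsumed_by(lower_sem, added_sem)
```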
Fig. 4: A mapping rule for transitive constructions
4.1 Mapping rules
Mapping rules state how the semantics is related to the syntactic representation. We do not impose any intrinsic directionality on the mapping rules and view them as declarative statements. In our generator a mapping rule is represented as a d-tree in which certain nodes are annotated with semantic information. Mapping rules are a mixed syntactic-semantic representation. The nodes in the syntactic structure are feature structures and we use unification to combine two syntactic nodes (Kay 1983). The semantic annotations of the syntactic nodes are either conceptual graphs or instructions indicating how to compute the semantics of the syntactic node from the semantics of the daughter syntactic nodes. Graphically we use dotted lines to show the coreference between graphs (or concepts). Each graph appearing in the rule has a single node ('the semantic head') which acts as a root (indicated by an arrow in Figure 4). This hierarchical structure is imposed by the rule, and is not part of the semantic input. Every mapping rule has associated applicability semantics which is used to license its application. The applicability semantics can be viewed as an evaluation of the semantic instruction associated with the top syntactic node in the tree description.
Figure 4 shows an example of a mapping rule. The applicability semantics of this mapping rule is the conceptual graph annotating the top syntactic node of the rule in Figure 4. If this structure matches part of the input semantics (we explain more precisely what we mean by matching later on) then this rule can be triggered (if it is syntactically appropriate — see Section 5). The internal generation goals (shaded areas) express the following: (1) generate the verbal concept as a verb and subsert (substitute, attach) the verb's syntactic structure at the V0 node; (2) generate the first argument concept as a noun phrase and subsert the newly built structure at NP0; and (3) generate the second argument concept as another noun phrase and subsert the newly built structure at NP1. The newly built structures are also mixed syntactic-semantic representations (annotated d-trees) and they are incorporated in the mixed structure corresponding to the current status of the generated sentence.
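A sketch (ours; the field names, goal labels and attachment sites are invented, and the applicability semantics and d-tree of Figure 4 are not reproduced) of how such a rule might be stored:

```python
from dataclasses import dataclass

@dataclass
class InternalGoal:
    sem_node: str     # node of the applicability semantics to be expressed
    category: str     # target syntactic category (V, NP, ADV, ...)
    attach_at: str    # node of the rule's d-tree where the result is subserted

@dataclass
class MappingRule:
    name: str
    applicability_sem: object   # conceptual graph licensing the rule
    syntax: object              # annotated syntactic d-tree (not modelled here)
    goals: list

# A stand-in for the transitive rule of Figure 4 (semantics and d-tree omitted).
transitive = MappingRule(
    name="transitive-clause",
    applicability_sem=None,
    syntax=None,
    goals=[InternalGoal("act",   "V",  "V0"),
           InternalGoal("arg-1", "NP", "NP0"),
           InternalGoal("arg-2", "NP", "NP1")])
```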
5 Sentence generation
In this section we informally describe the generation algorithm. In Figure 5 and later in Figure 8, which illustrate some semantic aspects of the processing, we use a diagrammatic notation to describe semantic structures which are actually encoded using conceptual graphs. The input to the generator is InputSem, LowerSem, UpperSem and a mixed structure, Partial, which contains a syntactic part (usually just one node but possibly something more complex) and a semantic part which takes the form of semantic annotations on the syntactic nodes in the syntactic part. Initially Partial represents the syntactic-semantic correspondences which are imposed on the generator.10 It has the format of a mixed structure like the representation used to express mapping rules (Figure 4). Later during the generation Partial is enriched and at any stage of processing it represents the current syntactic-semantic correspondences. We have augmented the DTG formalism so that the semantic structures associated with syntactic nodes will be updated appropriately during the subsertion and sister-adjunction operations. The stages of generation are: (1) building an initial skeletal structure; (2) attempting to consume as much as possible of the semantics uncovered in the previous stage; and (3) converting the partial syntactic structure into a complete syntactic tree.
10 In dialogue and question answering, for example, the syntactic form of the generated sentence may be constrained.
5.1 Building a skeletal structure
Generation starts by first trying to find a mapping rule whose semantic structure matches11 part of the initial graph and whose syntactic structure is compatible with the goal syntax (the syntactic part of Partial). If the initial goal has a more elaborate syntactic structure and requires parts of the semantics to be expressed as certain syntactic structures this has to be respected by the mapping rule. Such an initial mapping rule will have a syntactic structure that will provide the skeleton syntax for the sentence. If Lexicalised DTG is used as the base syntactic formalism, at this stage the mapping rule will introduce the head of the sentence structure — the main verb. If the rule has internal generation goals then these are explored recursively (possibly via an agenda — we will ignore here the issue of the order in which internal generation goals are executed12). Because of the minimality of the mapping rule, the syntactic structure that is produced by this initial stage is very basic — for example only obligatory complements are considered. Any mapping rule can introduce additional semantics and such additions are checked against the lower semantic bound. When applying a mapping rule the generator keeps track of how much of the initial semantic structure has been covered/consumed. Thus at the point when all internal generation goals of the first (skeletal) mapping rule have been exhausted the generator knows how much of the initial graph remains to be expressed.
5.2 Covering the remaining semantics
In the second stage the generator aims to find mapping rules in order to cover most of the remaining semantics (see Figure 5). The choice of mapping rules is influenced by the following criteria:
Connectivity: The semantics of the mapping rule has to match (cover) part of the covered semantics and part of the remaining semantics.
Integration: It should be possible to incorporate the semantics of the mapping rule into the semantics of the current structure being built by the generator.
Realisability: It should be possible to incorporate the partial syntactic structure of the mapping rule into the current syntactic structure being built by the generator.
11 Via the maximal join operation. Also note that the arcs to/from the conceptual relations do not reflect any directionality of the processing — they can be 'traversed'/accessed from any of the nodes they connect.
12 Different ways of exploring the agenda will reflect different processing strategies.
Fig. 5: Covering the remaining semantics with mapping rules
Note that the connectivity condition restricts the choice of mapping rules so that a rule that matches part of the remaining semantics and the extra semantics added by previous mapping rules cannot be chosen (e.g., the 'bad mapping' in Figure 5). While in the stage of fleshing out the skeleton sentence structure (Section 5.1) the syntactic integration involves subsertion, in the stage of covering the remaining semantics it is sister-adjunction that is used. When incorporating semantic structures the semantic head has to be preserved — for example when sister-adjoining the d-tree for an adverbial construction the semantic head of the top syntactic node has to be the same as the semantic head of the node at which sister-adjunction is done. This explicit marking of the semantic head concepts differs from (Shieber et al. 1990) where the semantic head is a PROLOG term with exactly the same structure as the input semantics.
5.3 Completing a derivation
In the preceding stages of building the skeletal sentence structure and covering the remaining semantics, the generator is mainly concerned with consuming the initial semantic structure. In those processes, parts of the semantics are mapped onto partial syntactic structures which are integrated and the result is still a partial syntactic structure. That is why a final step of 'closing off' the derivation is needed. The generator tries to convert the partial syntactic structure into a complete syntactic tree. A morphological post-processor reads the leaves of the final syntactic tree and inflects the words.
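Putting the three stages together, here is a high-level sketch (ours); all helper names are placeholders for the operations described in Sections 5.1-5.3, and backtracking and the agenda are deliberately left out:

```python
def generate(input_sem, upper_sem, lower_sem, partial, rules):
    """High-level shape of the three-stage algorithm.  choose_skeletal_rule,
    apply_rule, satisfy_goal, uncovered, choose_covering_rule,
    sister_adjoin_rule, complete_derivation and morphology are placeholders."""
    # Stage 1: build a skeletal structure via subsertion.
    rule = choose_skeletal_rule(rules, input_sem, partial)
    state = apply_rule(partial, rule, input_sem, lower_sem)
    for goal in rule.goals:                       # explore internal goals
        state = satisfy_goal(state, goal, input_sem, lower_sem, rules)

    # Stage 2: cover the remaining semantics via sister-adjunction, subject
    # to the connectivity / integration / realisability criteria.
    while uncovered(input_sem, state):
        rule = choose_covering_rule(rules, input_sem, state)
        if rule is None:
            break                                 # approximate: a remainder may stay unexpressed
        state = sister_adjoin_rule(state, rule, lower_sem)

    # Stage 3: close off the derivation and inflect the leaves.
    return morphology(complete_derivation(state))
```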
6 Example
In this section we illustrate how the algorithm works by means of a simple example.13 Suppose we start with an initial semantics as given in Figure 1. This semantics can be expressed in a number of ways: Fred limped quickly, Fred hurried with a limp, Fred's limping was quick, The quickness of Fred's limping . . . , etc. Here we show how the first paraphrase is generated.
In the stage of building the skeletal structure the mapping rule (i) in Figure 6 is used. Its internal generation goals are to realise the instantiation of the action concept as a verb and similarly the agent concept as a noun phrase. The generation of the subject noun phrase is not discussed here. The main verb is generated using the terminal mapping rule14 (iii) in Figure 6.15 The skeletal structure thus generated is Fred limp(ed) (see (i) in Figure 7). An interesting point is that although the internal generation goal for the verb referred only to a single concept in the initial semantics, all of the information suggested by the terminal mapping rule (iii) in Figure 6 is consumed. We will say more about how this is done in Section 7. At this stage the only concept that remains to be consumed is the one that will be realised as quickly. This is done in the stage of covering the remaining semantics when the
14
15
For expository purposes some VP nodes normally connected by d-edges have been merged. Terminal mapping rules are mapping rules which have no internal generation goals and in which all terminal nodes of the syntactic structure are labelled with terminal symbols (lexemes). In Lexicalised DTGS the main verbs would be already present in the initial trees.
APPROXIMATE CHART GENERATION
285
mapping rule (ii) is used. This rule has an internal generation goal to generate the instantiation of as an adverb, which yields quickly. The structure suggested by this rule has to be integrated in the skeletal structure. On the syntactic side this is done using sister-adjunction. The final mixed syntactic-semantic structure is shown on the right in Figure 7. In the syntactic part of this structure we have no domination links. Also all
Fig. 7: Skeletal structure and final structure of the input semantics has been consumed. The semantic annotations of the S and VP nodes are instructions about how the graphs/concepts of their daughters are to be combined. If we evaluate in a bottom up fashion the semantics of the S node, we will get the same result as the input semantics in Figure 1. After morphologicalpost-processing the result is Fred limped quickly. An alternative paraphrase like Fred hurried with a limp16 can be generated using a lexical mapping rule for the verb hurry which groups and together and a another mapping rule expressing as a PP. To get both paraphrases would be hard for generators relying on hierarchical representations. 7
Matching the applicability semantics of mapping rules
Matching of the applicability semantics of mapping rules against other se mantic structures occurs in the following cases: when looking for a skeletal structure; when exploring an internal generation goal; and when looking for mapping rules in the phase of covering the remaining semantics. During the exploration of internal generation goals the applicability semantics of 16
16 Our example is based on Iordanskaja et al.'s notion of maximal reductions of a semantic net (see Iordanskaja 1991:300). It is also similar to the example in (Nogier & Zock 1992).
a mapping rule is matched against the semantics of an internal generation goal. We assume that the following conditions hold:
1. The applicability semantics of the mapping rule can be maximally joined with the goal semantics.
2. Any information introduced by the mapping rule that is more specialised than the goal semantics (additional concepts/relations, further type instantiation, etc.) must be within the lower semantic bound (LowerSem). If this additional information is within the input semantics, then information can propagate from the input semantics to the mapping rule (the shaded area 2 in Figure 8). If the mapping rule's semantic additions are merely in LowerSem, then information cannot flow from LowerSem to the mapping rule (area 1 in Figure 8).
Fig. 8: Interactions involving the applicability semantics of a mapping rule
Similar conditions hold when in the phase of covering the remaining semantics the applicability semantics of a mapping rule is matched against the initial semantics. This way of matching allows the generator to convey only the information in the original semantics and what the language forces one to convey, even though more information might be known about the particular situation. In the same spirit, after the generator has consumed/expressed a concept in the input semantics the system checks that the lexical semantics of the generated word is more specific than the corresponding concept (if there is one) in the upper semantic bound.
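A sketch (ours) of the two conditions, again treating maximal join and subsumption abstractly as supplied operations:

```python
def rule_matches_goal(rule_sem, goal_sem, lower_sem, maximal_join, subsumed_by):
    """Condition 1: the applicability semantics must maximally join with the
    goal semantics.  Condition 2: anything the rule adds beyond the goal
    (extra concepts/relations, further type instantiation) must be within
    the lower semantic bound.  Returns the joined graph, or None on failure."""
    joined = maximal_join(rule_sem, goal_sem)
    if joined is None:                       # no join possible
        return None
    if not subsumed_by(lower_sem, joined):   # additions not licensed by LowerSem
        return None
    return joined
```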
8 Preference-based chart generation
During generation appropriate mapping rules have to be found. However, at each stage a number of rules might be applicable. Due to possible
interactions between some rules the generator may have to explore different allowable sequences of choices before actually being able to produce a sentence. Thus, generation is in essence a search problem. Our generator uses a non-deterministic generation strategy to explore the search space.17 The generator explores each one of the applicable mapping rules in turn through backtracking. In practice this means that whenever the generator reaches a dead end (a point in the process where none of the available alternatives are consistent with the choices made so far) it has to undo some previous commitments and return to an earlier choice point where there are still unexplored options. It often happens that computations in one branch of the search space have to be re-done in another even if the first branch did not lead to a solution of the generation goal. Consider a situation where the semantics in Figure 9 is to be expressed.
[Figure 9 (conceptual graph, not reproduced): concepts FULL SCALE, ALEXANDER and TOWN: # and the relations connecting them.]
Fig. 9: Alexander attacked the town. The attack was full scale.
This is in contrast to systemic and classification approaches which are deterministic. The syntactic structure of the mapping rule is a simple declarative transitive tree. It can be argued that the problem with reaching a dead end above is due to the fact that the two available mapping rules have been distinguished too early. Both
288
NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE
In order to address the problem of recomputing structures we have explored aspects of a new semantic-indexed memoing technique based on on-line caching of generated constituents. The general idea is very simple: every time a constituent is generated it is being stored and every time a generation goal is explored the system first checks if the result isn't stored already. Following the corresponding term in parsing this technique has come to be known as chart-generation. Information about partial structures in kept in a chart which is not indexed on string positions (because a certain constituent might appear in different positions in different paraphrases) but on the heads of the headed conceptual graphs which represent the built semantics for the subphrases. 20 We also introduce agenda-based control for chart generators which al lows for an easy way to define an array of alternative processing strategies simply as different ways of exploring the agenda. Having a system that allows for easy definition of different generation strategies provides for the eventual possibility of comparing different algorithms based on the uniform processing mechanism of the agenda-based control for chart generation. 21 One particular aspect which we are currently investigating is the use of syntactic and semantic preferences for rating intermediate results. Syn tactic/stylistic preferences are helpful in cases where the semantics of two paraphrases are the same. One such instance of use of syntactic preferences is avoiding (giving lower rating to) heavy constituents in split verb-particle constructions. With regard to semantic preferences we have defined a novel measure which compares two graphs (say applicability semantics of two mapping rules) with respect to a third (in our case this is the input se mantics). Given a conceptual graph the measure defines what does it mean for one graph to be a better approximate match than another. 22 Thus,
20
21
22
alternatives share a lot of structure and neither can be ruled out in favour of the other during the stage of generating their skeletal structures. Obviously if we used a 'parallel' generation technique that explores shared forests of structure, there would be less need for backtracking. This aspect has remained underexplored in generation work. The major assumption about memoing techniques like chart generation is that retriev ing the result is cheaper than computing it from scratch. For a very long time this was the accepted wisdom in parsing, yet new results show that storing all constituents might not always lead to the best performance (van Noord forthcoming). Chart generation has been investigated by Shieber (1988), Haruno et al. (1993), Pianesi (1993), Neumann (1994), Kay (1996), and Shemtov (forthcoming). For a good discussion of preference-driven processing of natural language (mainly pars ing) see Erbach (1995).
the generator finds all possible solutions (i.e., it is complete) producing the 'best' first.
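A heavily simplified sketch (ours; the indexing scheme and all names are illustrative) of the semantic-indexed memoing idea: completed constituents are cached under the head concept of their built semantics and their category, and a generation goal consults the cache before rebuilding anything:

```python
from collections import defaultdict

class Chart:
    """Constituents are indexed on the head of their (headed) built semantics
    plus the syntactic category, not on string positions, since the same
    constituent may surface in different places in different paraphrases."""
    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, head, category, constituent):
        self.edges[(head, category)].append(constituent)

    def lookup(self, head, category):
        return list(self.edges[(head, category)])

def generate_goal(head, category, chart, build):
    """Memoised generation goal: reuse earlier results instead of recomputing
    them in another branch of the backtracking search."""
    cached = chart.lookup(head, category)
    if cached:
        return cached
    for constituent in build(head, category):
        chart.add(head, category, constituent)
    return chart.lookup(head, category)

def build_np(head, category):
    return [f"{category}({head})"]      # stand-in for real constituent generation

chart = Chart()
print(generate_goal("TOWN", "NP", chart, build_np))   # computed once
print(generate_goal("TOWN", "NP", chart, build_np))   # retrieved from the chart
```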
9 Implementation
We have developed a sentence generator called PROTECTOR (approximate PROduction of TExts from Conceptual graphs in a declaraTive framewORk). PROTECTOR is implemented in LIFE (Aït-Kaci & Podelski 1993). The syntactic coverage of the generator is influenced by the XTAG system (the first version of PROTECTOR in fact used TAGs23). By using DTGs we can use most of the analysis of XTAG while the generation algorithm is simpler because complementation and modification on the semantic side correspond to subsertion and sister-adjunction on the syntactic side. Thus in the stage of building a skeletal structure only subsertion is used. In covering the remaining semantics only sister-adjunction is used. We are in a position to express subparts of the input semantics as different syntactic categories as appropriate for the current generation goal (e.g., VPs and nominalisations). The syntactic coverage of PROTECTOR includes: intransitive, transitive, and ditransitive verbs, topicalisation, verb particles, passive, sentential complements, control constructions, relative clauses, nominalisations and a variety of idioms. On backtracking PROTECTOR returns all solutions. We are also looking at the advantages that our approach offers for multilingual generation.
Discussion
In the previous section we mentioned that generation is a search problem. In order to guide the search a number of heuristics can be used. In (Nogier & Zock 1992) the number of matching nodes has been used to rate different matches, which is similar to finding maximal reductions in (Iordanskaja 1991:300). Alternatively, a notion of semantic distance (cf. Foo 1992) might be employed. In P R O T E C T O R we will use a much more sophisticated notion of what it is for a conceptual graph to match better the initial semantics than another graph. This captures the intuition that the generator should try to express as much as possible from the input while adding as little as possible extra material. We use instructions showing how the semantics of a mother syntactic node is computed because we want to be able to correctly update the se mantics of nodes higher than the place where substitution or adjunction has 23
PROTECTOR-95 was implemented in PROLOG.
290
NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE
taken place — i.e., we want to be able to propagate the substitution or ad junction semantics up the mixed structure whose backbone is the syntactic tree. We also use a notion of headed conceptual graphs, i.e., graphs that have a certain node chosen as the semantic head. The initial semantics need not be marked for its semantic head. This allows the generator to choose an appropriate (for the natural language) perspective. The notion of semantic head and their connectivity is a way to introduce a hierarchical view on the semantic structure which is dependent on the language. When matching two conceptual graphs we require that their heads be the same. This reduces the search space and speeds up the generation process. Our generator is not coherent or complete (i.e., it can produce sentences with more general/specific semantics than the input semantics). We try to generate sentences whose semantics is as close as possible to the input in the sense that they introduce little extra material and leave uncovered a small part of the input semantics. We keep track of more structures as the generation proceeds and are in a position to make finer distinctions than was done in previous research. The generator never produces sentences with semantics which is more specific than the lower semantic bound which gives some degree of coherence. Our generation technique provides flexibility to address cases where the entire input cannot be expressed in a single sen tence by first generating a 'best match' sentence and allowing the remaining semantics to be generated in a follow-up sentence. Our approach can be seen as a generalisation of semantic head-driven generation (Shieber et al. 1990) — we deal with a non-hierarchical input and non-concatenative grammars. The use of Lexicalised DTG means that the algorithm in effect looks first for a syntactic head. This aspect is similar to syntax-driven generation (König 1994). Unlike semantic head-driven generation we generate modifiers after the corresponding syntactic head has been generated which allows for better treatment of colocations. We have specified a declarative definition of 'derivation' in our framework (including the semantic aspects of the approximate generation), yet due to space constraints we omit a full discussion of it here. The notion of derivation in generation is an important one. It allows one to abstract from the procedural details of a particular implementation and to consider the logical relationships between the structures that are manipulated. If alternative generation strategies are to be developed clearly stating what a derivation is, is an important prerequisite. If similar research had been done for other frameworks we could make comparisons with relevant generation
APPROXIMATE CHART GENERATION
291
work; regretably this is not the case.24 Potentially the information in the mapping rules can be used by a nat ural language understanding system too. However, parsing algorithms for the particular linguistic theory that we employ (DTG) have a complexity of O(n 4k+3 ) where n is the number of words in the input string and is the number of d-edges in elementary d-trees. This is a serious overhead and we have not tried to use the mapping rules in reverse for the task of understanding. 25 The algorithm has to be checked against more linguistic data and we intend to do more work on additional control mechanisms and also using alternative generation strategies using knowledge sources free from control information. 11
Conclusion
We have presented a technique for sentence generation from conceptual graphs. The use of a non-hierarchical representation for the semantics and approximate semantic matching increases the paraphrasing power of the generator and enables the production of sentences with radically different syntactic structure due to alternative ways of grouping concepts into words. This is particularly useful for multilingual generation and in practical gen erators which are given input from non linguistic applications. The use of a syntactic theory (D-Tree Grammars) allows for the production of linguist ically motivated syntactic structures which will pay off in terms of better coverage of the language and overall maintainability of the generator. The syntactic theory also affects the processing — we have augmented the syn tactic operations to account for the integration of the semantics. The gen eration architecture makes explicit the decisions that have to be taken and allows for experiments with different generation strategies using the same declarative knowledge sources.26 24
25
26
Yet there has been work on a unified approach of systemic, unification and classification approaches to generation. For more details see (Mellish 1991). The first author is involved in a large project (with David Weir & John Carroll at the University of Sussex) for "Analysis of Naturally-Occurring English Text with Stochastic Lexicalised Grammars" which uses the same grammar formalism (D-Tree Grammars). The goal of the project is to develop a wide-coverage parsing system for English. From the point of view of generation it is interesting to investigate the bidirecţionality of the grammar, i.e., whether the grammar used for parsing can be used for generation. More details about the above mentioned project can be found at http ://www.cogs.susx.ac.uk/lab/nlp/dtg/. More details about the P R O T E C T O R generation system are available on the world-wide web: h t t p : / / w w w . c o g s . s u s x . a c . u k / l a b / n l p / n i c o l a s / .
292
NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE REFERENCES
Aït-Kaci, Hassan & Andreas Podelski. 1993. "Towards a Meaning of LIFE". Journal of Logic Programming 16:3&4.195-234. Antonacci, F. et al. 1992. "Analysis and Generation of Italian Sentences". Con ceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 437-460. London: Ellis Horwood. Boyer, Michel & Guy Lapalme. 1985. "Generating Paraphrases from MeaningText Semantic Networks". Computational Intelligence 1:1.103-117. Caspari, Rudolf & Ludwig Schmid. 1994. "Parsing and Generation in TrUG [in German]". Verbmobil Report 40. Siemens AG. Doran, Christine et al. 1994. "XTAG — A Wide Coverage Grammar for Eng lish". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 922-928. Kyoto, Japan. Erbach, Gregor. 1995. Bottom-Up Barley Deduction for Preference-Driven Nat ural Language Processing. Ph.D. dissertation, University of the Saarland. Saarbrücken, Germany. Foo, Norman et al. 1992. "Semantic Distance in Conceptual Graphs". Concep tual Structures: Current Research and Practice ed. by Timothy Nagle et al., 149-154. London: Ellis Horwood. Harbusch, Karin, G. Kikui & A. Kilger. 1994. "Default Handling in Incremental Generation". Proceedings of the 15th International Conference on Computa tional Linguistics (COLING-94)) 356-362. Kyoto, Japan. Iordanskaja, Lidija, Richard Kittredge & Alain Polguère. 1991. "Lexical Selec tion and Paraphrase in a Meaning-Text Generation Model". Natural Lan guage Generation in Artificial Intelligence and Computational Linguistics ed. by C.Paris, W.Swartout & W.Mann, 293-312. Dordrecht, The Nether lands: Kluwer. Joshi, Aravind. 1987. "The Relevance of Tree Adjoining Grammar to Gen eration". Natural Language Generation ed. by Gerard Kempen, 233-252. Dordrecht, The Netherlands: Kluwer. Kay, Martin. 1983. "Unification Grammar". Technical Report. Palo Alto, Calif.: Xerox Palo Alto Research Center. 1996. "Chart Generation", Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL'96), 200-204, Santa Cruz, Calif.: Association for Computational Linguistics. König, Esther. 1994. "Syntactic Head-Driven Generation". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 475-481. Kyoto, Japan.
APPROXIMATE CHART GENERATION
293
McCoy, Kathleen F., . Vijay-Shanker & G. Yang. 1992. "A Functional Ap proach to Generation with TAG". Proceedings of the 30th Meeting of the Association for Computational Linguistics ACL (ACL'92), 48-55. Delaware: Association for Computational Linguistics. McDonald, David & James Pustejovsky. 1985. "TAGs as a Grammatical Form alism for Generation". Proceedings of the 23rd Annual Meeting of the As sociation for Computational Linguistics (ACL'85), 94-103. Chicago, Illinois: Association for Computational Linguistics. Mellish, Chris. 1991. "Approaches to Realization in Natural Language Genera tion". Natural Language and Speech ed. by Ewan Klein & Frank Veltman, 95-116. Berlin: Springer-Verlag. Meteer, Marie. 1990. The "Generation Gap": The Problem of Expressibüüy in Text Planning. Ph.D. dissertation, Univ. of Massachusetts, Mass. (Also available as COINS TR 90-04.) Neumann, Günter. 1994. A Uniform Computational Model for Natural Lan guage Parsing an Generation. Ph.D. dissertation. University of Saarland, Saarbrücken, Germany. Nicolov, Nicolas, Chris Mellish & Graeme Ritchie. 1995. "Sentence Genera tion from Conceptual Graphs". Conceptual Structures: Applications, Imple mentation and Theory, (LNAI 954) ed. by -G. Ellis, R. Levinson, W. Rich & J. Sowa, 74-88. Berlin: Springer. Nogier, Jean-François. 1991. Génération Automatique Conceptuels. Paris: Hermès.
de Langage et Graphs
& Michael Zock. 1992. "Lexical Choice as Pattern Matching". Conceptual Structures: Current Research and Practice, ed. by Timothy Nagle et al., 413-436. London: Ellis Horwood. van Noord, Gertjan. Forthcoming. "An Efficient Implementation of the HeadCorner Parser". To appear in Computational Linguistics. Oh, Jonathan et al. 1992. "NLP: Natural Language Parsers and Generators". Proceedings of the 1st International Workshop on PEIRCE: A Conceptual Graph Workbench, 48-55. Las Cruces: New Mexico State University. Pianesi, Fabio. 1993. "Head-Driven Bottom-Up Generation and Government and Binding: A Unified Perspective". New Concepts in Natural Language Generation ed. by H. Horacek & M. Zock, 187-214. London: Pinter. Rambow, Owen, K. Vijay-Shanker & David Weir. 1995a. "D-Tree Grammars". Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL',95), 151-158. Boston, Mass.: Association for Computa tional Linguistics.
294
NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE
Rambow, Owen, K. Vijay-Shanker & David Weir. 1995b. "Parsing D-Tree Gram mars". Proceedings of the International Workshop on Parsing Technologies (IWPT'95), 252-259. Prague. Reiter, Ehud. 1991. "A New Model of Lexical Choice for Nouns". Computational Intelligence (Special Issue on Natural Language Generation) 7:4.240-251. Shapiro, Stuart. 1982. "Generalized Augmented Transition Network Grammars for Generation from Semantic Networks". Computational Linguistics 2:8.1225. 1989. "The CASSIE projects: An approach to NL Competence. Proceed ings of the 4th Portugese Conference on AI: EPIA-89 (LNAI 390), 362-380. Berlin: Springer. Shemtov, Hadar. Forthcoming. Logical Forms".
"Generation of Paraphrases from Ambiguous
Shieber, Stuart, Gertjan Noord, Robert Moore & Fernando Pereira. 1990. "A Semantic Head-Driven Generation Algorithm for Unification-Based Formal isms". Computational Linguistics 16:1.30-42. Simmons, R. & J. Slocum. 1972. "Generating English Discourse from Semantic Networks". Communications of the Association for Computing Machinery (CACM) 15:10.891-905. Smith, Mark, Roberto Garigliano & Richard Morgan. 1994. "Generation in the LOLITA System: An Engineering Approach". Proceedings of the 7th Int. Workshop on Natural Language Generation, 241-244. Kennebunkport, Maine, U.S.A. Sowa, John. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison-Wesley. Sowa, John. 1992. "Conceptual Graphs Summary". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 3-51. London: Ellis Horwood. Svenberg, Stefan. 1994. "Representing Conceptual and Linguistic Knowledge for Multilingual Generation in a Technical Domain". Proceedings of the 7th International Workshop on Natural Language Generation (IWNLG'94), 245248. Kennebunkport, Maine, U.S.A. Afke van Rijn. 1991. Natural Language Communication between Man and Ma chine. Ph.D. dissertation, Technical University Delft, The Netherlands. Wahlster, Wolfgang et al. 1991 "WIP: The Coordinated Generation of Mul timodal Presentations from a Common Representation". Technical Report RR 91-08. Saarbrücken, Germany: DFKI.
Example-Based Optimisation of Surface-Generation Tables CHRISTER SAMUELSSON
Universität
des
Saarlandes
Abstract A method is given that 'inverts' a logic grammar and displays it from the point of view of the logical form, rather than from that of the word string. LR-compiling techniques are used to allow a recursivedescent generation algorithm to perform 'functor merging' much in the same way as an LR parser performs prefix merging. This is an improvement on the semantic-head-driven generator that results in a much smaller search space. The amount of semantic lookahead can be varied, and appropriate tradeoff points between table size and resulting nondeterminism can be found automatically. This can be done by removing all spurious nondeterminism for input sufficiently close to the examples of a training corpus, and large portions of it for other input, while preserving completeness. 1
1
Introduction
W i t h the emergence of fast algorithms and optimisation techniques for syn tactic analysis, such as the use of explanation-based learning in conjunction with LR parsing, see (Samuelsson & Rayner 1991) and subsequent work, surface generation has become a major bottleneck in NLP systems. Surface generation will here be viewed as the inverse problem of syntactic analysis and subsequent semantic interpretation. The latter consists in constructing some semantic representation of an input word-string based on the syn tactic and semantic rules of a formal grammar. In this article, we will limit ourselves to logic grammars t h a t attribute word strings with expressions in some logical formalism represented as terms with a functor-argument struc ture. T h e surface generation problem then consists in assigning an o u t p u t 1
I wish to thank greatly Gregor Erbach, Jussi Karlgren, Manny Rayner, Hans Uszkoreit, Mats Wirén and the anonymous reviewers of ACL, EACL, IJCAI and RANLP for valuable feedback on previous versions of this article. Special credit is due to Kristina Striegnitz, who assisted with the implementation. Parts of this article have previously appeared as (Samuelsson 1995). The presented work was funded by the N3 "Bidirektionale Linguistische Deduktion (BiLD)" project in the Sonderforschungsbereich 314 Künstliche Intelligenz — Wissensbasierte Systeme.
296
CHRISTER SAMUELSSON
word-string to such a term. This is a common scenario in conjunction with for example transfer-based machine-translation systems employing revers ible grammars, and it is different from that when a deep generator or a text planner is available to guide the surface generator. In general, both these mappings are many-to-many: a word string that can be mapped to several distinct logical forms is said to be ambiguous. A logical form that can be assigned to several different word strings is said to have multiple paraphrases. We want to create a generation algorithm that generates a word string by recursively descending through a logical form, while delaying the choice of grammar rules to apply as long as possible. This means that we want to process different rules or rule combinations that introduce the same piece of semantics in parallel until they branch apart. This will reduce the amount of spurious search, since we will gain more information about the rest of the logical form before having to commit to a particular grammar rule. In practice, this means that we want to perform 'functor merging' much in the same ways as an LR parser performs prefix merging by employing parsing tables compiled from the grammar. One obvious way of doing this is to use LR-compilation techniques to compile generation tables. This will however require that we reformulate the grammar from the point of view of the logical form, rather than from that of the word string from which it is normally displayed. The rest of the paper is structured as follows: We will first review ba sic LR compilation of parsing tables in Section 2. The grammar-inversion procedure turns out to be most easily explained in terms of the semantichead-driven generation (SHDG) algorithm. We will therefore proceed to outline the SHDG algorithm in Section 3. The grammar inversion itself is described in Section 4, while LR compilation of generation tables is dis cussed in Section 5. The generation algorithm is presented in Section 6. The example-based optimisation technique turns out to be most easily ex plained as a straight-forward extension of a simpler optimisation technique predating it, why this simpler technique is given in Section 7. This exten sion is described in Section 8 and the relation between this example-based optimisation technique and explanation-based learning is discussed in Sec tion 9.
OPTIMISATION OF GENERATION TABLES 2
297
LR compilation for parsing
LR compilation in general is well-described in for example (Aho et al. 1986:215-247). Here we will only sketch out the main ideas. An LR parser is basically a pushdown automaton, i.e., it has a pushdown stack in addition to a finite set of internal states and a reader head for scanning the input string from left to right one symbol at a time. The stack is used in a characteristic way. The items on the stack consist of alternating grammar symbols and states. The current state is simply the state on top of the stack. The most distinguishing feature of an LR parser is however the form of the transition relation — the action and goto tables. A nondeterministic LR parser can in each step perform one of four basic actions. In state S with lookahead symbol2 Sym it can: 1. accept: Halt and signal success. 2. error: Fail and backtrack. 3. shift S2: Consume the input symbol Sym, push it onto the stack, and transit to state S2 by pushing it onto the stack. 4. reduce R: Pop off two items from the stack for each grammar symbol in the RHS of grammar rule R, inspect the stack for the old state S1 now on top of the stack, push the LHS of rule R onto the stack, and transit to state S2 determined by goto(S1,LHS,S2) by pushing S2 onto the stack. Consider the small sample grammar given in Figure 1. To make this simple grammar slightly more interesting, the recursive Rule 1, S → S QM, allows the addition of a question mark (QM) to the end of a sentence (S), as in John sleeps?. The LHS S is then interpreted as a yes-no question version of the RHS S. Each internal state consists of a set of dotted items. Each item in turn corresponds to a grammar rule. The current string position is indicated by a dot. For example, Rule 2, S → NP VP, yields the item S → NP . VP, which corresponds to just having found an NP and now searching for a VP. In the compilation phase, new states are induced from old ones. For the indicated string position, a possible grammar symbol is selected and the dot is advanced one step in all items where this particular grammar symbol immediately follows the dot, and the resulting new items will constitute the kernel of the new state. Non-kernel items are added to these by selecting 2
The lookahead symbol is the next symbol in the input string, i.e., the symbol under the reader head.
298
CHRISTER SAMUELSSON
s s
VP VP VP VP PP
→ → -→→
→ → → →
S QM NP VP VP PP VP AdvP
Vi
Vt NP Ρ ΝΡ
1 2 S
4 5 6 7
NP NP NP
vi vt Ρ AdvP QM
→ → → → → → → →
John Mary Paris sleeps sees in today ?
Fig. 1: Sample grammar grammar rules whose LHS match grammar symbols at the new string posi tion in the new items. In each non-kernel item, the dot is at the beginning of the rule. If a set of items is constructed that already exists, then this search branch is abandoned and the recursion terminates.
State 1 ff. •S S • 5" QM S NP VP
State 2
ff S•. S S • QM
State 3 S NP • VP VP •VP PP VP •. VP AdvP VP •.Vi VP • Vt N
State 4 S NP VP . VP VP • PP VP VP • AdvP PP • Ρ ΝΡ
State 8 PP Ρ NΡ .
State 5 VP • VP ΡΡ .
State 10 VP Vt • NP
State 6 VP = VP AdvP .
State 11 VP Vt NP•
State 7 PP Ρ . NP
State 12 S S QM •
State 9 VP Vi •.
Fig. 2: LR-parsing states for the sample grammar The state-construction phase starts off by creating an initial set consist ing of a single dummy kernel item and its non-kernel closure. This is State 1 in Figure 2. The dummy item introduces a dummy top grammar symbol as its LHS, while the RHS consists of the old top symbol, and the dot is at the beginning of the rule. In the example, this is the item S' • S. The rest
299
OPTIMISATION OF GENERATION TABLES
of the states are induced from the initial state. The states resulting from the sample grammar of Figure 1 are shown in Figure 2, and these in turn will yield the parsing tables of Figure 3. The entry "s3" in the action table, for example, should be interpreted as "shift the lookahead symbol onto the stack and transit to State 3". The entry "r7" should be interpreted as "re duce by Rule 7". The accept action is denoted "acc" . The goto entries, like "g4", simply indicate what state to transit to once a nonterminal of that type has been constructed. NP 1 s3 2 3 4 5 6 7 s8 8 9 10 s11 11 12
VP
PP
AdvP
g4 g5
Vi
Vt
s9
s lO
Ρ
S g2
QM
eos
sl2
C
s6 r3 r4
s7 r3 r4
r2 r3 r4
r2 4
r7 r5
r7 r5
r7 r5
7 5
r6
r1
r1
Fig. 3: LR-parsing tables for the sample grammar In conjunction with grammar formalisms employing complex feature structures, this procedure is associated with a number of interesting prob lems, many of which are discussed in (Nakazawa 1991) and (Samuelsson 1994c). For example, the termination criterion must be modified: if a new set of items is constructed that is more specific than an existing one, then this search branch is abandoned and the recursion terminates. If, on the other hand, it is more general, then it replaces the old one. 3
The semantic head-driven generation algorithm
Generators found in large-scale systems such as the DFKI DISCO system (Uszkoreit et al. 1994), or the SRI Core Language Engine (Alshawi (ed.) 1992:268-275), tend typically to be based on the semantic-head-driven gen eration (SHDG) algorithm. The SHDG algorithm is well-described in (Shieber et al. 1990); here we will only outline the main features.
300
CHRISTER SAMUELSSON
The grammar rules of Figure 1 have been attributed with logical forms as shown in Figure 4. The notation has been changed so that each constitu ent consists of a quadruple , where W0 and W1 form a difference list representing the word string that Cat spans, and Sem is the logical form. For example, the logical form corresponding to the LHS S of the (S,mod(X,Y), W0 W) → (S,X,W0, W1) rule, consists of a modifier Y added to the logical form X of the RHS S. As we can see from the last grammar rule, this modifier is in turn realised as ynq. 1 2 3 4 5 6
7
For the SHDG algorithm, the grammar is divided into chain rules and nonchain rules. Chain rules have a distinguished RHS constituent, the semantic head, that has the same logical form as the LHS constituent, modulo λabstractions; non-chain rules lack such a constituent. In particular, lexicon entries are non-chain rules, since they do not have any RHS constituents at all. This distinction is made since the generation algorithm treats the two rule types quite differently. In the example grammar, rules 2 and 5 through 7 are chain rules, while the remaining ones are non-chain rules. A simple semantic-head-driven generator might work as follows: Given a grammar symbol and a piece of logical form, the generator looks for a non-chain rule with the given semantics. The constituents of the RHS of that rule are then generated recursively, after which the LHS is connected
OPTIMISATION OF GENERATION TABLES
301
to the given grammar symbol using chain rules. At each application of a chain rule, the rest of the RHS constituents, i.e., the non-head constituents, are generated recursively. The particular combination of connecting chain rules used is often referred to as a chain. The generator starts off with the top symbol of the grammar and the logical form corresponding to the string that is to be generated. The inherent problem with the SHDG algorithm is that each rule com bination is tried in turn, while the possibilities of prefiltering are rather limited, leading to a large amount of spurious search. The generation al gorithm presented in the current article does not suffer from this problem; what the new algorithm in effect does is to process all chains from a partic ular set of grammar symbols down to some particular piece of logical form in parallel before any rule is applied, rather than to construct and try each one separately in turn. 4
Grammar inversion
Before we can invert the grammar, we must put it in normal form. We will use a variant of chain and non-chain rules, namely functor-introducing rules corresponding to non-chain rules, and argument-filling rules corresponding to chain rules. The inversion step is based on the assumption that there are no other types of rules. Since the generator will work by recursive descent through the logical form, we wish to rearrange the grammar so that arguments are generated together with their functors. To this end we introduce another difference list A0 and A to pass down the arguments introduced by argument-filling rules to the corresponding functor-introducing rules. Here the latter rules are assumed to be lexical, following the tradition in GPSG where the presence of the SUBCAT feature implies a preterminal grammar symbol, see e.g., (Gazdar et al. 1985:33), but this is really immaterial for the algorithm. The grammar of Figure 4 is shown in normal form in Figure 5. The grammar is compiled into this form by inspecting the flow of arguments through the logical forms of the constituents of each rule. In the functorintroducing rules, the RHS is rearranged to mirror the argument order of the LHS logical form. The argument-filling rules have only one RHS constituent — the semantic head — and the rest of the original RHS constituents are added to the argument list of the head constituent. Note, for example, how the NP is added to the argument list of the VP in Rule 2, or to the argument list of the Ρ in Rule 7. This is done automatically, although currently, the exact flow of arguments is specified manually.
302
CHRISTER SAMUELSSON
Functor-introducing rules ‹S,mod(X,Y),W0,W,ϵ,ϵ→ ‹S,X,W0,W1e,e ‹QM,Y,W u W,e,e ̂ m od(Y,Z),W 0 ,W,A Q ,A → VP,X VP, X̂Y,Wo, W1A0, A AdvP, Z, W1, W, e, e VP,X ̂ m od(Y,Z),W 0 ,W,A 0 ,A→ VP, X̂Y,W0, W1, A0, A PP, Ζ, W1, W, e, e NP,joha,[John\W],W,A,e → A NP,mary,[Mory|W],W,A,ϵ→ A. N Ρ , p a r i s , [Paris|W],W, A, ϵ → A Vi,X^rsleep(X),[sleeps\W],W,A,e → A V t ,X^Y^see(X,Y),[see\W],W,A,e → A P,X^in(X),[in|W],W;A,ϵ → A AdvP, today, [today\W),W, A,ϵ → A QM,ynq,[?|W],W,A,ϵ → A
1 3 4
Argument-filling rules S,Y,W0, W, ϵ,ϵ → VP,XY,W1,W,[NP,X,W0,W1],ϵ VP,X,W0,W;A0,A → Vi,X,W0,W,A0,A) 5 VP,Y,W0,W;A0,A)→ Vt,X^Y,W0,[NP,X,W1,W|A0],A
P P , Y , W 0 , W , A 0 , A >→ Ρ,Χ^YW0.W1[NP,XW1,W Fig. 5: Sample grammar in normal form
|A0],A
6
7
We assume that there are no purely argument-filling cycles. For rules that actually fill in arguments, this is obviously impossible, since the number of arguments decreases strictly. For the slightly degenerate case of argumentfilling rules which only pass along the logical form, such as the (VP,X) → Vi, X rule, this is equivalent to the off-line parsability requirement, (Kaplan & Bresnan 1982:264-266).3 We require this in order to avoid an infinite number of chains, since each possible chain will be expanded out in the inversion step. Since subcategorisation lists of verbs are bounded in length, PATR π style VP rules do not pose a serious problem, which on the other hand the 'adjunct-as-argument' approach taken in (Bouma & van Noord 1994) may do. However, this problem is common to a number of other generation algorithms, including the SHDG algorithm. Let us return to the scenario for the SHDG algorithm given at the end of Section 3: we have a piece of logical form and a grammar symbol, and 3
If the RHS Vi were a VP, we would have a purely argument-filling cycle of length 1.
OPTIMISATION OF GENERATION TABLES
303
we wish to connect a non-chain rule with this particular logical form to the given grammar symbol through a chain. We will generalise this scenario just slightly to the case where a set of grammar symbols is given, rather than a single one. Each inverted rule will correspond to a particular chain of argumentfilling (chain) rules connecting a functor-introducing (non-chain) rule in troducing this logical form to a grammar symbol in the given set. The arguments introduced by this chain will be collected and passed down to the functors that consume them in order to ensure that each of the inver ted rules has a RHS matching the structure of the LHS logical form. The normalised sample grammar of Figure 5 will result in the inverted grammar of Figure 6. Note how the right-hand sides reflect the argument structure of the left-hand-side logical forms. As mentioned previously, the collected arguments are currently assumed to correspond to functors introduced by lexical entries, but the procedure can readily be modified to accommodate grammar rules with a non-empty RHS, where some of the arguments are consumed by the LHS logical form. The grammar inversion step is combined with the LR-compilation step. This is convenient for several reasons: Firstly, the termination criteria and the database maintenance issues are the same in both steps. Secondly, since the LR-compilation step employs a top-down rule-invocation scheme, this will ensure that the arguments are passed down to the corresponding functors. In fact, invoking inverted grammar rules merely requires first invoking a chain of argument-filling rules and then terminating it with a functor-introducing rule. 5
LR compilation for generation
Just as when compiling LR-parsing tables, the compiler operates on sets of dotted items. Each item consists of a partially processed inverted grammar rule, with a dot marking the current position. Here the current position is an argument position of the LHS logical form, rather than some position in the input string. New states are induced from old ones: For the indicated argument po sition, a possible logical form is selected and the dot is advanced one step in all items where this particular logical form can occur in the current ar gument position, and the resulting new items constitute a new state. All possible grammar symbols that can occur in the old argument position and that can have this logical form are then collected. From these, all rules with
304
CHRISTER SAMUELSSON
(S, mod(X, Y), W 0 ,W, e, e) → (S,X,W0iWue,e) (QM,Y,W1,W,e,e) S,mod(Y,Z),W 0 ,W,e,e) → {VP,X^Y,W1,W2,[(NP,X,WQ,W1)},e) (AdvP,Z,W2,W,e,e) S,mod(Y,Z),W 0,W,e,e) → (VP, X^Y, W1, W2, [{NP, X, W0, W ) ] , e) (PP, Z, W2, W, e, ϵ VP, X^mod(Y, Z), W1, W, [(NP, X, W0, W1)),ϵ ) → VP, X^Y, W1, W2, [{NP, X, W0, W1], ϵ ΛdvΡ, Z, W2, W,ϵ ,ϵ) (VP, X^mod(Y, Z), W1, W, [NΡ, X, W0, W1],ϵ → (VP, X-Y, W1, W2, [NP, X, W0, W1,],ϵ P P , Z, W2,W,ϵ,ϵ (S,sleep(X),W 0 ,W,ϵ ,ϵ → NP,X, W0, [sleeps|W],ϵ,ϵ (VP, X^sleep(X), [sleeps\W},W, [NP, X, W0, [sleeps| W])],ϵ→ (NP, X, W0,[sleepsl|W],ϵ,ϵ (S,see(X,Y),Wo,W,ϵ ,ϵ → (MP,X, W1; W,ϵ ,ϵ NP,Y, W0,[sees|W1],ϵ,ϵ VP,Y^see(X,Y),[sees|W0],W,[(NP,Y,W1,[sees|W0]], ϵ → NP,X,Wo,W,ϵ ,ϵ (NP,Y, Wl, [sees|W0]ϵ ,ϵ PP,X^in(X),[in|W 0 ],W,ϵ ,ϵ → NP, X, W0,W,ϵ,ϵ NP, John, [Johin|W],W,ϵ ,ϵ → ϵ NP,mary, [Mary|W],W,ϵ ,ϵ → ϵ NP, p a r i s , [Paris|W],W,ϵ ,ϵ → ϵ AdvP, today, [today|W],W,ϵ ,ϵ → ϵ QM,ynq,[?|W],W,ϵ ,ϵ → ϵ Fig. 6: Inverted sample grammar a matching LHS are invoked from the inverted grammar. Each such rule will give rise to a new item where the dot marks the first argument position, and the set of these new items will constitute another new state. If a new set of items is constructed that is more specific than an existing one, then this search branch is abandoned and the recursion terminates. If it on the other hand is more general, then it replaces the old one. The state-construction phase starts off by creating an initial set con sisting of a single dummy item with a dummy top grammar symbol and a dummy top logical form, corresponding to a dummy inverted grammar rule. In the sample grammar, this would be the rule (S',f (X), W0, W,ϵ ,ϵ → S, X, W 0 ,Wϵ ,ϵ The dot is at the beginning of the rule, selecting the first and only argument. The rest of the states are induced from this one.
OPTIMISATION OF GENERATION TABLES
305
The first three states resulting from the inverted grammar of Figure 6 are shown in Figure 7, where the difference lists representing the word strings are omitted. State 1 S',f(X),e,e)
=> .<S,X,ϵ,ϵ>
State 2 S,mod(X,Y),ϵ,ϵ => . 5,X,ϵ,c) (QM,Y,ϵ,ϵ S,mod(Y,Z),ϵ,ϵ => · VP,XY,[NP,X],e (AdvP,Z,ϵ,ϵ S,mod(Y,Z),ϵ,ϵ => . VP,XY,[NP,X],e(PP,Z,ϵ,ϵ State 3 S,mod(X,Y),ϵ,ϵ => . S,X,e,eQM,Y,c,e S,mod(Y,Z),ϵ,ϵ => . VP,XY,[(NP,X)],ϵ) (AdvP,Z,ϵ,ϵ S,mod(Y,Z),ϵ,ϵ =» .VP,X^Y,[(7VP,X)],6)PP,Z,ϵ,ϵ VP,X^mod(Y,Z,[(NP,X)],ϵ) => . VP,X^Y, [NP,X)],ϵ is that A computer is a machine that processes information. <X> is a sort of A bicycle is a sort of vehicle. Table 1: Schemata for a definition <X> is somehow like.. A cat is somehow like a tiger. is to as is to . Good is to light as evil is to darkness. Table 2: Schemata for comparison This is true not only on the higher levels (paragraph, text level), — stories, news, weather forecast, sport reports, etc. are clearly schematic — but also on the lower levels (phoneme, word, sentence level). Actions, events, states, processes Entities, names, places Properties, attributes of entities Manner, attributes of actions Intensifier, location, time Means Spatial relations: path, position, direction
verbs nouns adjectives adverbs adverbs prepositions
build, happen, be, sleep car, Paul, Tokyo young, bright slowly very, here, tomorrow by, with
prepositions
from, in, on, towards
Table 3: Mapping of ontological categories on syntactic categories When translating a message into discourse, the speaker maps a conceptual structure (deep structure) onto a linguistic form (surface structure). Thus, concepts are mapped on words, each of which have a specific categorial
322
MICHAEL ZOCK
potential, i.e., part of speech (Table 3), 10 deep-case relations are mapped on grammatical functions (Table 4), and conceptual configurations, i.e., larger conceptual structures, are mapped on syntactic structures (Table 5), etc. agent, cause object, patient beneficiary, recipient
subject direct object indirect object
Table 4: Mapping of case relations on grammatical functions 1. [PERSON:#][MUSIC:*] D + N + V + D + N
11
T h e girl p l a y s a s o n g .
2. [MUSIC:*] 4 (l)5 features: (MOOD *DECLARATIVE) F-structure assembled by the controller: [sentence [subj P R O N O U N ] W E R D E
[xcomp [subj PRONOUN] [adj GERNE] ANMELDEN [obj PRONOUN] [pp.adj FÜR [obj [det DER] KONFERENZ]]]] (1) VP network Input: source: [sent [subj I] WOULD [xcomp [subj I] L I K E [xcomp R E G I S T E R ] ] ] context: NIL role: sentence p-head:6 NIL Output: head: WERDE subs: subj (2) xcomp (3) features: ((CAT V) (PERSON 1) (MODAL +) (FORM FIN) ...) F-structure assembled by the controller: [ [subj PRONOUN] WERDE [ [subj PRONOUN] [adj GERNE] ANMELDEN [obj PRONOUN] [pP-adj FÜR [obj [det DER] KONFERENZ]]]]
400
YE-YI WANG & ALEX WAIBEL
In step (0), the controller first activates the IP network with the source input f-structure. There is no context input for the IP network, since the sentential f-structures are the top level f-structures in our task. From the network's output, the controller knows that the head of the IP is NIL7. It also generates the sentential feature (MOOD *DECLARATIVE). And it interprets the output as that the only sub-structure of the sentence is a German VP, whose English counterpart is the (non-proper) sub-structure with the head WOULD8. Therefore it builds the target f-structure framework [ NIL (MOOD *DECLARATIVE) [sentence *]], and activates the VP network in step (1). Upon receiving the VP sub-structure returned from step (1), it combines that sub-structure with the f-structure framework, and collapses the NILheaded f-structure to form the assembled f-structure shown as the output in step (0). In step (1), the input source was determined in step (0), since the sen tence sub-structure's head was "WOULD" according to the IP network's sub-structure's input specifier in step (0). The context input is NIL because the source f-structure does not have a parent f-structure. The input role has the value sentence because the slot position of the output sub-structure in step (0) implies the grammatical relation of the sub-structure is sentence. The input p-head is NIL because the head of target f-structure in step (0) was NIL as specified by the output HF-vector there. The VP network maps the input f-structure to its German counterpart by specifying (a) the head of the German VP structure WERDE and the terminal features of the German VP structure in the output HF-vector, and (b) the input specifiers and the categories of the sub-structures of the target German VP f-structure. To build detailed sub-structures for this VP f-structure, the controller will activate the NP network with the input of the English sub-structure with the head "I" and the VP network with the input English sub-structure with the head "LIKE" in the subsequent steps, and 4 The sub-structure's input specifier and category are combined into a tuple here. 5 The number in the parenthesis indicates the subsequent step of network activation for this sub-structure. 6 P-head is the head of the target f-structure's parent structure 7 NIL-headed f-structure happens only when there is only one sub-structure or when there is an xcomp sub-structure. The NIL-headed f-structure must collapse into the only sub-structure in the first case, or into the xcomp sub-structure in the second case. All terminal features and other sub-structures are moved into the collapsed-into sub-structure during collapsing. 8 The network actually specifies the slot at the input layer instead of the lexical item WOULD.
CONNECTIONIST TRANSFER
401
combine the sub-structures returned from these subsequent steps into the f-structure framework [[subj *] WERDE [xcomp *]]. The combined structure is then returned to step (0) to be integrated into the top level f-structure framework there. 5
Training, testing and performance
From the 300 sentential f-structure pairs, we extracted all the German NP sub-structures, their grammatical relations and their parent structures' heads. We labeled their English counterparts 9 . These were all the inform ation required for the training of the NP network. About 700 samples for the NP networks were created this way. The training samples for the other networks were prepared in the same way. The NP network had the most samples, while the MP network had the least of 89 samples. Stand ard back-propagation was used to train the networks. We also tried the information-theoretical networks (Gorin et al. 1991) to generate the head of a target structure in the HF-vector, which required less training time and achieved comparable performance as the network trained with pure backpropagation algorithm (Wang 1994). The training took 500 to 2000 epochs for different networks, and the training time ranged from one hour to three days on DEC Station 5000. The mapper achieved 92.4% accuracy on the training data 10 . Learnability: The connectionist f-structure transfer described above did not require any hand-crafted rules or representations. The structure transfer was learned automatically. By clustering the distributed represent ations of words learned by the networks, i.e., the activation patterns of a feature slot when a lexical item was presented to its connected input slot, we had some interesting findings about what was learned by the networks. One of them was that the feature patterns for English nouns in the DP network were clustered into three classes, which reflected the three genders of German nouns: the German translations of the words in each class were roughly of the same gender. Another finding was about the classification of verbs. When we clustered the feature patterns for verbs in the VP networks, we found some intransitive verbs like register in the same class as most of the transitive verbs. This seemly strange classification is not odd at all if we consider the fact that the German translation for register, "anmelden", is a 9 A n NP's counterpart i s not necessary to be a n NP. A source language f-structure is said to be accurately mapped if the generated target language f-structure is exactly the same as desired in the sample.
10
402
YE-YI WANG & A L E X WAIBEL
transitive verb. These two independent findings reveal the networks' ability to discover some linguistic features of the target language and use it in the representation of an entity of the source language which does not possess those features. This is exactly what a symbolic transfer are supposed to do: using an intermediate representation which reflects the linguistic features of the two languages in question (even if one of the languages may have degen erated form for a specific feature,) and thus being able to make a 'transfer' at both the lexical and structural level into corresponding structure in the target language. Our system learned the intermediate representation auto matically, although it was not expressed explicitly in symbolic forms but encoded in the networks' activation patterns. Because the development of this representation was integrated into the process of automatic learning of f-structure mapping, it tended to include in the intermediate representation the important language specific linguistic features which were directly rel evant for the ultimate purpose of structure transfer. In the other words, the learning of the intermediate representation was focused on the purpose of improving the transfer performance. This is one of the biggest advantage of this approach over the hand-crafted intermediate representation. Scalability: We did a preliminary scalability experiment. We extended the source and target language lexicon by 2%, and made 30 new f-structures with these new lexical items. Trying to scale up from what was already learned, we froze all but the input-feature connections, trained the network for about 40 epochs with the new data, then fine-tuned all the connections with old and new data for a few epochs. In doing so, we let the networks first learn the new words to derive their distributed representations, and then learn the structure mapping for the new data later. This approach was based on the observation that a big portion of the new English words were translated to some German words already in the lexicon, which in turn was translated from some English words in the old training data. These old Eng lish words were mostly the synonyms of the new English words. By freezing the other connections and training only the input-feature connections, we hoped the networks to be able to develop the distributed representation for a new word similar to the already-learned representations of its synonyms. This approach greatly reduced the learning time for new words, since the one layer back-propagation was much fast than the full-blown learning. The mapper with the new phrasal networks that were retrained this way achieved 83.3% accuracy on the new data, without affecting the performance on the old data.
CONNECTIONIST TRANSFER
403
Generalisability: A separate set of data was used to test the gener alisation performance of the system. The testing data was collected from people not associated with our researches. The data was compared with the training corpus, and the sentences that appeared in the training data were removed. An LR parser parsed the sentences to English f-structures. The English were translated into German manually, and the translations were parsed by a German LR-parser. We picked the most probable struc ture when a parsing result was ambiguous. There were 154 f-structure pairs after we eliminated the wrongly-parsed sentences. The mapper achieved 61.7% accuracy on the testing data. Considering the limited number of training samples, this performance was encouraging. Previous research as in (Chrisman 1991) did not generalise to deal with unseen data. 6
Discussion
The application of the connectionist transfer described in this paper has its restrictions. First, it requires well-formed f-structures for both the input and output sentences. This greatly limits the applicable domain of the approach to well-structured 'clean' languages. It is difficult to use this approach for spoken language where performance data like ungrammatical utterances, noises, false starts are pervasive. Another restriction is that this approached can only achieve satisfactory performance when the input and output languages are similar, in the sense that the translation equivalents in the two languages mostly have similar recursive f-structures. Although the system can deal with structurally dif ferent input/output sentences, like the aforementioned example of [sentence GOODBYE] and [sentence AUF [obj WIEDERHOEREN]], we believe that the performance would drop significantly if drastic structure differences between translation equivalents are very common for the two languages in question. Fortunately, as shown by our data, the structural difference between English and German is not so drastic to ruin our system's performance. Although we had done some scalability experiment, it is unclear how the system will perform if we increase the lexicon significantly instead of by 2%. Because of the limitation of available data, we found it very difficult to conduct scalability experiments with much more expanded lexicon. We hope that with stable incremental performance, the system can be gradually and easily retrained to deal with more complicated problems.
404 7
YE-YI WANG & A L E X WAIBEL Conclusion
Aiming at the difficulties in symbolic transfer, we have proposed a connectionist transfer system t h a t maps between f-structures of two languages. It can discover meaningful linguistic features by learning. Its performance is promising with respect to learnability, scalability and generalisability. REFERENCES Bresnan, Joan. 1982. The Mental Representation Cambridge, Mass.: MIT Press.
of Grammatical
Relations.
Chrisman, Lonnie. 1991. "Learning Recursive Distributed Representations for Holistic Computation". Connection Science 3.4:345-366. Gorin, Allen L., Steve E. Levinson, A. N. Gertner &E. Goldman. 1991. "Adoptive Acquisition of Language". Computer Speech and Language 5:101-132. Miikkulainen, Risto & Michael G. Dyer. 1989. "A Modular Neural Network Ar chitecture for Sequential Paraphrasing of Script-Based Stories". Proceedings of the International Joint Conference on Neural Networks. IEEE. Nirenburg, Sergei, Victor Raskin & Allen B. Tuker. 1987. "The Structure of Interlingua in TRANSLATOR". Machine Translation: Theoretical and Meth odological Issues ed. by Sergei Nirenburg, 90-113. Cambridge: Cambridge University Press. Wang, Ye-Yi. 1994. "Dual-Coding Theory and Connectionist Lexical Selection". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACĽ94), student session, 325-327. White, John S. 1987. "The Research Environment in the METAL Project". Ma chine Translation: Theoretical and Methodological Issues ed. by Sergei Niren burg, 225-246. Cambridge: Cambridge University Press.
Acquisition of Translation Rules from Parallel C o r p o r a Yuji Matsumoto and Mihoko K i t a m u r a Graduate School of Information Science Nara Institute of Science and Technology
Abstract This article presents a method of automatic acquisition of transla tion rules from a bilingual corpus. Translation rules are extracted by the results of structural matching of parallel sentences. The struc tural matching process is controlled by the word similarity dictionary, which is also obtained from the parallel corpus. The system acquires translation equivalences of word-level as well as those of multiple word or phrase-level.
1
Introduction
The major issues in Machine Translation are the ways to acquire transla tion knowledge and to apply the knowledge to real systems without causing unexpected side-effect phenomena. Hand-coding of transfer rules suffers from the problems of enormous manual labour and the difficulty of main taining their consistency. Example-based translation (Sumita 90; Sato 90) is supposed to be a method to cope with this problem. Unlike transfer-based approaches, the idea is to carry out translation by referring to translation examples t h a t give the best similarity to the given sentence. T h e key technique is to define t h e similarity between the given sentence and the examples and to identify the ones with the best similarity. Robustness and scalability are the claims of this approach. However, there are at least two important prob lems t h a t haven't been answered. One is "knowledge access bottleneck," which concerns the selection of the most similar example. Similarities are usually defined only for fixed and local structures, such as predicate argu ment structures and compound nominals. T h e units of translation cannot always be such fixed structures and may vary according to the language pairs. Similarity should be defined in a more flexible way. T h e other is "knowledge acquisition bottleneck." In example based translation, the par allel examples have to be aligned not only at sentence-level but word or
406
YUJI MATSUMOTO & MIHOKO KITAMURA
phrase-level. Although the sentence-level alignment can be done automat ically using statistics, e.g., (Utsuro et al. 94), the word-level alignment is not an easy task especially when the system tries to cover wide syntactic phenomena. This paper presents a method of automatic acquisition of translation rules from a parallel corpus of English and Japanese. Translation rules in this paper refer to word selection rules and translation templates t h a t represent word-level and phrase-level translation rules. A translation tem plate are regarded as a phrasal translation rule. Since translation rules may change according to the target domain, this method shed a light on an easy and effective way for developing domain dependent translation rules by accumulating a parallel corpus. 2 A c q u i s i t i o n of T r a n s l a t i o n R u l e s Figure 1 shows the flow of the acquisition of translation rules. Following three types of resources are assumed: 1. A Parallel corpus of the source and target languages. 2. G r a m m a r s and dictionaries of the source and target languages. 3. A machine readable bilingual dictionary. The automatic acquisition of translation rules is composed of the follow ing three processes: C a l c u l a t i o n of w o r d s i m i l a r i t i e s Calculation of the similarities of word pairs of the source and target languages based on their co-occurrence frequencies in the parallel corpus. S t r u c t u r a l m a t c h i n g Structural matching of the dependency structures obtained through parsing of parallel sentences. A c q u i s i t i o n of t r a n s l a t i o n r u l e s Acquisition of translation rules based on the structural matched results. We focus on a bilingual corpus of Japanese and English and assume t h a t sentence-level alignment has been done on the corpus. In case they are not aligned, we can have them aligned using an existing alignment algorithm such as (Kay & Röscheisen 93) (Utsuro et al 94). 2.1 Calculation
of Word
Similarities
We define the similarity of a pair of Japanese and English words by a numerical value between 0 and 1. We use the following two resources for
ACQUISITION OF TRANSLATION RULES
407
Figure 1: The flow of translation rules acquisition
obtaining the similarity: • a machine readable bilingual dictionary • a bilingual corpus of Japanese and English As for the former, we assign value 1 to the translation pairs appearing in the bilingual dictionary. As for the latter, we use the basic calculation method of the similarity proposed by (Kay & Röscheisen 93). Unlike their method, we preprocessed the corpus by analyzing them morphologically to obtain the base form of the words. The similarity of a pair of Japanese and an English words is defined by the numbers of their total occurrences and co-occurrences in the corpus. The similarity of a Japanese and English
408
YUJI MATSUMOTO & MIHOKO KITAMURA
English: Companies compensate agents. Japanese: The best score = 1.55
Figure 2: A result of structural matching
, where fj and fe are the total word-pair, is defined by sim(wj,WE) = numbers of the occurrences of the Japanese word wj and the English word WE, and fje means the total number of co-occurrence of wj and WE, that is, the number of occurrences they appear in corresponding sentences. 2.2 Structural matching of parallel sentences Corresponding Japanese and English sentences in the parallel corpus are parsed with LFG-like grammars, resulting in feature structures. We do not use any semantic information in the current implementation. When a sen tence includes syntactic ambiguity, the result is represented as a disjunctive feature structure. A feature structure is regarded as a directed acyclic graph (DAG). In the subsequent process of structural matching, we use the part of the DAG that relates with content words (such as nouns, verbs, adjectives and adverbs). The resulting DAG represents a (disjunctive) dependency structure of the content words in the sentence. We start with a pair of dependency graphs of Japanese and English sentences and find the most plausible graph matching between them. We use the word similarities described in the previous section in the matching process. The similarity of word pairs is extended to the similarity of subgraphs in the dependency structures. A sample result of structural match is shown in Figure 2. The basic definition and algorithm follows (Matsumoto et al. 93), though the similarity measures of words and subgraphs are refined. When the corresponding subgraphs (nodes in circles pointed by a bidi-
ACQUISITION OF TRANSLATION RULES
409
rectional arrow in Figure 2) consist of single words, the word similarity is used for their similarity. When any of the subgraphs contains more than one content word, we placed the following criterion: The higher the sim ilarity of a word pair the finer their corresponding subgraphs should be. This means that mutually very similar words should have an exact match whereas mutually dissimilar words, when they are matched against each other by the structural constraint, are better included in coarse subgraphs. To achieve this criterion, we defined the following formula for calculating mutual similarity between subgraphs: Let s and t be subgraphs matched against and Vs and Vt be the sets of contents words in s and t. We can assume, without loss of generality, that \VS\ is not greater than \Vt\ (Vs and Vt can be switched if it is not the case). Let Dp be the set of pairs of elements from \VS\ and \Vt\ defined by an injection (one-to-one mapping) p : |V s → \Vt\. Dp = {(a,p(a)) \a є VS} Then, the average similarity of words between |Vg| and |Vį| is defined as follows:
To achieve the above criterion, we put a threshold value Th (0 < Th < 1) where a similarity value higher than Th is supposed to indicate that they are mutually similar. The following formula of similarity between two subgraphs realizes the criterion in that the total similarity is bent toward the threshold value according to the size of subgraphs. Dividing the difference of AverageSim and Th by the size of subgraphs works as a penalty for graphs that are mutually similar and as a reward for graphs that are mutually dissimilar.
The branch-and-bound algorithm is employed for the search of the graph matching that gives the highest similarity value. Figure 2 shows an example of dependency structures and the result of the structural matching, in which the corresponding pairs are linked by arrows. Here the best score is the total similarity of the most similar graph matching. The threshold is set at 0.15.
410
YUJI MATSUMOTO & MIHOKO KITAMURA
2.3 Acquisition of translation rules After accumulating structurally matched translation examples, the ac quisition of translation rules is performed in the following steps. We assume a thesaurus for describing the constraints on the applicability of the acquired rules. Suppose we concentrate on a particular word or a particular phrase in the source language graphs that appear as a subgraph in matching graphs. We refer to the subgraph as t 1. Collect all the matched graphs that contain the same subgraph as t 2. Extract the graph t and its children together with the correspond ing part of the target language tree. Some heuristics are applied in this process: Corresponding pairs of pronouns are deleted, and zero personal pronouns in Japanese sentences are recovered. 3. The child elements are generalized using the classes in the thesaurus, which is identified as the condition on the applicability of the rule. The system acquires two types of translation rules that represent wordlevel and phrase-level translation rules. When the top subgraph consists of a single content word, we regard that the corresponding subgraphs give a a word selection rule. On the other hand, when the top subgraph consists of more than one content word, we regard it as a phrasal expression, and call it a translation template. Figure 2 shows an example of phrasal-level correspondence, "compensate : Since we assume the translation is influenced by the adjacent elements, i.e., the words that directly modify the word in the subgraph, we generalize the information in the collected matches so as to identify the exact contexts in which the translation rule is applicable. From the set of partial graphs that share the same parent nodes, trans lation rules in the form of feature structures are obtained. In the experiment described below, we focus on acquiring JapaneseEnglish and English-Japanese translation rules related with verbs, nouns and adjectives. 3 Experiments of translation rule acquisition We used Torihiki Jouken Hyougenhou Jiten (Collection of JapaneseEnglish expressions for business contracts, 9,804 sentences) (Ishigami 92)
ACQUISITION OF TRANSLATION RULES
Simirality
wE abnormal accessory accountant accumulative accurate address adjudge administrative adopt advancement advancement afterward agent
1 0.923077 0.941176 1 0.769231 0.764977 1 0.8 1 1 0.8 0.8 0.935583
411
fe
fi
fje
2 14 9 2 5 111 2 3 2 4 4 2 1004
2 12 8 2 8 106 2 2 2 4 6 3 952
2 12 8 2 5 83 2 2 2 4 4 2 915
Table 1: Examples of word similarity word
make business exclusive
sentence 184 254 114 309 191 127
parsing 183(99.5%) 245(96.5%) 103(90.4%) 309(100%) 191(100%) 127(100%)
matching 180(97.8%) 242(95.3%) 99(86.8%) 298(96.4%) 179(93.7%) 116(91.3%)
word-level 115(63.9%) 144(59.5%) 68(68.7%) 184(61.7%) 92(51.4%) 27(23.3%)
phrase-level 65(36.1%) 97(40.1%) 31(31.3%) 113(37.9%) 87(48.6%) 88(75.9%)
Table 2: Statistics of parsing and matching results and EDICT 19941 and Kodanska Japanese-English dictionary (Shimizu 79) (93,106 words) as the base resources. We also used an electronic version of Japanese thesaurus (called Bunrui-Goi-Hyo, BGH) (NLRI 94) and Roget's Thesaurus (Roget 11) for specifying the semantic classes. The current system works only with simple declarative sentences. 3.1 Acquisition of translation rules Total of 948 word pairs of Japanese and English are obtained by the method for the calculation of word-word similarity between two languages described in Section 2.1. Some examples of the similarity obtained in the 1
EDICT 1994 is obtainable through ftp via monu6.cc.monash.edu.au:pub/nihongo
412
YUJI MATSUMOTO & MIHOKO KITAMURA
experiment are shown in Table 1. We get a number of domain specific terms about business contracts, such as "agent: and "accountant: ," which are not found in the ordi nary bilingual dictionaries. Out of the 948 word pairs we obtained, only 236 appear in EDICT or Kodansha Japanese-English dictionary. Acquisition of word pairs from domain specific parallel corpora is very important, since many domain specific word pairs often do not appear in ordinary bilingual dictionaries. However, it should also be noted that the repetitive occur rences of the same expression causes a slight error in the similarity of the pairs. We selected several Japanese and English words of frequent occurrence and collected structurally matched results. Some of the results for those words are shown in Table 2. For example, out of 184 occurrence of Japanese verb " ", 183 sentences were successfully parsed (meaning that the cor rect parse was included in the possible parses), and 180 sentences succeeded in structural matching, in which 115 sentences had the top subgraph with a single content word, and 65 sentences had the top subgraph with more than one content word. To acquire word selection rules, the results are classified into the groups according to the translated target words. A word selection rule is acquired from each target word by generalizing the child nouns by the classes in the thesaurus. The word selection rules for are summerized in the upper part of Table 3. For instance, the table specifies that is translated into "give" when its subject is either of the semantic classes, substance, school, store and difference and its object is either of the class of difference, unit and so on. Phrasal translation rules are treated in the same way. Such examples of are shown in the lower part of Table 3. For instance, the Japanese phrase is translated into "X compensate Υ", if X and Y satisfy the semantic constraints described in the table. 3.2 The translation rules The translation rules described above are converted into the following data structure in our machine translation system. t r _ d i c t ( index, source feature structure, target feature structure, condition).
413
ACQUISITION OF TRANSLATION RULES
nominative (ga) objective (wo) dative (ni) [substance], [difference], [unit], [substance] [school], [store], [chance], [feeling], [store], [school] [number], [start end] [range seat track] [difference] [cause] [change] affect(8) [trade] [propriety] [store] [school] confer (6) [range seat track] [difference], [school], [school], [feeling] furnish (3) [store] [range seat track] render (1) [difference] [care] aíFord(l) [harmony] provide(l) [difference] t h e n u m b e r of word occurrence is in p a r e n t h e s e s . T h e n a m e of semantic classes in t h e t h e s a u r u s is in s q u a r e brackets. Engllish verb give(58)
Japanese patterns [21 [store,school,cause,...] [l][store,school [1][store,school]
English patterns
[1] affect [2] (17) 2) (2)
"21 [store,school] 21[store,school] 1 store,school 1 store] [2] [store] (1) [3][substance] [1][store [21[store] t h e n u m b e r of word occurrence is in p a r e n t h e s e s .
(1)
[1] [1] [1] 1
compensate [2] assent to [2] authorize [2] furnish [2] with [3]
Table 3: Acquired translation rules of index The index word of the translation rule. source feature structure A feature structure of the source language. target feature structure A feature structure of the target language. condition The semantic condition for the rule described by a set of seman tic classes for the variables appearing in the source feature structure. In the condition, checksum/2 is a Prolog predicate for checking the semantic classes of the variables (semantics classes are expressed by the class numbers in the thesaurus). Identifying the most suitable semantic classes in the thesaurus is by no means an easy task. In the current implementation, we use the semantic classes at the lowest level in the Japanese thesaurus BGH, which has 6 layers. This leads the description of the semantic condition to be a list of the lowest level semantic classes. Therefore, in our current implementation the translation rules compiled with few translation examples are far from
414
YUJI
MATSUMOTO & MIHOKO KITAMURA
complete. Some of the final form of translation rules are represented as follows:
[ p r e d : a s s e n t ( v e r b ) , subj:X, true ) .
to:Ζ ] ,
[ pred : g i v e ( v e r b ) , s u b j : X , o b j 1:Y, obj2:Z J , ( checksem(X,[11000,11040,11600, . . . ] ) , checksem(Y,[11642,11910,13004,...]), checksem(Z,[11000,11040,12630,...]) ) ) .
[ p r e d : r e f e r e n c e( n o u n ) true ) .
],
4 Discussion and Related Works Our machine translation system based on the acquired translation rules has the following characteristics: The system uniformly deals with word selection rules such as "confer and phrasal translation rules such as X compensate Y. Even it there is no translation rule to apply, the system uses the bilingual dictionary as the default. Translation pairs in the dictionary are regarded as word selection rules with no condition. Since all the translation rules are acquired from translation examples, manual compilation of translation rules is made minimal. Also, since the structural matching results used to obtain the translation rules are sym metric, both English-Japanese and Japanese-English translation rules are acquired, making two-way translation possible. Another important characteristic is that ambiguity (ambiguous transla tions caused by multiple applicable translation rules and ambiguous struc tural analyses) are resolved by putting priority to the translation rules with more specific information. The frequency information of translation pairs is also used for deciding the priority among the translation options. The parsing and generation phases share the grammars and dictionaries that are used in the acquisition phase of the translation rules. This assures no contradiction among the parsing, generation and translation rules.
ACQUISITION OF TRANSLATION RULES
415
On the other hand, the following issues should be considered: The quality of the translation rules depends on the quality of the the saurus. There are some unadmissible word selection and phrasal rules ac quired in the experiment. For example, the word selection rule, " X[human] Y[problem] (" " means advocate)" was paired with "make Y[problem] to X[human]," which is not a good translation rule. Rather, "make an objection to X[human]: X[human] " should be considered as an appropriate idiomatic expression. Idiomatic expressions like this example should be distinguished from normal word selection rules. The proposed method is suitable to formal domains. An experiment with colloquial expressions reveals much more difficulties in acquiring "good" translation rules. Moreover, the current method cannot cope with expres sions that necessitate contextual information. The method should be augmented so as to deal with complex sentences. We do not think that a direct augmentation of the structure matching algo rithm is applicable to complex sentences. Some two-level technique should be developed, the first level is to find an appropriate decomposition of com plex sentences and the proposed structural matching is applicable at the second level. A similar work for acquiring translation rules from parallel corpora is discussed in (Kaji 92), in which a bottom-up method is used for finding cor responding phrases (i.e. partial parse trees). We use dependency structures, which we think, is a critical point, since word order is not normally preserved between Japanese and English sentences while dependency between content words is preserved in most of the cases. (Watanabe 93) proposed a method of using matched pairs of dependency structures of Japanese and English sentences for improving translation rules. The algorithm of finding the structural correspondence is different from ours. Our method uses a more finer similarity measure that is learned from parallel corpus. As for the translation rule acquisition, their objective is to improve existing transfer rules whereas our objective is to compile the whole translation rules altogether. 5
Conclusions
The translation rules obtained by the proposed method can be integrated into an existing machine translation system. Generally, translation may differ depending on the domain. Our system is easily adapted to any domain provided that sizable parallel corpora of that domain are accumulated.
416
YUJI MATSUMOTO & MIHOKO KITAMURA
improve the acquired translation rules both in quality and quantity, we need to enlarge the scale of the parallel corpora. Another possible way to improve the translation rules is to give the post-edited translation results back to the acquisition phase. By doing this, missing translation rules are gradually acquired. REFERENCES Ishigami, Susumu. 1992. Torihiki Jouken Hyougenhou Jiten. Tokyo: Interna tional Enterprise Development Co. Kaji, Hiroyuki, Y. Kida & Y. Morimoto. 1992. "Learing Translation Templates from Bilingual Text". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), vol.11, 672-678. Nantes, France. Kay, Martin & M. Röscheisen. 1993. "Text-Translation Alignment". Computa tional Linguistics 19:1.121-142. Matsumoto, Yuji, H. Ishimoto & T. Utsuro. 1993. "Structural Matching of Parallel Texts". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL'93), 23-30. Columbus, Ohio. National Language Research Institute. 1994. Bunrui-Goi-Hyo [Word List by Semantic Principles]. Tokyo: Syuei Syuppan. Roget, Peter M. 1911. Rogeťs Thesaurus. New York: Crowell. Sato, Satoshi & M. Nagao. 1990. "Toward Memory-Based Translation". Pro ceedings of the 14th International Conference on Computational Linguistics (COLING-90), vol.III, 247-252. Helsinki, Finland. Shieber, Stuart M., G. van Noord, R.C. Moore & F.C.N. Pereira. 1990. "A Semantic Head-Driven Generation Algorithm for Unification-Based Formal isms". Computational Linguistics 16:1.30-42. Shimizu, Mamoru & N. Narita. 1979. Japanese-English Dictionary. Tokyo: Kodansha Co. Sumita, Eiichiro & H. Iida. 1991. "Experiments and Prospects of Example-Based Machine Translation". Proceedings of 29th Annual Meeting of the Association for Computational Linguistic (ACĽ91), 185-192. Berkeley, California. Utsuro, Takehito, H. Ikeda, M. Yamane, Y. Matsumoto & M. Nagao. 1994. "Bilingual Text Matching Using Bilingual Dictionary and Statistics". Pro ceedings of the lįth International Conference on Computational Linguistics (COLING-9Ą), vol.11, 1076-1082. Kyoto, Japan. Watanabe, Hideo. 1993. "A Method for Extracting Translation Patterns from Translation Examples". Proceedings of the 5th Int. Conf. on Theoretical and Methodological Issues in Machine Translation (TMI-93), 292-301. Kyoto, Japan.
Clause Recognition in t h e Framework of Alignment HARRIS V.
PAPAGEORGIOU
Institute for Language and Speech Processing — ILSP National Technical University of Athens — NTUA Abstract In this paper we explore the possibility of achieving reliable clause identification of unrestricted text by using POS information ( as the output of a unification rule-based part of speech tagger ) a CMTAG module trying mainly to fix errors from earlier processing and a lin guistic rule-based parser. Identification of simple/complex clauses is considered here as a basic component in the framework of bilingual alignment of parallel texts. One of the important points of this work is the ability for processing very long sentences. Parser is capable of analysing and labeling clause structure. The system is applied to an experimental corpus. The results we have obtained are very promising. 1
Introduction
Recent research in bilingual alignment has explored mainly statistical meth ods. While these aligners achieve surprisingly high accuracy when perform ing at the sentence level (Brown 1991; Kay 1988; Gale 1991; Chen 1993) it remains an open issue how to generalise these techniques for alignment of phrases at the subsentence level because of the inherent assumptions of these methods. A number of recent proposals to the identification of subsentential translations have been developed that tackle the problem at dif ferent levels (Utsuro 1994; Kupiec 1993; Kaji 1992; Grishman 1994; Dagan 1993; Church 1993; Matsumoto 1993; Smadja 1992). However the detection of clausal, embedded translations in bilingual parallel corpora remains a difficult problem due to the fact that there is considerable divergence from the desirable one-to-one clause correspondence (Santos 1994). Papageorgiou (1994) describes a generic alignment scheme which is based on the principle that semantic content and discourse function are preserved by translation. At the sentence level, the aligner obtained performance comparable to that of statistical aligners. In this paper we are mainly concerned with the first basic step in clause alignment, that is clause recognition. Even though identification of simple/ complex clauses is considered here as a basic component in the framework
418
HARRIS V. PAPAGEORGIOU
of bilingual alignment of parallel texts, some cross points will be made concerning partial parsing. 2
Previous work
Automatic detection of clause boundaries is a prerequisite for clause align ment. It is also a major issue in parsing. According to Koskenniemi (1990): "Clause boundaries are easier to determine if we have the correct readings of words available". And conversely, it is more convenient to write constraint rules for disambiguation and head-modifier relations if one can assume that the clause boundaries are already there (Koskenniemi 1992). Two different approaches have been extensively recorded in the literat ure: regular expression methods and stochastic methods. The former use regular expression grammars and constraints expanded into a deterministic FSA for clause recognition. Ejerhed (1988) uses a regular expression method and her system: • looked for ηομη phrases in a preliminary stage; • concentrated on certain characteristics present in the beginnings of clauses; and • assumed that the recognition of any beginning of a clause automatic ally leads to the syntactic closure of the previous clause. Another assumption often used by researchers involved the construction of the grammar: the text was expected to being fully and correctly disambig uated (Ejerhed 1988; Coniam 1991). Several errors confusing VBD (past tense) and VBN (past participle) as well as IN (preposition) and CS (sub ordinating conjunction) which were made during the tagging process, led to incorrect recognition of clauses in many cases. A systematic failure of the system described by Ejerhed (1988) was its incapability to capture clauses beginning with a CC (coordinating conjunc tion) followed by a tensed verb, as in the example: [The Purchasing Departments are well operated and follow generally accepted practices]. The same applies also for cases where a preposition is followed by a whword, i.e., before (WDT WPO WP$ WRB WQL) as in the example: [The City Executive Committee deserves praise for the manner in] [which the election was conducted]. In (Koskenniemi 1992) constraint rules are handcoded specifying what kinds of clues must be present in order to put in a clause boundary. In the case of
CLAUSE RECOGNITION
419
non-finite constructions, a non-finite verbal skeleton is constructed starting with certain kinds of non-finite verb and ending with the first main verb to the right. A distinction is made between finite and non-finite clause constructions by using different tags. This approach increases the amount of ambiguity and burdens the work of disambiguation process. Another feature of the system is that a first level of centre embedding has taken into consideration, as in the example: @@The man ... @( who came first @ ) got the job @@.
Technical constraints for feasible clause bracketing have not allowed a second or third level of embedded clauses. As for the stochastic approach, training material is needed in order to fine-tune the parameters of the model (Ejerhed 1988; Ramshaw 1995). In (Ejerhed 1988), the training material included markers for beginning and end of clauses. The system was also trained to recognise tensed verbs. Res ults recorded were surprisingly good. However, a comparison of the nature of the errors in a sample of a regular expression approach and a sample of the stochastic recogniser revealed that while the finitary approach errors are sys tematically due to under-recognising clause boundaries, the stochastic pro gram errors are due both to over-recognising and under-recognising clause boundaries. These qualitative results give preference to finitary methods since under-recognising clauses is not actually a problem given that they can be easily recovered using simple Dynamic Programming techniques. On the other hand overgeneration coupled with errors by the stochastic module (due to wrong predictions of clause openings or/and closings) makes it more difficult for the alignment algorithm to reconstruct the clause structure of the sentence. 3
The model
The methodology adopted here is surface oriented and stepwise. An inspira tion has been the so called CASS (Cascaded Analysis of Syntactic Structure) described in (Abney 1990). The goal is to recover syntactic information effi ciently and reliably, by sacrificing completeness and depth of analysis. The full framework of automatic clause recognition is depicted in Figure 1. The preprocessing module is a single deterministic FSA which partitions input streams into tokens. In the context of alignment, sentence and word boundaries as well as numbers, dates, abbreviations, paragraph boundaries and various sorts of punctuation are extracted. The rules of the text gram mar were designed to capture the introduction of text sentences and also to
420
HARRIS V. PAPAGEORGIOU
Fig. 1: The model architecture.
define text adjunct formulations (Nunberg 1990). The tagging analysis is done by the well-known transformation-based tagger (Brill 1993a; Brill 1993b; Brill 1994). The initial state was a trigram tagger tuned and trained on a small portion of a (different but of the same type) pre-tagged corpus. Training the contextual-rule tagger was done on a small amount of text of the CELEX database (the computerised documentation system on Community Law) about 70,000 words. CMTAG (Clause Marker TAGging) is an essential step supplementing the tagging by reducing ambiguities concerning possible clause markers and by enriching the annotation of the text with information about certain types of clauses. Its role is: • to extend parts of speech over more than one orthographic word (quite similar to IDIOMTAG module of CLAWS tagger). This is done only for compound subordinators such as: so as/that, in order to/for/that, as if and for complex prepositions such as: according to, by means of, due to. • to discriminate a non finite verbal skeleton from a finite construc tion: for this purpose we insert tags for cases like: to/TO the/DT Treaty/NNP establishing/VBG-F the/DT European/JJ... Examples of finite verbal constructions are: — They [ are going to adopt ] ... — They [ would not have been investigating ] this .. . — I [ would like to make ] some questions ... In the last example, we include "to make" in the finite verb chain, an interpretation that will be validated by the clause alignment al-
CLAUSE RECOGNITION
421
gorithm. Examples of non-finite verbal constructions are: — the distillation [ indicated ] in Article 39 thereof is decided on ... — [ Given ] the improvements in market conditions .. . • to reduce ambiguity by imposing constraints to the tagged corpus. Here, we try to fix errors of mis-analysed complementisers by insert ing clause markers before possible conjunctions: for example if there is a verbal construction before the next candidate conjunction or punc tuation and after a possible subordinator as in the following case: . . . voluntary distillation as/RB provided for in Articles 38, . . . "as" that was tagged RB(adverb)is converted to as/CS. • to label certain types of clauses depending on clause opens. It is a twolevel module distinguishing adverbial clauses, relative, non-finite and coordinate clauses as in (Quirk 1985). At a second stage, we predict a subcategorisation of adverbial clauses into eight types, following (Collins 1992). This information will be explored if it is worthwhile by the clause alignment algorithm. Actually, it does not affect the following syntactic processing. Finally the proposed grammar constructs the clause analysis for input sen tences. First, we introduce a few definitions: subord for a set of comple mentisers, punct for a set of punctuation marks. • subord (CS|WDT|WRB|WP|WP$) • punct (- │, | : | . | ; | ") The syntactic analysis consists of a set of rules trying to match the input against the rule patterns. The first pattern defines complete subordinate clauses as consisting of an optional coordinating conjunction (CC) followed by an obligatory subordinating conjunction (subord) followed by optional nominal elements followed by an obligatory verbal construction (as this was defined in CMTAG description) followed by optional nominal elements ex cept anything listed in subord or punct sets above or a new verbal construc tion. This rule pattern is expressed as: (CC)?{subord}({nomelements}│{punct})*(fvskeleton){nomelements}* The second pattern defines non-finite clauses starting with a non-finite verbal skeleton followed by optional nominal elements except anything listed in subord/punct lists above or a new verbal construction. The expression representing this pattern is: (nfvskeleton){nomelements}* The third pattern defines coordinate clauses introduced by an obligatory coordinating conjunction (CC) followed by an optional adverb followed by a verbal skeleton followed by the same ending as in the previous cases.
422
HARRIS V. PAPAGEORGIOU
(CC)(adv)*((fvskeleton)│(nfvskeleton)){nomelements}* There are six other rule patterns capturing basic clause fragments (verb phrase fragments, noun phrase fragments and adjuncts) and trying to identify the role of noun phrase fragments (SUBJ,OBJ..). Finally, the third and last part consists of grammar rules and actions trying to construct clauses(main clauses and embedded clauses) from the previous non-terminals identified by the second part. For example, a simple rule is that a main clause can be a noun phrase fragment followed by a sequence of clause(s) (identified by the second part) and a verb phrase fragment, as: [ the total quantity of table wine ] [ for-which each producer may submit one or more delivery contract declarations for approval by the intervention agency ] [ should be limited ] [ to an appropriate percentage of the quantity of table wine ]. where the result is two clauses: [ the total quantity of table wine should be limited to an appropriate percentage of the quantity of table wine] [ for-which each producer may submit one or more delivery contract declarations for approval by the intervention agency ]. Regulation R0086 R0104 R0746 R1111 R1117 R1120 R1369 R1425 R1486 R1516 TOTAL
Sentences Clauses Identified Clauses Wrong-place 22 46 44 1 11 23 23 68 134 127 8 30 40 39 1 24 36 35 54 33 54 2 41 25 39 1 64 39 1 63 25 43 45 2 38 81 79 9 315
562
548
25
Table 1: Test Samples after syntactic analysis. 4
Results
The proposed model was applied to a test suit of ten regulations of the CELEX database. (Table 3 shows the results obtained only for the English
CLAUSE RECOGNITION
423
corpus though the experiment was done for the English-Greek language pair set of sentences). The total success rate of the current system is about 93% (out of 562 clauses the system identified 548 cases and in these there was a clause marker in a wrong place, in twenty-five cases). Under-recognition errors made by the parser were due to inherited er rors made by the tagger ( confusing NN and VB ) and propagated to the subsequent modules. The proper way to deal with these cases is probably to establish a NP filter to correct the most common errors (similar to that in (Abney 1996) ). Wrong-place errors are mainly due to the incapability of the system to identify correctly the clause opens in coordinate constructions where a CC is followed by a noun-phrase followed by a subordinate clause followed by a verb phrase fragment. Differences between success rates in different regulations are partly ex plained by the structural complexity of very long sentences that are char acteristic of the samples. 5
Conclusions
We have introduced a method for recognising clauses for subsentence align ment purposes. The method is robust with respect to wrong-place errors and over-recognising errors (totally about 4%) if we ignore underrecognition errors that might be handled by the alignment algorithm. Some improve ments might be achieved by inserting 'repair' filters before parsing. Work on the alignment algorithm is being currently carried out. Acknowledgements. I want to thank greatly Stelios Piperidis for his extens ive work on the Greek compound conjunctions. Special credit is due to Penny Lambropoulou for commenting on the work. REFERENCES Abney, Steven. 1990. "Rapid Incremental Parsing with Repair". Proceedings of the 6th New OED Conference, 1-9. University of Waterloo. 1996. Forthcoming. "Part-of-Speech Tagging and Partial Parsing". To appear in Corpus-Based Methods in Language and Speech ed. by Ken Church, S. Young & G. Bloothooft. Dordrecht: Kluwer. Brown, Peter F., J.C. Lai & R. Mercer. 1991. "Aligning Sentences in Parallel Corpora". Proceedings of the 29th Annual Conference of the Association for Computational Linguistics, 169-176. Berkley, Calif.
424
HARRIS V. PAPAGEORGIOU
Brill, Eric. 1993a. "Automatic Grammar Induction and Parsing Free Text: A Transformation-based Approach". Proceedings of the DARPA Speech and Natural Language Workshop, 237-242. 1993b. "A Corpus-Based Approach to Language Learning". PhD thesis. Philadelphia: University of Pennsylvania. 1994. "Some Advances in Transformat ion-Based Part-of-Speech Tagging". Proceedings of the 12th National Conference on Artificial Intelligence (AAAI94) , 722-727. Church, Ken Ward. 1993. "Char_align: A Program for Aligning Parallel Texts at the Character Level". Proceedings of the 31st Annual Conference of the Association ¡or Computational Linguistics, 1-8. Columbus, Ohio. Coniam, David 1991. "Boundary Marker: System Description" NERC report. Chen, Stanley. 1993. "Aligning Sentences in Bilingual Corpora Using Lexical Information". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 9-16. Columbus, Ohio. Collins Cobuild. 1992. Collins Cobuild English Grammar. Harper Colllins, London. Dagan, Ido, . Church & W. Gale. 1993. " Robust Bilingual Word Alignment for Machine-Aided Translation". Proceedings of the Workshop on Very Large Corpora, 1-8. Columbus, Ohio. Ejerhed, Eva. 1988. "Finding Clauses in Unrestricted Text by Finitary and Stochastic Methods". Proceedings of the 2nd Conference on Applied Natural Language Processing, 219-227. Austin, Texas. Gale, William A. & Ken Church. 1991. "A Program for Aligning Sentences in Bilingual Corpora". Proceedings of the 29th Annual Conference of the Association for Computational Linguistics, Berkley, Calif., vol 2.177-184. Grishman, Ralph. 1994. "Iterative Alignment of Syntactic Structures for a Bi lingual Corpus". Proceedings of the Second Annual Workshop on Very Large Corpora, 57-68. Kyoto, Japan. Kaji, H., Y. Kida L· Yasutsugu Morimoto. 1992. "Learning Translation Tem plates from Bilingual Text". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 672-678. Nantes, France. Kay, Martin & M. Roscheisen. 1988. "Text Translation Alignment". Computa tional Linguistics 19:1. 121-142. Kupiec, Julian. 1993. "An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora". Proceedings of the 31st Annual Conference of the Association for Computational Linguistics, 17-22. Columbus, Ohio. Koskenniemi, Kimmo. 1990. "Finite State Parsing and Disambiguation". Pro ceedings of the 13th International Conference on Computational Linguistics (COLING-90), vol 2.229-232. Helsinki, Finland.
CLAUSE RECOGNITION
425
Koskenniemi, Kimmo, P. Tapanainen & A. Voutilainen. 1992. "Compiling and Using Finite State Syntactic Rules". Proceedings of the 14th International Conference on Computational Linguistics, 156-162. Nantes. Matsumoto, Yuji, H. Ishimoto, T. Utsuro & Makoto Nagao. 1993. "Structural Matching of Parallel Texts". Proceedings of the 31st Annual Conference of the Association for Computational Linguistics, 23-30. Columbus, Ohio. Nunberg, Geoffrey. 1990. "The Linguistics of Punctuation" Center for the Study of Language and Information CSLI Lecture Notes, Number 18. Papageorgiou, Harris, L. Cranias & S. Piperidis. 1994. "Automatic Alignment in Parallel Corpora". Proceedings of 32nd Annual Conference of the Association for Computational Linguistics, 334-336. Las Cruses, New Mexico. Quirk, Randolph, S. Greenbaum, G. Leech & Svartvik J. 1985. A Comprehensive Grammar of the English Language. London: Longman. Ramshaw, Lance A. & M.P Marcus. 1995. "Text Chunking using TransformationBased Learning". Third Workshop on Very Large Corpora Cambridge, Mass. Smadja, Frank. 1992. "How to Compile a Bilingual Collocational Lexicon Auto matically". AAAI -92 Workshop on Statistically-Based NLP Techniques, 65-71. San Jose, Calif. Santos, Diana. 1994. "Bilingual Alignment and Tense". Proceedings of the Second Annual Workshop on Very Large Corpora, 129-143. Kyoto, Japan. Utsuro, Takehito, H. Ikeda, M. Yamane, Y. Matsumoto & Makoto Nagao. 1994. "Bilingual Text Matching using Bilingual Dictionary and Statistics". Pro ceedings of the 15th International Conference on Computational Linguistics, 1076-1082. Nantes, Paris.
Bilingual Vocabulary Estimation from Noisy Parallel Corpora Using Variable Bag Estimation DANIEL B. J O N E S
&
HAROLD SOMERS
UMIST, Manchester Abstract This paper describes a fully automatic bilingual lexicon extraction process which can be used for estimating the bilingual vocabulary of two languages from the analysis of raw and noisy parallel corpora. No sentence alignment or other pre-processing is required. The result is a set of possible translations for each source language. Each translation is assigned a probability based on the distributional nature of the source word in relation to the target word across the parallel corpus.
1
Introduction
Extracting useful information about language behaviour from corpora is of great interest in theoretical terms as it calls into question the exact role of linguistics in Language Engineering. If the process of information extraction is also a fully automatic one, i.e., it requires no human intervention either at an initial stage (for example, tagging words with their grammatical parts of speech), or during the process's execution time, then the information extraction mechanism is of additional practical importance as it can be applied to texts under a wide variety of circumstances and for a wide variety of needs. This paper describes a fully automatic information extraction process which can be used for estimating the bilingual vocabulary of two languages from the analysis of raw and noisy parallel corpora. The vocabulary is estimated in the sense that a word in the target language text is said to be a translation of a word in the source language text given an estimation of their distribution within the parallel corpus. The corpus itself has to be presented in parallel, i.e., the source language corpus has been translated into the target language and both versions are available to the process. The approach described here does not require the parallel corpus to be 'clean': no pre-editing of the text as a whole or pre-alignment of sentences is necessary.
428 2
DANIEL . JONES & HAROLD SOMERS Related work
Similar work in this area has been carried out by a variety of researchers (Catizone et al. 1989; Kay & Röscheisen 1993; Gale & Church 1991; Fung & Church 1994; Fung & McKeown 1994; Jones & Alexa 1994). Most (if not all) of these approaches have been developed in order to bootstrap NLP (and in particular Machine Translation) systems with lexical and phrasal alignment information from parallel corpus material. The main characteristic of these approaches is they require no (to some degree or other) linguistic description or pre-processing as they rely on distributional-statistical data of word occurrence. When preprocessing is used it is itself an automatic process. Gale and Church, for example, require sentences in parallel corpora to be aligned before vocabulary estimation can be achieved to a reasonable degree of accuracy. The sentence alignment is done automatically. It is appealing, though, when preprocessing is not used at all. One ad vantage is that overall run-time of an alignment process will be much faster. Given the computationally expensive nature of this type of statistical pro cessing, any saving in processing time can be very advantageous. Fung & Church propose a method called K-vec which can be used for estimating bilingual vocabulary from 'noisy' corpora, i.e., parallel bilingual corpora which have not been pre-edited nor sententially aligned. We find the ap proach appealing in particular for the reason stated above and also for its inherent simplicity. The following section briefly describes K-vec and an extension of it which we call Variable Bag Estimation (VBE). 3
Methodology
3.1
General estimation of distribution
The method of alignment used by VBE is quite simple in principle. Firstly, positional estimation of possible translation alignments is carried out using the same process proposed by Fung & Church (1994). Briefly, this involves dividing the source and target language corpus into portions 1 . Once this is done, the presence or absence of a word in each portion is noted. For example, if we are considering the possible alignment of a source word SWi 1
Fung & Church suggest square root of the length of the corpus (counted in words) as a suitable value for the portion size; the number of portions, Ki must be the same for both corpora.
VOCABULARY ESTIMATION FROM PARALLEL CORPORA
429
with a target word TWj, the distribution of SWi and TWj over the corres ponding portions of the source and target corpora is calculated. The result is a binary vector of length for each of SWį and TWj, e.g., Vi= [1,0,1,1,...] Vj = [1,1,0,1,...]
(1) (2)
The likelihood of SWi and TWj being in a translation relation is then based on a comparison of the two vectors Vi and Vj. Once the vectors have been computed, 2 × 2 contingency matrices are calculated for the pair of vectors showing the number of portions which contain (a) both SWi and TWj, (b) SWi but not TWj, (c) TWj but not SWi and (d) neither SWi nor TWj. The word pairing is then assigned a mutual information score and a significance score from the values in the contingency matrix. The mutual information score I is based on co-occurrence probabilities, and is given by: (3) where (4) (5) and (6) The significance score is given by: (7) We tested Fung & Church's algorithm on an English-French parallel cor pus 2 . By way of example, estimations for the translation of the English word years are given in Table 1. The table is numerically sorted on the I column 2
The corpora contained 23136 and 24377 words respectively, and were taken from the ACL-European Corpus Initiative CD-ROM. The material was that of the announce ment text of the European Community's Esprit programme.
430
DANIEL . JONES & HAROLD SOMERS TW système d'un secteurs tout terme marché etats entre mais long comme données nécessaire on activités doit années travail gestion ressources ainsi communautaire conseil niveau secteur
% sera cours projet esprit techniques tous leurs peut produits ont 2 base elle phase seront traitement objectifs
I 0.596585 0.585938 0.573865 0.573865 0.521397 0.477003 0.458388 0.425473 0.40394 0.351472 0.251937 0.235995 0.213969 0.204631 0.142019 0.0295442 -0.0270394 -0.040845 -0.0814871 -0.107959 -0.184581 -0.192848 -0.23349 -0.23349 -0.23349 -0.280796 -0.292384 -0.311493 -0.311493 -0.370994 -0.385493 -0.414062 -0.455883 -0.455883 -0.455883 -0.555418 -0.602724 -0.648528 -0.921546 -1.04085 -1.23349 -1.29238 -1.69292
t 0.957939 1.05552 0.868297 0.868297 0.90991 0.689608 0.720176 0.807663 0.646115 0.529619 0.423933 0.36963 0.308215 0.349873 0.265165 0.0453256 -0.0423042 -0.0574324 -0.129934 -0.155405 -0.305193 -0.350321 -0.304279 -0.392823 -0.351351 -0.480452 -0.449324 -0.417409 -0.417409 -0.655712 -0.750294 -0.743342 -0.643668 -0.743243 -0.743243 -0.939189 -1.03716 -0.983056 -1.5487 -1.49544 -1.9111 -2.04965 -3.15809
Table 1: Estimates for possible translations of the English word years
VOCABULARY ESTIMATION FROM PARALLEL CORPORA
431
with the largest values first, as large I scores are 'better' than low scores. It can be seen that the best estimate of a translation relation lies with système, whereas the correct translation (années) is ranked only 17th. This is evidently not a good result, and certainly not as positive as the results hinted at by Fung & Church. Yet they are typical of results that we have obtained in several experiments with this algorithm, with a variety of corpora and language pairs including German and Japanese. Why could this be? One possible explanation is the size of our corpora being too small. Although the algorithm itself is independent of corpus length, it is possible that with too small a corpus the distributions of the words are too similar. It should be noted too that Fung & McKeown (1994:82) also report the poor performance of the K-vec algorithm with Japanese-English and ChineseEnglish parallel corpora: K-vec segments two texts into equal parts and only compares the words which happen to fall in the same segments. This assumes a linearity in the two texts [...]. The occurrence of inserted or deleted paragraphs is another problem which leads to nonlinearity of parallel corpora. In fact, it does not need a whole paragraph to skew the corpus: just an extra sentence near the beginning of the text can mean that many of the words you would expect to occur in portion actually occur in portion i + 1, as we discovered with a manual inspection of the corpus. Fung & McKeown (1994) tried to overcome this weakness by proposing a new algorithm, DK-vec, which compares 'recency vectors' for word pairs, comparing the amount of text between each occurrence of the word, the idea being that each such vector will have a distinctive trace, rather like a speech signal, so that techniques developed for matching such signals can be used. 3.2
Variable bag estimation
Our own approach similarly attèmpts to capture the generalisation that words which are translations of each other will appear at roughly the same equivalent places in the text. Figure 1 depicts this in a graphic, though simplistic, way. In the case shown in Figure 1, there are three instances of bilingual and three instances of bilingue. As they also appear in the same portions of the text (and nowhere else), it is logical to regard them as probable translations of one another. In order to simplify matters we can imagine the system assuming that its highest scoring estimation of a translation of SW is the TW which
432
DANIEL . JONES & HAROLD SOMERS
This paper describes a fully automatic | bilingual | lexicon extraction process which can be used for estimating the Į bilingual Į vocabulary of two languages from the analysis of raw and noisy parallel corpora. No sentence alignment or other pre-processing is required. The result is a set of possible translations for each source language. Each translation is assigned a probability based on the distributional nature of the source word in relation to the target word across the bilingual corpus.
"Cet articleHécrit-ua processus d'extraction de lexique bilingue entièrement automatique qui "peut servir a deviner le vocabulaire bilingue] d'après ľ analyse des corpus parallèles bruts et bruyants sans appariement des phrases ou autre pré-edition. Le processus donne un ensemble de traductions possibles pour chaque mot de la langue source. Avec chaque traduction est associée une probabilité laquelle est basée sur la nature distributionelie du mot source par rapport au mot cible et ce, à travers le corpu bilingue .
Fig. 1: Graphic depiction of source and target word alignment has either the highest I and/or t score. This assumption is based on the fact that those portions which contain TW are physically in or near the same position as SW is found in the source language version of the corpus. However, even in 'well behaved' corpora this is very often not the case and in noisy texts which have not been pre-processed to remove or canonicalise white space, punctuation, markup etc., the position is made more difficult and less accurate than it might otherwise be. The results in Table 1 show that although the correct translation is often given a relatively high score, it may well not be sufficient to distinguish it as the alignment to select over any of the others. Given this state of affairs, is there anything that can be done to improve the performance of the process? The problem is the generality of the por tions into which the corpus is divided. In particular, the size of the portion is arbitrary, and, in addition, as mentioned above, the quality of word align ment depends on the degree of 'skewing' exhibited by a parallel corpus, i.e., the degree to which the word positions of the texts are offset by interme diate material. However, it is logical that these factors can be alleviated if the portions are not fixed but variable in size and therefore content.
VOCABULARY ESTIMATION FROM PARALLEL CORPORA
433
A direct method of incorporating this approach into Fung & Church's work would be to simply start with a portion size based on splitting the corpus into = chunks as before, iteratively increasing or decreasing the size of the portions, and examining the relative performance of each value of K. It would be interesting to see, for example, whether or not some values of produced better results than others with respect to certain language pairs and/or sublanguages. However, it is not clear how one could use such a system in a fully automatic bootstrapping process as the value of would not be self-determining but a matter for ongoing empirical study. A more practical implementation of a variable portion size process in volves creating a minimal-sized portion of one word and gradually increasing its size until the significant matching target word appears. Intuitively, what happens is this: assuming the 'bags' are centred at roughly the right places, but assuming also that the texts are not perfectly aligned word-by-word, as the bags grow in size, more and more words are 'sucked' into them. Re member that there are several bags spread throughout the text. At first, the bags will apparently contain random words, some of them occurring in several of the bags, but, significantly, occurring just as often outside the bags. At some point, amongst a lot of rubbish, crucially the word we are looking for — the translation of the source word — will be found in nearly all the bags, and hardly anywhere else. This is when the process stops. The advantage of this is that the termination of the process is not arbitrary but is based on determining which words in the local context of the initiating points are unique to that context and not the rest of the corpus. The crucial question is: Where are the initiating points? The simplest (and most naïve) method of finding these points would be to use the same locations in the target text as the source words. In other words, if the source word under consideration occurred as the 11th, 50th, and 200th word, the initiating points in the target corpus would be the same (or rather, the equivalent taking into account the relative lengths of the two corpora). However, experiments have shown that this approach is too crude and does not provide very good results. A much better approach is to estimate the initiating points in the target text from the information used in K-vec's general estimate of distribution. If we take each entry in Table 1 in turn, the corresponding portions in which each word occurs can be used as anchor points from which the initiating points can be estimated more accurately. An initiating point IP can be determined from: IP = ((kj - 1) × K') + (K' x IPe)
(8)
434
DANIEL . JONES & HAROLD SOMERS
where kj represents the portion which contains the target word, K' is the number of words per portion (roughly equal to K, since is the square root of the length of the corpus), and IPe is an offset factor for estimating the initiating point for that portion. A neutral value for this would be 0.5 which would place IP at the mid-point of the portion containing the target word3. For example, if techniques in Table 1 occurred in (amongst others) portion number 23 and there were 150 words per portion, the IP (assuming IPe = 0.5) for the target text with respect to an alignment for the source word years would be: 3375 = ((23 - 1) x 150) + (150 χ 0.5)
(9)
Thus, the IP in this case is located at the 3375th word in the target language corpus. 4
Experiments
The use VBE makes of Fung & Church's corpus portioning information makes VBE a form of filtering process on the I and t values produced by that process. VBE can therefore be thought of as a post-filtering process which seeks to support (or otherwise) the estimations made by the K-vec process. Experiments were carried out to compare results obtained from the Kvec process with those of VBE using the English-French parallel corpus mentioned above. The VBE process used the portioning information pro duced by K-vec in order to determine its IP values by applying equation (8). Table 4 shows the results of running VBE with IP values derived from the K-vec alignment portions information for the source language word years. The first column (TW) shows the target word aligned with years. Associated with each target word is a number which is the bag size at which the likely target word emerged. As outlined in section 3.2, a bag4 is created which contains all the words at all the IPs. When the VBE process begins the bag is very small and will only contain the words which fall exactly at the IPs. However, at every iteration of the process, more words are added which are neighbours of the IP words. Neighbour words are regarded as 3
4
Any other value for IPe (between 0 and 1) would suggest a starting point nearer the start or end of the portion, and could be used in connection with a different rate of leftward or rightward expansion of the bag. For example, one might want to experiment by starting the search at the front of the portion, and expanding only rightwards. Intuitively there are several bags, one for each occurrence of the candidate target word; but for computational simplicity, all the bags can be combined as a single 'super-bag'.
VOCABULARY ESTIMATION FROM PARALLEL CORPORA
435
words immediately to the right and left of the IP. Thus, the bag increases in size as these neighbour words are added at each iteration until there emerges a word contained in the bag which does not occur outside it. This is the proposed translation. BAG SIZE
220 230 240
LIST OF T W S
années, niveau, on, projet, secteur, système ainsi, d'un, entre, mais, nécessaire, peut activités, comme, communautaire, cours, doit, gestion, leurs, long, ressources, secteurs, sera, techniques, tous, tout, travail
Table 2: VBE estimates for possible translations of the English word years using IP values derived from a previous K-vec process using the same source language word The VBE algorithm can be briefly described as follows: 1. Determine all IP values from the target language corpus partitioning information created by the K-vec process. There will be one IP for each instance of the target word under consideration. 2. At each IP create a bag from the word located at the IP plus words to the left and to the right of the IP. The value of is based on the degree on granularity required. 3. Check if the words in the bag(s) only occur in the bag(s) and nowhere else in the target language corpus. If this is true stop. If it is false increase the size of by an increment factor, e.g., 10 or 20 words5 and go to step 2. This algorithm is applied to each target language word estimated to be a translation of the source by K-vec. Once all these words have been processed by VBE the word or words with the smallest bag sizes are considered to be the most likely translations of the source word. Small bags score more highly than large bags as they indicate a greater degree of positional equivalence of the source and target word(s). If the source and target texts were identical, any given word in the source text at position χ would find its translation at position χ in the target. This is the essential principle used by VBE except that the variable bag size accommodates the practical consideration of source and target corpora being far from identical when dealing with real languages. 5
In practice, the increment factor cannot be too small as this slows down the process. On the other hand, a large increment factor results in a coarser grain-size in the results.
436
DANIEL . JONES & HAROLD SOMERS
The results shown in Table 4 were obtained by taking each TW from Table 1 and applying the VBE procedure as outlined in Section 3.2. The comparison between the two tables is quite striking as VBE has scored the correct trans lation (années) as most likely translation along with five other candidates (niveau, on, project, secteur, and système).
5
Results and observations
Table 5 shows another example of a VBE alignment. The English word under consideration is software. The correct translation, logiciel, is ranked joint 3rd. In this particular case, K-vec ranked logiciel joint 17th as a probable translation of software so again, VBE has preferred to give the correct translation a higher ranking. BAG SIZE
210 220 230 240
LIST OF T W S
résultats communauté commission, développement, domaine, leur, logiciel, programme, projets, se, technologies aux, ce, ces, cette, d'une, est, il, l'industrie, ou, pas, plus, recherche sont, systèmes, technologie, travaux
Table 3: VBE estimates for possible translations of the English word software At the time of writing an exhaustive comparison has not been made between K-vec and VBE. However it can be said that VBE does not always confirm K-vec's results and will promote certain target words making them more probable translations. As Table 4 demonstrates, even when VBE does correctly rank the proper alignment in first place, it is not the only candidate. In fact it is often the case that there are multiple alignments for any given score. However, this is largely due to the level of granularity used by the program when increasing bag size. As already mentioned, finer degrees of granularity are more computationally expensive: it is a lot quicker to increase bag sizes by 10 or 20 words at each iteration instead of, say, just two words at a time. However, this is a practical consideration and further experiments are required to establish to what extent candidate alignments are indeed distinguished from each other due to increased granularity.
VOCABULARY ESTIMATION FROM PARALLEL CORPORA 6
437
Conclusions
Although the VBE approach is fully automatic, it is not necessarily foreseen as forming part of a completely automatic MT system. We believe that a bottleneck in the creation of MT capabilities for differing language pairs is the creation of information to allow such systems to perform at a useful level of translation quality. Approaches like that of VBE can facilitate the bootstrapping of MT products more quickly than training linguists to pro duce from scratch the information required by systems having to translate between perhaps very different language groups. Hybrid systems, for instance, which use information from both human and machine sources would seem to be a sensible way to incorporate information which can be determined from automatic and human sources. REFERENCES Catizone, Roberta, Graham Russell & Susan Warwick. 1989. "Deriving Trans lation Data from Bilingual Texts". Proceedings of the 1st International Ac quisition Workshop, Detroit, Michigan. Fung, Pascale & Kathleen McKeown. 1994. "Aligning Noisy Parallel Corpora Across Language Groups: Word Pair Feature Matching by Dynamic Time Warping". Technology Partnerships for Crossing the Language Barrier: Pro ceedings of the 1st Conference of the Association for Machine Translation in the Americas, 81-88. Columbia, Maryland, U.S.A. & Kenneth Ward Church. 1994. "K-vec: A New Approach for Aligning Parallel Texts". Proceedings of the 15th International Conference on Com putational Linguistics (COLING-94), vol.11, 1096-1101. Kyoto, Japan. Gale, William A. & Kenneth W. Church. 1991. "A Program for Aligning Sen tences in Bilingual Corpora". Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACĽ91)> 177-184. Berkeley, Calif. Jones, Daniel & Melina Alexa. 1994. "Towards Automatically Aligning German Compounds with English Word Groups in an Example-based Translation System". International Conference on New Methods in Language Processing, 66-71. Manchester, U.K. (To appear in New Methods in Language Processing ed. by Daniel Jones & Harold Somers. London: University College Press.) Kay, Martin & Martin Röscheisen. 1993. "Text Translation Alignment". Com putational Linguistics 19.1:121-142.
A HMM Part-of-Speech Tagger for Korean with Wordphrasal Relations J U N G H. SHIN,* YOUNG S. HAN** & K E Y - S U N CHOI**
* Korea R&D Information Center, KIST **Korea Advanced Institute of Science and Technology Abstract This paper describes a Korean tagger that takes into account the type of wordphrases for more accurate tagging. Because Korean sen tences consist of wordphrases that contain one or more morphemes, Korean tagging must be posed differently from English tagging. We introduce a hidden Markov model that closely reflects the natural structure of Korean. The wordphrases contain more syntactic in formation such as case role than the words in English. Consequently the wordphrasal information will make better prediction leading to higher tagging accuracy. The suggested tagging model was trained on 476,090 wordphrases and tested on 10,702 wordphrases. The ex periments show that the new model can tag Korean text with 96.18% accuracy that is 0.38% higher than the English tagging method. 1
Introduction
The problem of determining part-of-speech categories for words can be transformed to the problem of deciding which states the Markov process went through during its generation of sentence. A category of a word usu ally corresponds to a state (Charniak et al. 1993). In the last few years the tagging systems based on hidden Markov model (HMM) have produced reasonably accurate results in English and other Indo-European languages (Charniak et al. 1993; Jelinek & Mercer 1980; Kupiec 1992; Merialdo 1994). A sentence in Indo-European languages is composed of words. A Korean sentence consists of wordphrases and a wordphrase is a com bination of morphemes. The patterns of morphemes that form a wordphrase are diverse and the relationship between two morphemes can differ in usage patterns according to the wordphrase they belong to. With the differences between Korean and English, it is not the best way to apply the HMM to Korean as it is used to model English. In this paper, we propose a HMM based tagging method that captures the wordphrasal
440
JUNG H. SHIN, YOUNG S. HAN & KEY-SUN CHOI
information of Korean sentences. The proposed model makes use of both wordphrasal relations and morpheme relations of each wordphrase. The proposed method requires the types of wordphrases be known be fore constructing a HMM. To classify the patterns of wordphrases, we first devised an algorithm to extract the categories of the wordphrase from the tagged text. Once a network skeleton for each wordphrase is drawn, a com plete HMM is derived by combining the wordphrasal networks. Section 2.1 outlines the characteristics of the Korean. In Section 2.2, we describe the proposed model and the method to construct the model. In Section 3, experimental results on the improvement of accuracy by proposed methods are described. In Section 4, we discuss the significance and the limitation of our work. 2
2 Wordphrase-based Hidden Markov Model
Numerous variations and extensions of hidden Markov models are reported in the literature, but little work is known on designing HMMs for agglutinative languages such as Korean. In the following we first discuss the characteristics of Korean that motivate the proposed design method. After the proposed method is introduced, it is shown by means of experiments that our method outperforms other approaches.
2.1 Characteristics of Korean
Unlike English and other Indo-European languages, a Korean sentence is not just a sequence of words, but a sequence of wordphrases. A wordphrase is composed of content morphemes and one or more function morphemes, though the function morphemes are often omitted. Function morphemes are usually placed after content morphemes. The function morphemes play a richer role in the sentence than the function words that indicate, for example, number or person in English sentences. The notable difference is that function morphemes make explicit the role of their content morphemes in the sentence (Nam 1985). The roles deliver information on deep cases as well as syntactic cases.
Because there can be more than one segmentation of a wordphrase, the number of morphemes and their corresponding parts-of-speech can also differ across the ambiguous segmentations. That is, the same wordphrase can be subdivided into different forms and categories with different numbers of morphemes. For example, 'kamkinun' is analysed into two patterns, 'kamki' + 'nun' and 'kam' + 'ki' + 'nun'. This makes morphological analysis and automatic tagging particularly difficult in Korean (Lee et al. 1994).
The wordphrases are complex enough to require a fairly lengthy grammar to generate them, but it turns out that notable patterns of wordphrases can be identified. The dependency among wordphrases may be summarised into patterns. Contrary to English phrases, which are hard to separate from the sentence, Korean wordphrases are easily identified because blanks are the delimiters. The information on phrasal dependency, which is also more transparent in Korean than in English, should contribute to the accuracy of a hidden Markov model. In the following, a method to design a hidden Markov model that makes use of this phrasal dependency is introduced.
2.2 Construction of Hidden Markov Model
Automatic tagging for Korean using hidden Markov models has been pursued in two directions. In one approach, the Markov network is applied in the same manner as for English: the network states represent tags, but no phrasal dependency is taken into account. Typical works in this direction are found in Lim et al. (1994) and Lee et al. (1994). In the other approach, the network is a graph of phrasal dependencies, but morpheme-level dependencies are not considered (Lee et al. 1993). Each of these two methods lacks information that the other has. Our proposal combines the two methods to achieve better precision by using both morpheme and phrasal dependencies.
Figure 1 shows the steps of designing a hidden Markov model in the proposed method. The first step is to extract wordphrase patterns from a sample of texts. For each wordphrase pattern, a morpheme-level Markov network is constructed from observations on the sample texts, and the co-occurrence dependencies between wordphrase patterns are obtained at the same time. The co-occurrence dependencies are made into a graph such as the one in Figure 4.

Fig. 1: Designing a hidden Markov model using morpheme and wordphrase relations

A wordphrase pattern is denoted by two symbols, of which the first indicates the type of the content morphemes and the second the type of the function morphemes. Table 1 shows the symbols used in our testing. The tags of content and function morphemes are highly simplified, giving a minimal set of symbols, so that the number of wordphrase patterns is minimised. More wordphrase patterns would mean a larger network, which requires a larger corpus to train and a longer time to run. From the tag sets in Table 1, 48 different wordphrase patterns can be composed. Based on the Korean standard grammar and an examination of sample texts, we defined 32 wordphrase patterns. Figure 2 shows the analysis of an example sentence in morpheme and wordphrase tags.
Figure 3 illustrates a typical Markov model based on morpheme-level dependencies. The composition of wordphrase networks and the inter-wordphrase network such as in Figure 4 is shown in Figure 5. Comparing Figures 3, 4, and 5, we can easily conclude that the network in Figure 5 has the largest discriminating power of the three methods illustrated. One shortcoming of the combined method is that the size of the network is also the largest, which implies that a bigger training corpus is needed to achieve the same level of estimation accuracy. The deterioration of tagging speed and the increased network size, however, should not be critical, since the Viterbi algorithm finds the optimal tag sequence relatively fast. In the final network, as in Figure 5, each state is assigned a composite tag (wordphrase tag, morpheme tag). Let t denote a composite tag. If a lexical table is defined at each state, the following defines an automatic tagging algorithm, where T(w) is an optimal tag sequence for a given sentence w.
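In standard HMM notation this objective can be written roughly as follows; this rendering is our reconstruction from the surrounding description, not the authors' exact formula, with w_i the i-th morpheme of the sentence, t_i its composite tag, n the number of morphemes, and t_0 a sentence-initial tag:

    T(w) = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)    (1)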
Example sentence:
    pyenhwauy soktoka maywu ppalumul alkey twiesta.
    (One came to know the speed of change was very rapid.)

Morpheme tagging:
    pyenhwa/Noun  uy/Adnominal-Particle
    sokto/Noun  ka/Subjective-Particle
    maywu/Adverb
    ppalu/Verb  m/Nominalising-ending  ul/Objective-Particle
    al/Verb  key/Auxiliary-connective-ending
    twi/Auxiliary-verb  ess/Post-final-ending  ta/Sentence-final-ending

Wordphrase tagging:
    pyenhwauy/NM  soktoka/NS  maywu/A  ppalumul/PO  alkey/PC  twiesta/PF

Fig. 2: Sentence analysis in morpheme and wordphrase tags
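To make the relation between the two tag levels concrete, the toy snippet below (our own illustration; the data structures and names are not from the paper, and the morpheme tag assumed for 'maywu' is Adverb) pairs each morpheme tag with the tag of the wordphrase it belongs to, yielding the composite tags that label the states of the combined model described below:

  # Wordphrase-level vs. morpheme-level analysis of the example sentence.
  analysis = [
      ("NM", ["Noun", "Adnominal-Particle"]),                          # pyenhwauy
      ("NS", ["Noun", "Subjective-Particle"]),                         # soktoka
      ("A",  ["Adverb"]),                                              # maywu
      ("PO", ["Verb", "Nominalising-ending", "Objective-Particle"]),   # ppalumul
      ("PC", ["Verb", "Auxiliary-connective-ending"]),                 # alkey
      ("PF", ["Auxiliary-verb", "Post-final-ending", "Sentence-final-ending"]),  # twiesta
  ]

  # Each HMM state carries a composite tag (wordphrase tag, morpheme tag).
  composite_tags = [(wp_tag, m_tag)
                    for wp_tag, morpheme_tags in analysis
                    for m_tag in morpheme_tags]

  print(composite_tags[:3])
  # [('NM', 'Noun'), ('NM', 'Adnominal-Particle'), ('NS', 'Noun')]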
Content word tags:
    N  Nominals
    P  Predicates
    A  Adverbials
    M  Adnominals
    I  Interjections
    S  Symbols

Function word tags (symbols S, A, O, M, Y, X, F):
    Subjective Particle, Adverbial Particle, Objective Particle, Adnominal Particle,
    Connective Particle, Auxiliary Particle, Connective Ending, Sentence Ending

Table 1: Tag symbols of content and function words
Fig. 3: Morpheme level Markov model
By integrating wordphrase structure into the HMM network, we use not only the category of a word but also the case of its wordphrase to select a POS tag. The additional information results in increased accuracy. For example, 'casin' has two interpretations: 'casin' (common noun), meaning 'self-confidence', and 'casin' (pronoun), meaning 'oneself'. Because these two categories have similar usage distributions conditioned on other categories, a conventional bigram model will not discriminate them. Taking wordphrase types into consideration, we find that 'casin' (common noun) is more often used in NO, whereas 'casin' (pronoun) is more likely used in NM and NS. Furthermore, some categories have different usage patterns according to the wordphrases they are attached to. Such discriminations are indeed reflected in the trained probabilities, where we find that P(casin | NM, pronoun) is 0.096180, while P(casin | NM, common noun) is 0.009506.

Fig. 4: Wordphrase level Markov model

Fig. 5: Markov model with morpheme and wordphrasal relations

The particle 'wa', which has the dual role of conjunctive and auxiliary case, is another example which can be resolved by considering the wordphrase case. Let us consider the case 'kunye' (pronoun, "her") + 'wa' (auxiliary, "with"), 'kyelhon' (action common noun, "marriage") + 'ha' (verb-derived suffix, "do") + 'ta' (final ending). Because 'wa' (conjunctive) has a higher likelihood with a noun than 'wa' (auxiliary particle), 'wa' (conjunctive) is selected incorrectly when only POS tag relations are used. However, when we consider the wordphrase case of 'kyelhon' (action common noun, "marriage") + 'ha' + 'ta', which is PF, we find that 'wa' (auxiliary, "with") has a higher likelihood with PF than 'wa' (conjunctive).
As another measure for more accurate tagging, we extended the lexical probability. By defining lexical tables at each edge, we can extend the depth of dependency of the lexical probability such that the occurrence of a word is conditioned by the previous tag as well as the current one. The
extension from Equation 1 gives the following algorithm:
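The formula itself is not reproduced here; a plausible rendering, based on the description above of conditioning each word on the previous tag as well as the current one, is:

    T(w) = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_{i-1}, t_i)    (2)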
Much larger training texts will be needed to estimate P(w_i | t_{i-1}, t_i). To deal with the data sparseness, we used the well-known interpolation method of Equation 3. The interpolation coefficient λ is computed using the deleted interpolation algorithm (Rabiner 1989; Jelinek & Mercer 1980).
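Equation 3 is likewise not reproduced; the interpolation described would take approximately the following form, with \tilde{P} denoting the smoothed estimate (again our reconstruction from the surrounding text):

    \tilde{P}(w_i \mid t_{i-1}, t_i) = \lambda \, P(w_i \mid t_{i-1}, t_i) + (1 - \lambda) \, P(w_i \mid t_i)    (3)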
Many ambiguous morphemes can be further handled by using the category of the preceding morphemes. In particular, when a morpheme is analysed into two forms that have the same category, they depend only on the lexical probability. 'cwun' is such a case: 'cwun' is analysed into 'cwuta' (verb, "to give") + 'n' (adnominal ending, TENSE) and 'cwulta' (verb, "to reduce") + 'n' (adnominal ending, TENSE). As both 'cwuta' and 'cwulta' are verbs, they have the same transition probability. However, when we consider the preceding particles, the discrimination power can be enhanced. For example, adverbial and objective particles are more likely to occur before 'cwuta', whereas 'cwulta' often follows a subjective particle. From the trained model, we find that P('cwuta' | jca, pv) is 0.003882 and P('cwulta' | jca, pv) is 0.000162. Consequently, considering the relation with the category of the preceding morpheme is useful for discriminating categories.
3 Experiments
The goal of the experiments is to find how much improvement in tagging accuracy is achieved by the proposed method compared to morpheme-level hidden Markov models. We discuss possible extensions of the HMM and compare their experimental results for different sizes of training data.
3.1 Test data
For training and testing the models, we have used the KIBS¹ tagged corpus (1995). We divided the tagged corpus into three parts.
¹ The KIBS (Korea Information Base System) project aims at constructing resources for Korean language processing, including a treebank, a tagged corpus and a raw corpus, and at developing analysis tools for Korean.
    Model 1: Bigram model of morphemes
    Model 2: Trigram model of morphemes
    Model 3: Model of morphemes and wordphrase relations
    Model 4: Model of morphemes and wordphrase relations with extended lexical probability

    Training Data    Model 2    Model 3    Model 4
    475,090          0.82       0.86       0.35
    237,548          0.75       0.87       0.32
    119,772          0.71       0.78       0.31
    59,889           0.64       0.74       0.29

Table 2: Interpolation coefficients
• a set of 476,090 tagged wordphrases, the training data, which is used to build our models;
• a set of 10,698 tagged wordphrases, which is used to estimate the interpolation coefficient;
• a set of 10,702 tagged wordphrases, the test data, which is used to test the models.

Tagging Korean texts is necessarily preceded by some depth of morphological analysis, which simplifies the dictionaries of the hidden Markov model. Many words appear in deeply inflected forms, and non-trivial morphological rules are often needed to recover them. This makes it unreliable to evaluate tagging models on completely foreign texts, since the tagging critically depends on the quality of the morphological analysis. To avoid noise caused by faulty morphological analysis, we excluded from the test data the sentences that do not contain valid analysis candidates. In other words, the recall of morphological analysis in the test is set to 100%.
The trained hidden Markov network reflecting both morpheme and wordphrase relations contains 712 nodes and 28,553 edges. The number of part-of-speech tags is 52, and the average ambiguity of each wordphrase is 5.06.
3.2 Results
The experiments consist of comparisons of five tagging models. We adopt a bigram model of morphemes as the initial model. It is extended to a trigram model, which is generally used in practical English taggers (Merialdo 1994). To minimise the effect of data sparseness, we interpolate trigram distributions with bigram distributions as shown in Equation 4.
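Equation 4 is not reproduced in this text; the trigram-bigram interpolation described would read roughly as follows (our reconstruction):

    \tilde{P}(t_i \mid t_{i-2}, t_{i-1}) = \lambda \, P(t_i \mid t_{i-2}, t_{i-1}) + (1 - \lambda) \, P(t_i \mid t_{i-1})    (4)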
Fig. 6: Comparison of tagging accuracy with increase of training corpus
The interpolation coefficients are summarised in Table 2. The coefficient of Model 2 (trigram) is defined in Equation 4 and the other models use the coefficient defined in Equation 3. As the size of the training data increases, the coefficients tend to give stronger support to the original model parameters. If we assume that the degree of training of each model is proportional to the size of its interpolation coefficient, Model 4 has more room for improvement as the training corpus grows.
For small training corpora below 200,000 wordphrases, Model 3 (with morpheme and wordphrase relations) achieved the highest accuracy. As the training data increases, our proposed method, Model 4 in Table 2, outperformed the other methods. As shown in Figure 6, the proposed model exceeds the popular morpheme-based model by more than 0.53%. The wordphrase model, that is, our method without the extension of the lexical probability, gave more accurate results than the bigram and trigram models over all data sizes. This implies that the model is insensitive to training data size despite the increased network size. Thus, the extension of the lexical tables must be the source of the data sparseness of Model 4 with smaller training data.
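For concreteness, the sketch below shows one common way of estimating such a coefficient by deleted interpolation in the spirit of Jelinek & Mercer (1980); the code, the function name and the leave-one-out variant are our own illustration, not the authors' implementation.

  from collections import Counter

  def estimate_lambda(tagged_corpus):
      """Deleted-interpolation estimate of lambda in
         P~(w | t_prev, t) = lambda * P(w | t_prev, t) + (1 - lambda) * P(w | t),
      where tagged_corpus is a list of (tag, word) pairs in sentence order."""
      c_wtt, c_tt, c_wt, c_t = Counter(), Counter(), Counter(), Counter()
      prev = "<s>"
      for tag, word in tagged_corpus:
          c_wtt[(prev, tag, word)] += 1
          c_tt[(prev, tag)] += 1
          c_wt[(tag, word)] += 1
          c_t[tag] += 1
          prev = tag

      lam_ext = lam_base = 0.0
      for (t_prev, t, w), n in c_wtt.items():
          # leave-one-out relative frequencies of the two competing estimates
          p_ext = (n - 1) / (c_tt[(t_prev, t)] - 1) if c_tt[(t_prev, t)] > 1 else 0.0
          p_base = (c_wt[(t, w)] - 1) / (c_t[t] - 1) if c_t[t] > 1 else 0.0
          # credit the count to whichever distribution predicts the event better
          if p_ext >= p_base:
              lam_ext += n
          else:
              lam_base += n
      return lam_ext / (lam_ext + lam_base)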
4 Conclusions
We proposed a Korean tagging model that takes wordphrasal relations as a backbone and extends the lexical probability. As a result, our model increases the network size but achieves higher accuracy. Another merit of the proposed method is that the whole process, including extracting wordphrases and constructing a network, is executed automatically without human intervention. With a larger corpus our model is expected to perform even better as the network saturates. This paper has introduced an important issue that may be fundamental to more elaborate taggers for Korean and similar languages.

REFERENCES

Charniak, Eugene, Curtis Hendrickson, Neil Jacobson & Mike Perkowitz. 1993. "Equations for Part-of-Speech Tagging". Proceedings of the National Conference on Artificial Intelligence, 784-789. Menlo Park, Calif.: MIT Press.
Jelinek, Frederick & Robert L. Mercer. 1980. "Interpolated Estimation of Markov Source Parameters from Sparse Data". Proceedings of the Workshop on Pattern Recognition in Practice, 381-397.
Kupiec, Julian. 1992. "Robust Part-of-Speech Tagging Using a Hidden Markov Model". Computer Speech and Language 6:1.225-242.
Lee, Sang H., Jae H. Kim, Jung M. Cho & Jung Y. Seo. 1995. "Korean Morphological Analysis Sharing Partial Analyses". Proceedings of the International Conference on Computer Processing of Oriental Languages, 164-173. Hawaii, U.S.A.
Lee, Wun J., Key-Sun Choi & Gil C. Kim. 1993. "Design and Implementation of an Automatic Tagging System for Korean Texts". Proceedings of the 20th Spring Conference of the Korean Information Science Society, 805-808. Seoul, Korea. [In Korean.]
Lim, Chul S. 1994. A Korean Part-of-Speech Tagging System using Hidden Markov Model. M.Sc. thesis. KAIST, Taejon, Korea. [In Korean.]
Merialdo, Bernard. 1994. "Tagging English Text with a Probabilistic Model". Computational Linguistics 20:2.155-168.
Nam, Key S. 1986. Grammar for Standard Korean. Seoul, Korea: Top Press. [In Korean.]
Rabiner, Lawrence R. 1990. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Readings in Speech Recognition ed. by Alex Waibel & Kai-Fu Lee, 267-296. San Mateo, Calif.: Morgan Kaufmann.
A Multimodal Environment for Telecommunication Specifications
IVAN BRETAN,* MÅNS ENGSTEDT** & BJÖRN GAMBÄCK*
* Telia Research AB, ** Ericsson Telecommunication Systems Lab., * Computerlinguistik, Universität des Saarlandes

Abstract
This is a description of the rationale and basic technical framework underlying VINST, a Visual and Natural language Specification Tool. The system is intended for interactive specification of the functional behaviour of telecommunication services by users not possessing in-depth technical knowledge of telecommunication systems. In order to obtain the desired level of abstraction needed to accomplish this functionality, a natural language component has been integrated with a visual language interface based on a finite state automata metaphor. The multimodal specification produced by VINST is translated to an underlying formal specification language that is further refined or transformed in a number of steps in the design process. Furthermore, the specification is validated by means of simulation and paraphrasing. Both the specification phase and the validation phase are carried out in the same multimodal environment. This integration of modalities provides for synergy effects, including complementary expressiveness and cross-modal paraphrasing.
1 Introduction
One of the initial phases in the specification of a new telecommunication system is that of requirements engineering. The required system is normally described informally by the customer in a requirement specification consisting of informal text and figures, which is handed over to a design department. The major and obvious drawback of this process is that it involves a step of manual 'compilation' into telecommunication-oriented implementation or specification languages. This compilation or interpretation has to be undertaken by a technical specialist, who must take great care to realise the intentions of the customer through several cycles of implementation, validation and verification while tackling ambiguity, inconsistency and incompleteness in the informal specification.
The VINST research project, as carried out by Ericsson Telecom and partners, aimed at involving the customer in an actual formal specification
process, replacing the informal requirement specification with a computerised tool which provides support for describing telecommunication systems in a constrained and rigorous manner. The result of using this tool is in principle directly compilable into a formal specification of a telephone service, although the customer will generally only describe parts of the functionality of the entire system. However, the specification of these parts may be critical and very time-consuming, and may vary extensively from customer to customer.
After user studies of prototype specification systems, it was decided that a multimodal environment would be most appropriate for this task. As shown in Figure 1, the two main modalities of the user interface would be natural language (NL — initially only keyboard input) and a visual language (VL) using icons and finite state automata metaphors to describe the dynamic behaviour of the system. Although, to our knowledge, no other system with a similar integrated multimodal architecture for the specification of telecom services exists, other tools have been designed with similar goals. VISIONNAIRE (Henjum & Clarisse 1991) supports formalised requirements engineering for telecommunication applications using natural language, visual programming and animation. WATSON (Kelly & Nonnenman 1987) is also used for formal specification of telecom systems from natural language scenarios, while PLANDoc (McKeown et al. 1994) is used for the generation of natural language paraphrasal text from telephone route planning descriptions.
2 System overview
VINST is a tool which operates in at least three distinct modes:
1. Specification of static properties in a conceptual schema
2. Specification of dynamic properties in rules
3. Validation of specifications
Of these, we will here be mostly concerned with (2), which in some sense is the central mode, since (1) provides the general setting or constraints for dynamic specifications (one specification of static properties can be common to many different dynamic specifications) and (3) is intended for validating (1) and (2). Validation of a specification is carried out by simulation of the rules using manually triggered events, the conceptual schema and a description of the initial state of the world. A complementary way of validating a specification is by cross-modal paraphrasing (see below). The tasks are normally performed in the above order.
Fig. 1: Components and information flow in VINST

The goal of interacting with VINST is to produce a specification in Delphi (Höök 1993), a language dedicated to the formal description of the functional behaviour of telecommunication systems. It is a declarative language based on first-order predicate logic and Entity-Relationship theory with a discrete model of time, where the dynamic specification is made up of a set of rules consisting of an event, pre- and post-conditions, and where the static part consists of a set of axioms and a conceptual schema.
As can be seen in Figure 1, in order to produce Delphi rules, the VINST user can work with either the Natural Language (NL) modality or the Visual Language (VL) modality, both of which give rise to the same kind of internal representations in a language which we will simply refer to as IL (Internal Language) and which is modelled relatively closely on Delphi. IL is basically an abstract syntax tree, represented either in Prolog (ILP) or in Smalltalk (ILS).
The translation of natural language proceeds via several internal representation languages for NL, here collectively referred to as NIL, which contain various types of linguistic information, such as morphological and syntactic features. Likewise, the translation of VL is performed via the Visual Internal Language, VIL, which is IL annotated with information on
how to present the representation visually as icons and automata. VIL and NIL can thus be seen as supersets of IL. These translations meet in the common internal representation language IL, which does not contain any modality-specific information and which is finally translated to a Delphi representation called Delphi Textual Language (DTL). The information contained in NIL and VIL expressions that is not part of IL has to be added by the generation processes.
In Figure 1, all arrows showing translations are bidirectional, indicating the possibility of translating in any direction between the representation languages VL, NL and Delphi. The user creates the specification in either modality, upon which translation via IL to Delphi takes place, as well as translation into the alternate modality, i.e., paraphrasing. The guiding principle for paraphrasing is to give control to the user. Thus, when a visual expression is constructed, it can be translated into natural language upon request. Likewise, when natural language fragments are associated with different parts of the specification, they can optionally be paraphrased or even replaced by their visual counterparts.
The visual representation is canonical in some sense, reflecting both the system's view of the world (in terms of automata) and relevant discourse objects. It might be argued that the natural language part of the system does not need to be very sophisticated, since the VL part will have relatively high expressivity with respect to the task. However, in order for NL paraphrasing of VL expressions to be generally applicable, and also to provide complete and predictable (the term habitable is sometimes used) linguistic coverage, a large-scale NLP component is highly desirable also for a system such as VINST.
Although the existing initial VINST prototype makes use of a less extensive NLP component, an experiment to integrate VINST with a natural language processor for English based on the SRI Core Language Engine (CLE — Alshawi ed. 1992 — described further in Section 6 below) was carried out together with SRI.¹ The envisaged complete VINST system (as outlined in this paper) would of course allow the user to make more extensive use of the NL modality.
¹ Most of the adaptation of the CLE to the VINST domain, including conversion from its internal logical format to ALF, was carried out by Manny Rayner and Richard Crouch, SRI International, Cambridge, England.
Fig. 2: A VINST automaton and NL rule
3 Automata and rules
In the NL part of the system, rules can be formulated using conditionals which mirror the structure of the target Delphi rule, consisting of an event (for instance triggered by an action performed by the user of the service), a number of conditions (pre-conditions) and conclusions (post-conditions). In the visual language part, the same type of rule can be created using icons and visual counterparts of states and transitions in a finite state automaton.
In Figure 2, a two-state automaton representing two Delphi rules for a part of the service Basic Call is shown together with a rule formulated in NL which corresponds to the upper transition from the state idle to the state ready for digits. As can be seen from the figure, an automaton state can correspond both to pre- and post-conditional parts of Delphi rules, whereas the automaton transitions correspond to events of Delphi rules. Since each transition in the automaton corresponds to a Delphi rule, the automaton in Figure 2 can be translated to two corresponding Delphi rules. The main principle behind the automata metaphor is that it conveys the
temporal flow of the service, where states correspond to moments in time and transitions are triggered by events (such as picking up the receiver) which increment the temporal counter of the system. Also, the automata metaphor provides for cyclical specifications (a state can have transitions directed both to it and from it), which normally will be less explicit in a natural language-only specification.
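To make the correspondence between automata and rules concrete, a minimal sketch follows (in Python, with invented names and strings; VINST itself is implemented in Prolog and Smalltalk): every transition is read off as one rule whose pre-conditions come from its source state, whose event labels the transition, and whose post-conditions come from its target state.

  from dataclasses import dataclass
  from typing import Dict, List, Tuple

  @dataclass
  class Transition:
      source: str     # state whose conditions become the rule's pre-conditions
      event: str      # e.g. "the subscriber makes offhook"
      target: str     # state whose conditions become the rule's post-conditions

  @dataclass
  class Automaton:
      states: Dict[str, List[str]]        # state name -> conditions holding in it
      transitions: List[Transition]

      def to_rules(self) -> List[Tuple[str, List[str], List[str]]]:
          """One Delphi-style (event, pre-conditions, post-conditions) rule
          per transition of the automaton."""
          return [(t.event, self.states[t.source], self.states[t.target])
                  for t in self.transitions]

  # A fragment similar to Figure 2: one of the two transitions of Basic Call.
  basic_call = Automaton(
      states={
          "idle": ["subscriber is idle", "subscriber is onhook"],
          "ready for digits": ["subscriber is offhook", "subscriber has dialtone"],
      },
      transitions=[Transition("idle", "subscriber makes offhook", "ready for digits")],
  )

  print(basic_call.to_rules())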
4 The visual language
The basic building blocks of the visual language of VINST are icons that are used together with the graphical rendering of states and transitions. The icons, or the visual vocabulary, can be seen as part of the conceptual schema, and visualise different primitive or derived concepts of the domain.
Some icons are parameterised, where the parameters normally correspond to the predicate-argument structure of a lexical entry. For instance, the icon corresponding to the entry for the verb "dial" will normally have two initially empty slots, which can be filled with other symbols representing a subscriber and a telephone number, for the subject and direct object respectively. Other icons can be seen as representing properties, and can consequently be added as visual annotations to a main icon representing a principal object.
A one-to-one correspondence between visual symbols and lexical entries is not a necessary property of the system. In fact, due to the specificity of pictures, one icon (such as a telephone) can express arbitrarily complex states-of-affairs (for instance that there is an idle subscriber who is 'onhook'). Conversely, since one of the major points of a multimodal system is complementary expressiveness, certain VL expressions will be significantly less direct to formulate than their NL counterparts, or in some cases even impossible. The visual language of VINST does not have general mechanisms to support the formulation of quantification, disjunction and negation, for which the user is deferred to the natural language component. The visual language could of course be enhanced to handle a more complex logic, probably at the expense of intuitiveness.
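As a rough illustration of the parameterised icons described above (the class and slot names below are our own and not part of VINST), an icon can be modelled as a small structure whose initially empty slots mirror the predicate-argument positions of the associated lexical entry:

  from dataclasses import dataclass, field
  from typing import Dict, Optional

  @dataclass
  class Icon:
      """A (possibly parameterised) icon of the visual vocabulary."""
      concept: str                                    # e.g. "dial", "subscriber"
      slots: Dict[str, Optional["Icon"]] = field(default_factory=dict)

      def fill(self, role: str, filler: "Icon") -> None:
          self.slots[role] = filler

  # The icon for the lexical entry "dial": two initially empty slots,
  # to be filled with a subscriber (subject) and a telephone number (object).
  dial = Icon("dial", {"subject": None, "object": None})
  dial.fill("subject", Icon("subscriber"))
  dial.fill("object", Icon("telephone_number"))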
5 Modality synergy
As observed by, for instance, Cohen et al. (1989), direct manipulation based languages provide a controlled and guided interaction style complementary to that of natural language interfaces, while support for the type of complex compositional semantics that NL interfaces normally exhibit generally stretches the limits of what a visual language can express without attaining the same degree of difficulty as a high-level programming language. In addition to the cases where natural language simply is more convenient, such as when expressing complex quantification, the need for modality integration showed up in user studies when the VL symbol library did not contain icons which matched exactly what the user wanted to express, which could be due to problems of specificity or just suboptimal icon design. We can envision users switching from VL to NL when the former way of expressing services is judged to be too blunt (due to the specificity problem) or to involve too many interactive steps.
Another significant observation from these studies is the increased understanding of the specification created when it was paraphrased in the alternate, non-input modality. When paraphrasing the VL specification in natural language, a 'search-light' effect is obtained, where the opaque linguistic coverage of the NL component is illuminated and both domain-specific vocabulary and grammatical preferences are revealed. There is ample evidence (as reported by Karlgren 1992, among others) that such system-generated language will be picked up by the user and recycled in the continued dialogue, increasing the efficiency of the interaction.
The notion of synergy, where access to several modalities gives a functionality which cannot be obtained through using only one of them, can be taken one step further if more integrated mixing of modalities is considered. This is realised in VINST with the support for placing NL fragments in different parts of a visually specified automaton in order to 'tag' these fragments with respect to a point in time in the execution of the service. For instance, a user specifying something in NL which is cumbersome to visualise, and who places this text within a particular state, as in Figure 3, does not need to say anything about at which point in time this happens, or what event brought about this fact, since all this is given by the visual context.

    A  "A doesn't have a hotnumber but he has a redirection number."

Fig. 3: NL fragment within an automaton state

Fig. 4: The NL architecture of VINST
6 Architecture of the NL component
The NL component of VINST is divided into two parts, a front-end and a back-end. The front-end translates NL to IL, the modality-independent Internal Language. IL can in turn be translated either to VL or to Delphi, or be used as the starting point for NL generation. IL to VL translation typically requires the generation of layout information for the resulting visual description, a difficult problem whose discussion is outside the scope of the present paper.
The main functionality of the NL front-end is to translate, in both directions, between natural language and IL via the intermediate representation ALF (Application-specific Logical Form). The back-end translates IL to Delphi Textual Language, DTL. This separation of the system into a front-end and a back-end minimises changes in the system caused by changes in VL or in Delphi.
The front-end and the back-end consist of a number of sub-processing steps, as shown in Figure 4. The first step involves different types of word-
level processes, such as tokenisation, lexical analysis, and inflectional morphology. The output from the morphological component is a lattice containing all possible sequences of inflected words.
This is used by the syntactic parser, an LR-parser which works with unification-based grammar rules to produce an implicit parse tree (a derivation tree with rule annotations). Semantic rules are applied to this tree, resulting in one or several pseudo-logical representations in a format known as Quasi Logical Forms (QLF — Alshawi & van Eijck 1989). A QLF carries different amounts of informational content at different stages of processing. In the interpretation step, the QLFs undergo scoping and reference resolution. Scoping is here taken to mean the mapping from determiners, modals and certain adverbs to quantifiers or operators and the determination of their scope using linguistically motivated scope preferences. The resolution stage deals with referring expressions, elliptic phrases, and semantically vague relations (such as the ones derived from "have" and "is").
The final step in the linguistic analysis chain involves mapping the scoped, resolved QLF into the application-specific logical format, ALF, a fairly standard extension of first-order predicate logic with some higher-order operators. This is done by means of machinery operating on declarative rewrite rules in a process such as Abductive Equivalential Translation (AET — Rayner 1993). This process translates a resolved QLF into an ALF according to a domain theory which describes equivalences between logical formulae containing linguistic predicates and formulae containing Delphi-related predications. For the purposes of VINST, the ALF is further translated into an abstract syntactic representation of a Delphi rule, i.e., the IL representation.
The Delphi translation step, the back-end, translates an IL form to the final Delphi Textual Language (DTL) representation. This can be seen as a mapping from abstract to concrete syntax; however, as noted above (see Section 2), the internal language is at present mainly a notational variant of Delphi, with rather uninteresting variations depending on the actual programming language (Prolog or Smalltalk, for NL and VL respectively) it is represented in.
Going in the other direction of processing, most modules are reversible, although some cannot be used as such for practical purposes due to the high amount of indeterminism involved; for example, the first step could probably make use of rules derived from machine-learning techniques to map ALFs into 'standard' QLF fragments. Since the grammar formalism of the CLE is fully reversible, the final
step would on the other hand just involve specifying which of the (analysis) grammar rules should be allowed to be invoked by the generator. These would then be compiled in a way distinct from the format used by the parser.
7 A simple translation example
As mentioned, Delphi rules consist of an event, a number of conditions and conclusions. The informal semantics of a rule is that if an event occurs at a point of time T and the conditions are valid at the same point of time T, then the conclusions will be valid at the next point of time T + 1. A Delphi rule can be described as one conditional NL sentence with an if-part and a then-part, as described in Figure 2, although there are of course many possible alternative ways of conveying the same information, e.g., as a small discourse:

    An idle subscriber is onhook. The subscriber makes offhook. He becomes offhook and gets dialtone.

The first sentence contains some pre-conditions and the second the event of the rule. The third sentence contains the post-conditions or the conclusions of the corresponding Delphi rule. This discourse refers to a number of concepts such as subscriber, on-hook and idle. These concepts, their lexical realisations, and their relations to other concepts, to visual symbols and to Delphi expressions must already have been defined in the domain model (which includes both a conceptual schema and a domain theory). From the three sentences above, the following ALF formula is generated (the uppercase letters are variables of the standard Prolog type and binding, while the letters t indicate the time points of the different events, states and conditions):

    exists([A,B,C,D,E,F],
        cond(subscriber, A, [t=T]),
        state(be_idle, B, A, [t=T]),
        state(be_onhook, C, A, [t=T]),
        event(make_offhook, D, A, [t=T]),
        cond(get, E, A, dialtone, [t=T+1]),
        state(be_offhook, F, A, [t=T+1])
    )
This formula is obtained by using (conditional) AET equivalences of the following type relating linguistic predicates and target Delphi predicates:
    make_offhook(Event, Person) <->
        event(make_offhook, Event, Person)