Lecture Notes in Artificial Intelligence 6149
Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
FoLLI Publications on Logic, Language and Information

Editors-in-Chief
Luigia Carlucci Aiello, University of Rome "La Sapienza", Italy
Michael Moortgat, University of Utrecht, The Netherlands
Maarten de Rijke, University of Amsterdam, The Netherlands

Editorial Board
Carlos Areces, INRIA Lorraine, France
Nicholas Asher, University of Texas at Austin, TX, USA
Johan van Benthem, University of Amsterdam, The Netherlands
Raffaella Bernardi, Free University of Bozen-Bolzano, Italy
Antal van den Bosch, Tilburg University, The Netherlands
Paul Buitelaar, DFKI, Saarbrücken, Germany
Diego Calvanese, Free University of Bozen-Bolzano, Italy
Ann Copestake, University of Cambridge, United Kingdom
Robert Dale, Macquarie University, Sydney, Australia
Luis Fariñas, IRIT, Toulouse, France
Claire Gardent, INRIA Lorraine, France
Rajeev Goré, Australian National University, Canberra, Australia
Reiner Hähnle, Chalmers University of Technology, Göteborg, Sweden
Wilfrid Hodges, Queen Mary, University of London, United Kingdom
Carsten Lutz, Dresden University of Technology, Germany
Christopher Manning, Stanford University, CA, USA
Valeria de Paiva, Palo Alto Research Center, CA, USA
Martha Palmer, University of Pennsylvania, PA, USA
Alberto Policriti, University of Udine, Italy
James Rogers, Earlham College, Richmond, IN, USA
Francesca Rossi, University of Padua, Italy
Yde Venema, University of Amsterdam, The Netherlands
Bonnie Webber, University of Edinburgh, Scotland, United Kingdom
Ian H. Witten, University of Waikato, New Zealand
Christian Ebert Gerhard Jäger Jens Michaelis (Eds.)
The Mathematics of Language
10th and 11th Biennial Conference
MOL 10, Los Angeles, CA, USA, July 28-30, 2007
and MOL 11, Bielefeld, Germany, August 20-21, 2009
Revised Selected Papers
Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors
Christian Ebert
Department of Linguistics, University of Tuebingen
Wilhelmstrasse 19, 72074 Tuebingen, Germany
E-mail: [email protected]

Gerhard Jäger
Department of Linguistics, University of Tuebingen
Wilhelmstrasse 19, 72074 Tuebingen, Germany
E-mail: [email protected]

Jens Michaelis
Faculty of Linguistics and Literary Studies, University of Bielefeld
Postfach 100131, 33501 Bielefeld, Germany
E-mail: [email protected]

Library of Congress Control Number: 2010930278
CR Subject Classification (1998): F.4.1, F.2, G.2, F.3, F, I.2.3
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN     0302-9743
ISBN-10  3-642-14321-0 Springer Berlin Heidelberg New York
ISBN-13  978-3-642-14321-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper    SPIN: 06/3180
Preface
The Association for Mathematics of Language (MOL) is the ACL special interest group dedicated to the study of mathematical linguistics. After its first meeting in 1984, the association has organized meetings on a biennial basis since 1991, with locations usually alternating between Europe and the USA.

This volume contains a selection of 19 papers presented in contributed talks at the 10th meeting at UCLA, Los Angeles, in 2007 and the 11th meeting at the University of Bielefeld, Germany, in 2009. It furthermore contains three papers by invited speakers at these meetings. Like each MOL proceedings volume, this collection reflects the wide range of theoretical topics relating to language and computation that the association is devoted to supporting: it includes papers on the intersection of computational complexity, formal language theory, proof theory, and logic, as well as on phonology, lexical semantics, syntax, and typology. The volume is hence of interest not only to mathematical linguists but also to logicians, theoretical and computational linguists, and computer scientists. It therefore fits very well in the Springer FoLLI/LNAI series, and we are grateful to Michael Moortgat for suggesting this series as a place of publication.

We would furthermore like to thank everyone who played a role in making these meetings possible and helped to make them a success, including the reviewers, the invited speakers, the contributors, and the people involved in local organization.

May 2010

Christian Ebert
Gerhard Jäger
Jens Michaelis
Organization
MOL 10 was held at the University of California, Los Angeles, during July 28–30, 2007.
Local Organizers
Marcus Kracht        University of California, Los Angeles
Gerald Penn          University of Toronto
Edward P. Stabler    University of California, Los Angeles

Referees
Ron Artstein, Patrick Blackburn, Pierre Boullier, Wojciech Buszkowski, David Chiang, Tim Fernando, Markus Egg, Gerhard Jäger, Martin Jansche, David Johnson, Aravind Joshi, András Kornai, Alain Lecomte, Carlos Martín-Vide, Jens Michaelis, Mehryar Mohri, Uwe Mönnich, Michael Moortgat, Drew Moshier, Gerald Penn, Sylvain Pogodalla, Edward P. Stabler, Shuly Wintner
MOL 11 was held at the University of Bielefeld in Germany during August 20–21, 2009.
Local Organizers
Gerhard Jäger        University of Tübingen
Marcus Kracht        University of Bielefeld
Christian Ebert      University of Tübingen
Jens Michaelis       University of Bielefeld

Referees
Patrick Blackburn, Philippe de Groote, Christian Ebert, Gerhard Jäger, Aravind Joshi, Stephan Kepser, Gregory M. Kobele, András Kornai, Marcus Kracht, Natasha Kurtonina, Jens Michaelis, Michael Moortgat, Larry Moss, Richard T. Oehrle, Gerald Penn, Wiebke Petersen, Sylvain Pogodalla, James Rogers, Sylvain Salvati, Edward P. Stabler, Hans-Jörg Tiede
Table of Contents
Dependency Structures Derived from Minimalist Grammars . . . . . . . . . . . . 1
   Marisa Ferrara Boston, John T. Hale, and Marco Kuhlmann
Deforesting Logical Form . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
   Zhong Chen and John T. Hale
On the Probability Distribution of Typological Frequencies . . . . . . . . . . 29
   Michael Cysouw
A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus . . . 36
   Timothy A.D. Fowler
LC Graphs for the Lambek Calculus with Product . . . . . . . . . . . . . . . . 44
   Timothy A.D. Fowler
Proof-Theoretic Semantics for a Natural Language Fragment . . . . . . . . . . 56
   Nissim Francez and Roy Dyckhoff
Some Interdefinability Results for Syntactic Constraint Classes . . . . . . . 72
   Thomas Graf
Sortal Equivalence of Bare Grammars . . . . . . . . . . . . . . . . . . . . . 88
   Thomas Holder
Deriving Syntactic Properties of Arguments and Adjuncts from Neo-Davidsonian Semantics . . . 103
   Tim Hunter
On Monadic Second-Order Theories of Multidominance Structures . . . . . . . . 117
   Stephan Kepser
The Equivalence of Tree Adjoining Grammars and Monadic Linear Context-Free Tree Grammars . . . 129
   Stephan Kepser and James Rogers
A Formal Foundation for A and A-bar Movement . . . . . . . . . . . . . . . . 145
   Gregory M. Kobele
Without Remnant Movement, MGs Are Context-Free . . . . . . . . . . . . . . . 160
   Gregory M. Kobele
The Algebra of Lexical Semantics . . . . . . . . . . . . . . . . . . . . . . 174
   András Kornai
Phonological Interpretation into Preordered Algebras . . . . . . . . . . . . 200
   Yusuke Kubota and Carl Pollard
Relational Semantics for the Lambek-Grishin Calculus . . . . . . . . . . . . 210
   Natasha Kurtonina and Michael Moortgat
Intersecting Adjectives in Syllogistic Logic . . . . . . . . . . . . . . . . 223
   Lawrence S. Moss
Creation Myths of Generative Grammar and the Mathematics of Syntactic Structures . . . 238
   Geoffrey K. Pullum
On Languages Piecewise Testable in the Strict Sense . . . . . . . . . . . . . 255
   James Rogers, Jeffrey Heinz, Gil Bailey, Matt Edlefsen, Molly Visscher, David Wellcome, and Sean Wibel
A Note on the Complexity of Abstract Categorial Grammars . . . . . . . . . . 266
   Sylvain Salvati
Almost All Complex Quantifiers Are Simple . . . . . . . . . . . . . . . . . . 272
   Jakub Szymanik
Constituent Structure Sets I . . . . . . . . . . . . . . . . . . . . . . . . 281
   Hiroyuki Uchida and Dirk Bury
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Dependency Structures Derived from Minimalist Grammars

Marisa Ferrara Boston¹, John T. Hale¹, and Marco Kuhlmann²
¹ Cornell University
² Uppsala University
Abstract. This paper provides an interpretation of Minimalist Grammars [16,17] in terms of dependency structures. Under this interpretation, merge operations derive projective dependency structures, and movement operations introduce both non-projectivity and illnestedness. This new characterization of the generative capacity of Minimalist Grammars makes it possible to discuss the linguistic relevance of non-projectivity and illnestedness. This in turn provides insight into grammars that derive structures with these properties.
1 Introduction
This paper investigates the class of dependency structures that Minimalist Grammars (MGs) [16,17] derive. MGs stem from the generative linguistic tradition, and from Chomsky's Minimalist Program [1] in particular. The MG formalism encourages a lexicalist analysis in which hierarchical syntactic structure is built when licensed by word properties called "features". MGs also facilitate a movement analysis of long-distance dependency that is conditioned by lexical features. Unlike unification grammars, but similar to categorial grammar, these features must be cancelled in a particular order specific to each lexical item.

Dependency Grammar (DG) [18] is a syntactic tradition that determines sentence structure on the basis of word-to-word connections, or dependencies. DG names a family of approaches to syntactic analysis that all share a commitment to word-to-word connections. [8] relates properties of dependency graphs, such as projectivity and wellnestedness, to language-theoretic concerns like generative capacity. This paper examines these same properties in MG languages. We do so using a new DG interpretation of MG derivations. This tool reveals that syntactic movement as formalized in MGs can derive the sorts of illnested structures attested in Czech comparatives from the Prague Dependency Treebank 2.0 (PDT) [5].

Previous research indirectly links MGs and DGs: [12] proves the equivalence of MGs with Linear Context-Free Rewriting Systems (LCFRS) [19], and the relation of LCFRS to DGs is made explicit in [8]. This paper provides a direct connection between the two formalisms.¹ Using this connection, we investigate the linguistic relevance of structural constraints such as non-projectivity and illnestedness on the basis of MGs that induce structures with these properties. The system proposed is not a new formalism, but a technical tool that we use to gain linguistic insight.

Section 2 describes MGs as they are formalized in [17], and Section 3 translates MG operations into operations on dependency structures. Sections 4 and 5 discuss the structural constraints of projectivity and nestedness in terms of MGs.

¹ The authors thank Joan Chen-Main, Aravind K. Joshi, and audience members at Mathematics of Language 11 for discussion and suggestions.
2 Minimalist Grammars
This section introduces notation particular to the MG formalism. Following [16] and [17], a Minimalist Grammar G is a five-tuple (Σ, F, Types, Lex, ℱ). Σ is the vocabulary of the grammar, which can include empty elements; Figure 6 exemplifies the use of empty "functional" elements typical in Chomskyan and Kaynian analyses. There, ε denotes an empty element that, while syntactically potent, makes no contribution to the derived string. F is a set of features, built over a non-empty set of base features, which denote the lexical category of an item. If f is a base feature, then =f is a selection feature, which selects for complements with base feature f; a prefixed + or − identifies licensor and licensee features, +f and −f respectively, that license movement.

A Minimalist Grammar distinguishes two types of structure. The "simple" type, flagged by double colons (::), identifies items fresh out of the lexicon. Any involvement in structure-building creates a "derived" item, flagged with a single colon (:). This distinction allows the first step of syntactic composition to be handled differently from later episodes. A chain is a triple in Σ* × Types × F*, and an expression is a non-empty sequence of chains. The set of all expressions is denoted by E. The lexicon Lex is a finite subset of chains with type ::.

The set ℱ consists of two generating functions, merge and move. For simplicity, we focus on MGs that do not incorporate head or covert movement. Table 1 presents the functions in inference-rule form, following [17]. In the table, the juxtaposition st denotes the concatenation of two strings s and t.

Merge is a structure-building operation that creates a new, derived expression from two expressions (E × E → E). It is the union of three functions, shown in the upper half of Table 1; which sub-function applies depends on the type and features of the items to be merged. The three merge operations differ with respect to the types of chains they operate on. If s is simple (::), the merge1 operation applies; if it is derived (:), the merge2 operation applies. We write · when the type does not matter. If t has additional features δ, the merge3 operation must apply regardless of the type of s.

The move operation is a structure-building operation that creates a new expression from a single expression (E → E). It is the union of two functions, move1 and move2, given in the lower half of Table 1. As with merge3, move2 only applies when t has additional features δ.
Table 1. Merge and Move

          s :: =f γ        t · f, α1, …, αk
merge1    ──────────────────────────────────
          st : γ, α1, …, αk

          s : =f γ, α1, …, αk        t · f, ι1, …, ιl
merge2    ────────────────────────────────────────────
          ts : γ, α1, …, αk, ι1, …, ιl

          s · =f γ, α1, …, αk        t · f δ, ι1, …, ιl
merge3    ──────────────────────────────────────────────
          s : γ, α1, …, αk, t : δ, ι1, …, ιl

          s : +f γ, α1, …, αi−1, t : −f, αi+1, …, αk
move1     ────────────────────────────────────────────
          ts : γ, α1, …, αi−1, αi+1, …, αk

          s · +f γ, α1, …, αi−1, t : −f δ, αi+1, …, αk
move2     ──────────────────────────────────────────────
          s : γ, α1, …, αi−1, t : δ, αi+1, …, αk
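For concreteness, the following OCaml sketch (our illustration, not part of [17]'s official presentation; the type and function names are ours, and the remaining rules are analogous) renders expressions together with the merge1 and move1 rules.

(* An illustrative rendering of MG expressions.  A lexical item is a
   single chain of type Lexical. *)
type feature =
  | Cat of string                  (* base feature      f  *)
  | Sel of string                  (* selector         =f  *)
  | Pos of string                  (* licensor         +f  *)
  | Neg of string                  (* licensee         -f  *)

type ftype = Lexical | Derived     (* "::" versus ":" *)

type chain = string * ftype * feature list

type expression = chain list       (* a non-empty sequence of chains *)

(* merge1: a lexical selector  s::=f gamma  meets  t . f, alpha_1..alpha_k;
   the pattern [Cat g] demands that f is t's only remaining feature *)
let merge1 (e1 : expression) (e2 : expression) : expression option =
  match e1, e2 with
  | [ (s, Lexical, Sel f :: gamma) ], (t, _, [ Cat g ]) :: alphas when f = g ->
      Some ((s ^ " " ^ t, Derived, gamma) :: alphas)
  | _ -> None

(* move1: the head chain is  s:+f gamma; requiring exactly one chain whose
   whole feature list is [-f] approximates the SMC discussed in Section 3.3 *)
let move1 (e : expression) : expression option =
  match e with
  | (s, Derived, Pos f :: gamma) :: rest ->
      (match List.partition (fun (_, _, fs) -> fs = [ Neg f ]) rest with
       | [ (t, _, _) ], others ->
           Some ((t ^ " " ^ s, Derived, gamma) :: others)
       | _ -> None)
  | _ -> None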
3 MG Operations on Dependency Trees
In this section we introduce a formalism to derive dependency structures from MGs. Throughout the discussion, N denotes the set of non-negative integers.

3.1 Dependency Trees
DG is typically discussed in terms of directed dependency graphs. However, the directed nature of dependency arrows and the single-headed condition [13] allow these graphs to also be viewed as trees. We define dependency trees in terms of their nodes, with each node in a dependency tree labeled by an address, a sequence of positive integers. We write λ for the empty sequence of integers. Letters u, v, w are variables for addresses, s, t are variables for sets of addresses, and x, y are variables for sequences of addresses. If u and v are addresses, then their concatenation uv is an address as well. Given an address u and a set of addresses s, we write ↑u s for the set { uv | v ∈ s }. Given an address u and a sequence of addresses x = v1, …, vn, we write ↑u x for the sequence uv1, …, uvn. Note that ↑u s is a set of addresses, whereas ↑u x is a sequence of addresses.

A tree domain is a set t of addresses such that, for each address u and each integer i ∈ N, if ui ∈ t, then u ∈ t (prefix-closed) and uj ∈ t for all 1 ≤ j ≤ i (left-sibling closed). A linearization of a finite set S is a sequence of elements of S in which each element occurs exactly once. For the purposes of this paper, a dependency tree is a pair (t, x), where t is a tree domain and x is a linearization of t. A segmented dependency tree is a non-empty sequence (s1, x1), …, (sn, xn), where each si is a set of addresses, each xi is a linearization of si, all sets si are pairwise disjoint, and the union of the sets si forms a tree domain. A pair (si, xi) is called a component; components correspond to chains in Stabler and Keenan's [17] terminology.
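These address manipulations are directly computable. As an illustration (ours, with invented function names), over integer lists:

(* Addresses as integer lists; lambda is [].  shift u s computes the set
   written  ↑u s  in the text. *)
type address = int list

let shift (u : address) (s : address list) : address list =
  List.map (fun v -> u @ v) s

(* next: the least positive integer i such that the address [i] is not in
   the tree domain t *)
let next (t : address list) : int =
  let rec go i = if List.mem [ i ] t then go (i + 1) else i in
  go 1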
An expression is a sequence of triples (c1, τ1, γ1), …, (cn, τn, γn), where (c1, …, cn) is a segmented dependency tree, each τi is a type (lexical or derived), and each γi is a sequence of features. We write these triples as ci :: γi (if the type is lexical), ci : γi (if the type is derived), or ci · γi (if the type does not matter). We use the letters α and ι as variables for elements of an expression. Given an element α = ((s, x), τ, γ) and an address u, we write ↑u α for the element ((↑u s, ↑u x), τ, γ). Given an expression d with associated tree domain t, we write next(d) for the minimal positive integer i such that i ∉ t.

3.2 Merge
Merge operations allow additional dependency structure to be added to an initially derived tree. These changes are recorded in the manipulation of the dependency-tree addresses, as formalized in the previous section. Table 2 provides a dependency interpretation for each of the structure-building rules introduced in Table 1. The mergeDG functions create a dependency between two trees, such that the root of the left tree becomes the head of the root of the right tree, where left and right correspond to the trees in the rules. For example, in merge1DG the ↑1 t notation signifies that t is now the first daughter of s. Its components are similarly updated.

Table 2. Merge in terms of dependency trees

            ({λ}, λ) :: =f γ        (t, x) · f, α1, …, αk
merge1DG    ──────────────────────────────────────────────
            ({λ} ∪ ↑1 t, λ · ↑1 x) : γ, ↑1 α1, …, ↑1 αk

            (s, x) : =f γ, α1, …, αk        (t, y) · f, ι1, …, ιl
merge2DG    ──────────────────────────────────────────────────────
            (s ∪ ↑i t, ↑i y · x) : γ, α1, …, αk, ↑i ι1, …, ↑i ιl

            (s, x) · =f γ, α1, …, αk        (t, y) · f δ, ι1, …, ιl
merge3DG    ────────────────────────────────────────────────────────
            (s, x) : γ, α1, …, αk, (↑i t, ↑i y) : δ, ↑i ι1, …, ↑i ιl

where i = next((s, x) · =f γ, α1, …, αk)
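Under the address-based encoding, merge1DG is pure re-addressing. A small OCaml sketch (our illustration, reusing shift from above; names are ours):

(* A component pairs a set of addresses with their linearization *)
type component = address list * address list

let up1 ((s, x) : component) : component = (shift [ 1 ] s, shift [ 1 ] x)

(* merge1DG: the lexical head becomes the root (address []); the selected
   tree and its components are re-addressed as daughter 1, and the head
   precedes it in the linear order *)
let merge1dg ((t, x) : component) (alphas : component list) :
    component * component list =
  let t1, x1 = up1 (t, x) in
  (([] :: t1, [] :: x1), List.map up1 alphas)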
Applying the merge1DG rule to a simple English grammar creates the dependency tree in Figure 1(d). Dependency relations between nodes are notated with solid arrows and node labels with dotted lines; the dependency graphs shown in the figures are encoded by the address sets described above in Table 2. A lexicon for these examples is provided in Figure 1(a). As mentioned above, the merge rules apply in different contexts depending on the tree types and number of features. Merge1DG can apply in Figure 1 because the selector tree the is simple and the selected tree boat does not have additional features δ. The entire derived tree forms a single component, denoted by the dashed box. Merge2 contrasts with merge1 in the linearized order of the nodes: in this case, the right tree is ordered before the left tree and its children, as in Figure 2(a).
(a) Lexicon 1:
    the::=n d
    boat::n
    docked::=d =d v
    where::d -wh
    ε::=v +wh c

(b) the::=n d    (c) boat::n    (d) the:d boat
Fig. 1. merge1DG applies to two simple dependency trees
The rules given in Table 2 are a deduction system for expressions: sequences of triples whose first components together define a segmented dependency tree. Each component, spanning any number of words, has a feature sequence associated with it. Merge3DG introduces new, unlinearized components into the derivation. These components are unordered with respect to the other components, though the words within the components are ordered. In Figure 2(b), docked and where are represented by separate dashed boxes; this indicates that their relative linear order is unknown. Merge3DG contrasts with applications of merge1DG and merge2DG, where two dependency trees are merged into one fully-ordered component, as demonstrated by Figures 1(d) and 2(a).

3.3 Move
The move operation does not create or destroy dependencies in the derived tree. It re-orders the nodes and reduces the number of components in the tree by one. Table 3 defines these rules in terms of dependency trees.

Table 3. Move in terms of dependency trees

           (s, x) : +f γ, α1, …, αi−1, (t, y) : −f, αi+1, …, αk
move1DG    ─────────────────────────────────────────────────────
           (s ∪ t, y · x) : γ, α1, …, αi−1, αi+1, …, αk

           (s, x) · +f γ, α1, …, αi−1, (t, y) : −f δ, αi+1, …, αk
move2DG    ───────────────────────────────────────────────────────
           (s, x) : γ, α1, …, αi−1, (t, y) : δ, αi+1, …, αk
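On this encoding, move1DG involves no tree surgery at all; a sketch (ours), reusing the component type from the merge sketch above:

(* move1DG on components: union the moving component (t, y) into the head
   component (s, x), ordering y before x.  Because the address sets are
   disjoint, list append implements set union. *)
let move1dg ((s, x) : component) ((t, y) : component) : component =
  (s @ t, y @ x)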
Figure 3 demonstrates the move1DG operation on a simple structure. The where node is not only reordered to the front of the tree, but also becomes part of the root node's component. Note that for this example we use an ε to denote the structural position that has the +wh feature. This follows standard linguistic practice; in some languages, this position can be marked by an overt lexical item. In English, it is not. Both ε and overt lexical items can have licensor features. Unlike the merge and move operations described previously, move2DG does not change the dependency structure or linearization of the tree. Move2DG applies when the licensee component has additional features that require further movements. Its sole purpose is to cancel the licensor and licensee features; this feature cancellation is necessary to preserve the linguistic intuition of the intermediate derivation step.
[Figure 2: (a) the single fully-ordered component "the boat sailed: v"; (b) the two mutually unordered components "docked:=d v" and "where:-wh".]

Fig. 2. merge2DG and merge3DG
[Figure 3: the expression "ε:+wh c the boat docked", with "where: -wh" as a separate component, becomes the single component "where ε:c the boat docked".]

Fig. 3. move1DG
Figure 4 demonstrates move2DG. Although the +wh and −wh features are canceled, the linear order is unaffected and the additional licensee feature −t on where remains.

Following [16] and [17], we restrict movement with the Shortest Move Condition (SMC), defined in (1).

(1) None of α1, …, αi−1, αi+1, …, αk has −f as its first feature.

Adoption of the SMC guarantees a version of MGs that is weakly equivalent to LCFRS [2].

The rules above derive connected dependency structures. This is demonstrated by induction on the derived structure: single-node trees (i.e., simple lexical items) vacuously satisfy connectedness; all merge rules create dependencies between trees; and movements do not destroy any already-created dependencies. Therefore, the dependency structure at any derivation step is connected.

Provided that every expression in the lexicon has exactly one base feature, the dependency trees derived from MGs will not contain multi-headed nodes (i.e., nodes with multiple parents). This single-headedness proof follows straightforwardly from two lemmas concerning the role of base features in expressions E = (c1, τ1, γ1), …, (cn, τn, γn). Lemma 1 asserts the unique existence of a base feature f in the first feature sequence γ1. Lemma 2 denies the existence of base features in later components.
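The SMC in (1) amounts to a one-line check over the non-head chains. A sketch (ours), using the feature and chain types from the earlier illustration:

(* SMC: movement for licensor +f is licensed only when exactly one of the
   remaining chains begins with the licensee -f *)
let smc_ok (f : string) (chains : chain list) : bool =
  let starts_neg ((_, _, fs) : chain) =
    match fs with Neg g :: _ -> g = f | _ -> false
  in
  List.length (List.filter starts_neg chains) = 1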
[Figure 4: "ε:+wh c the boat docked", with "where: -wh -t" unordered, becomes "ε: c the boat docked", with "where: -t" still unordered.]

Fig. 4. move2DG
The dependency structures derived at intermediate steps need not be totally ordered. Because of the merge3 and move2 rules, components can be introduced into the dependency structure that have not yet moved to their final position in the structure. However, the usual notion of a start category [17] in MGs is a single base feature. This implies that in complete derivations all licensee features have been checked, which guarantees that dependency trees derived using the system in Tables 2 and 3 are totally ordered.
4 Minimalist Grammars and Block Degree
Projectivity is a constraint on dependency structures that requires subtrees to span intervals. [9] define an interval with endpoints i and j as the set [i, j] := {k ∈ V | i ≤ k and k ≤ j}, where V is a set of nodes as defined in Section 3.1. Non-projective structures violate this constraint: the node labeled docked in Figure 3 spans two intervals, since its dependent where occupies position 0 while docked and its other dependents occupy positions 2–4. Following [8], we use the notion of block degree to characterize non-projective structures. A tree's block degree is the maximum number of intervals its subtrees span. The block degree of the structure in Figure 3, repeated in Figure 5, is two: each of the two intervals of the node labeled docked forms a block. Shaded boxes notate node blocks.
[Figure 5: "where ε:c the boat docked", with the two blocks of the docked subtree shaded.]

Fig. 5. The block degree of this structure is 2
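The block degree of an encoded structure can be computed directly; the following OCaml sketch (our illustration, using the address type introduced in Section 3.1; names are ours) counts, for each node, the maximal runs of its subtree's positions in the linearization.

(* u is an ancestor-or-self of v iff u is a prefix of v *)
let rec is_prefix (u : address) (v : address) : bool =
  match u, v with
  | [], _ -> true
  | a :: u', b :: v' -> a = b && is_prefix u' v'
  | _ -> false

(* the block degree of (t, x): the maximum, over nodes u, of the number
   of maximal runs of subtree-of-u positions in the linearization x *)
let block_degree (t : address list) (x : address list) : int =
  let blocks u =
    let rec runs prev = function
      | [] -> 0
      | w :: rest ->
          let here = is_prefix u w in
          (if here && not prev then 1 else 0) + runs here rest
    in
    runs false x
  in
  List.fold_left (fun acc u -> max acc (blocks u)) 0 t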
By construction, mergeDG always forms dependency relations between the roots of subtrees, and all nodes in the resulting expression are part of the same interval. Move1DG, by contrast, has the potential to create non-projective structures: constituents can move away from the interval that the parent node spans and create a separate constituent block, as demonstrated by Figure 5. In this example, the ε element intervenes in the dependency between docked and where. As discussed above, in other languages the ε could be replaced by an overt lexical item; both types of intervention count as non-projective in this work. Because only movements can cause non-projectivity, and because all movements in the MG framework are triggered by a licensor–licensee pair, the block degree of the derived tree is bounded by the number of licensees. In other words, the number of licensees determines the maximum block degree of the structure. This number has previously been identified as an upper bound on the complexity of MGs [11,6]. That these bounds coincide follows from work by [14], who attributed the increased parsing complexity of LCFRS to non-projectivity [8].
5 Minimalist Grammars and Nestedness
A further constraint on the class of dependency structures is wellnestedness [8]. Wellnested structures prohibit the "crossing" of disjoint subtree intervals. Any structure that is not wellnested is said to be illnested, as in Figure 6(b). Here, the subtree spanning the hearing on the issue crosses the subtree spanning scheduled today. [8] demonstrates that grammars that derive illnested structures are more powerful than grammars that do not, which leads to higher parsing complexity.

(a) Lexicon:
    the::=n =p d -z        hearing::n
    on::=d p -x            the::=n d
    issue::n               is::=v =d t -y
    scheduled::=r v        today::r -w
    ε::=t +w c             ε::=c +x c
    ε::=c +y c             ε::=c +z c
(b) [the dependency structure over the hearing is scheduled on the issue today, in which the subtree the hearing on the issue crosses the subtree is scheduled today]

Fig. 6. MGs derive illnested dependency structures
[Figure 7 shows the derivation step by step; each panel lists the current components:]
(a) Merged structure before movement: the: -z hearing on: -x the issue is: t -y scheduled, with today: -w a separate component.
(b) Merge of ε::=t +w c and is: t -y. Movement of today: -w.
(c) Merge of ε::=c +x c and ε: c. Movement of on: -x.
(d) Merge of ε::=c +y c and ε: c. Movement of is: -y.
(e) Merge of ε::=c +z c and ε: c. Movement of the: -z.

Fig. 7. Derivation of illnested English structure
We prove that MGs are able to derive illnested structures by example. The grammar in Figure 6(a) derives the illnested English structure in Figure 6(b). The result is a 1-illnested structure, the lowest level of illnestedness in the characterization of [10]. Not all mildly context-sensitive formalisms can derive illnested structures. For example, TAGs can only generate wellnested structures with a block degree of at most two [8]. Our proof demonstrates that MGs derive structures with higher block degrees, which have the potential for illnestedness. This allows MGs to generate the same string languages as LCFRS, which also generate illnested structures [15].

The illnested structure in Figure 6(b) is also interesting from a linguistic perspective. It represents a case of noun-complement clause extraposition [4], where the complement on the issue is extraposed from the determiner phrase the hearing. The additional extraposition of the adverb today from the verb phrase is scheduled leads to the illnested final structure. Several analyses of extraposition are put forth in the literature, but here we choose Kayne's [7] "stranding analysis", in which a series of leftward movements leads to modifier stranding. These movements are each motivated by empty functional categories that could be overt in other possible human languages. In this lexicon, first the adverb is moved by the licensor +w (Figure 7(b)), followed by the prepositional phrase, the verb phrase, and finally the noun phrase, as in Figures 7(c) through 7(e).

The analysis of the illnested structure in Figure 6(b) in terms of extraposition provides a first step towards an understanding of the linguistic relevance of illnested structures. This is useful not only for the analysis of dependencies in formal grammars, but also for the analysis of extraposition in linguistics. Investigating the linguistic qualities of illnested structures cross-linguistically also shows promise. For example, treebanks for languages with freer word order, such as Czech, tend to contain more illnested structures [8]. The sentence in Figure 8 is sentence number Ln94209_45.a/18 from the PDT 2.0.² An English gloss of the sentence is "A strong individual will obviously withstand a high risk better than a weak individual".³ This particular example is a comparative construction (better X than Y) [3], which can give rise to illnestedness in Czech. The MG acknowledges syntactic relationships between the comparative construction and the adjectives weak and strong. The specific analysis stays close to the Kaynian tradition in supposing empty categories that intertwine with the dependencies from two different subtrees. Other constructions that cause illnestedness in the PDT are subject complements, verb-nominal predicates, and coordination (Kateřina Veselá, p.c.). At least in the case of Czech, this evidence suggests that the expressive power of a syntactic movement rule (rather than just attachment) is required.
² Punctuation is removed to simplify the diagram.
³ The authors thank Jiří Havelka (IBM Czech Republic), Kateřina Veselá (Charles University), and E. Wayles Browne (Cornell University) for the translation and linguistic analysis of the illnested structures in the PDT.
(a) Lexicon:
    Vysokému::Atr          riziku::=Atr Obj -f
    se::AuxT               samozřejmě::AuxY
    lépe::=AuxC Adv        než::=ExD AuxC -b
    slabý::ExD             silný::Atr -d
    jedinec::=Atr Sb -a    ubrání::=Obj =Adv =AuxY =AuxT =Sb Pred -e
    ε::=Pred +a c          ε::=c +b c
    ε::=c +d c             ε::=c +e c
    ε::=c +f c

(b) Vysokému riziku ε:c se samozřejmě lépe ubrání silný než slabý jedinec
    high risk — self obviously better will-defend strong than weak individual

Fig. 8. An illnested Czech example from the Prague Dependency Treebank
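Illnestedness, too, can be checked mechanically on the address-based encoding of Section 3.1: two disjoint subtrees cross just in case their position sets interleave in the linear order. The following brute-force OCaml sketch (our illustration; names are invented, and is_prefix is as in the block-degree sketch above) makes the test explicit.

(* positions of the subtree rooted at u in the linearization x *)
let positions (x : address list) (u : address) : int list =
  let rec go i = function
    | [] -> []
    | w :: rest ->
        if is_prefix u w then i :: go (i + 1) rest else go (i + 1) rest
  in
  go 0 x

(* ps and qs interleave if some p1 < q1 < p2 < q2 exists *)
let interleave ps qs =
  List.exists (fun p1 ->
      List.exists (fun q1 ->
          List.exists (fun p2 ->
              List.exists (fun q2 -> p1 < q1 && q1 < p2 && p2 < q2) qs) ps) qs) ps

(* a structure (t, x) is illnested if two disjoint subtrees cross *)
let illnested (t : address list) (x : address list) : bool =
  let disjoint u v = not (is_prefix u v) && not (is_prefix v u) in
  List.exists (fun u ->
      List.exists (fun v ->
          disjoint u v
          && (interleave (positions x u) (positions x v)
              || interleave (positions x v) (positions x u))) t) t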
6 Conclusion
This paper provides a definition of MG merge and move operations in terms of dependency trees, and examines the properties of these operations in terms of both projectivity and nestedness constraints. We find that MGs with movement rules derive illnested structures of exactly the sort required by Czech comparatives and English noun-complement clause extractions. The work also provides a basis for future research in determining how different types of MG movement, such as head, covert, and remnant movement, interact with dependency constraints and properties like illnestedness. Dependency-generative capacity may also provide a new avenue of research into determining how different types of locality constraints (besides the SMC) interact with generative capacity [2].
References

1. Chomsky, N.: The Minimalist Program. MIT Press, Boston (1995)
2. Gärtner, H.M., Michaelis, J.: Some remarks on locality conditions and Minimalist Grammars. In: Sauerland, U., Gärtner, H.M. (eds.) Interfaces + Recursion = Language?, pp. 161–195. Mouton de Gruyter, Berlin (2007)
3. Goldberg, A.: Constructions at Work: The Nature of Generalization in Language. Oxford University Press, New York (2006)
4. Guéron, J., May, R.: Extraposition and logical form. Linguistic Inquiry 15, 1–32 (1984)
5. Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas, P., Štěpánek, J., Havelka, J., Mikulová, M.: Prague Dependency Treebank 2.0 (2000)
6. Harkema, H.: A characterization of minimalist languages. In: de Groote, P., Morrill, G., Retoré, C. (eds.) LACL 2001. LNCS (LNAI), vol. 2099, p. 193. Springer, Heidelberg (2001)
7. Kayne, R.S.: The Antisymmetry of Syntax. MIT Press, Cambridge (1994)
8. Kuhlmann, M.: Dependency structures and lexicalized grammars. Ph.D. thesis, Universität des Saarlandes (2007)
9. Kuhlmann, M., Nivre, J.: Mildly non-projective dependency structures. In: Proceedings of COLING/ACL 2006, pp. 507–514 (2006)
10. Maier, W., Lichte, T.: Characterizing discontinuity in constituent treebanks. In: Proceedings of the Fourteenth Conference on Formal Grammar (2009), http://webloria.loria.fr/~degroote/FG09/Maier.pdf
11. Michaelis, J.: Derivational minimalism is mildly context-sensitive. In: Moortgat, M. (ed.) LACL 1998. LNCS (LNAI), vol. 2014, p. 179. Springer, Heidelberg (2001)
12. Michaelis, J.: On formal properties of Minimalist Grammars. Linguistics in Potsdam (LiP) 13, Universitätsbibliothek Publikationsstelle, Potsdam (2001)
13. Nivre, J.: Inductive Dependency Parsing. Text, Speech and Language Technology. Springer, New York (2006)
14. Satta, G.: Recognition of linear context-free rewriting systems. In: Proceedings of the Association for Computational Linguistics (ACL), pp. 89–95 (1992)
15. Seki, H., Matsumura, T., Fujii, M., Kasami, T.: On multiple context-free grammars. Theoretical Computer Science 88(2), 191–229 (1991)
16. Stabler, E.P.: Derivational minimalism. In: Retoré, C. (ed.) LACL 1996. LNCS (LNAI), vol. 1328, pp. 68–95. Springer, Heidelberg (1997)
17. Stabler, E.P., Keenan, E.: Structural similarity within and among languages. Theoretical Computer Science 293(2), 345–363 (2003)
18. Tesnière, L.: Éléments de syntaxe structurale. Éditions Klincksieck (1959)
19. Vijay-Shanker, K., Weir, D.J., Joshi, A.K.: Characterizing structural descriptions produced by various grammatical formalisms. In: Proceedings of the Association for Computational Linguistics (ACL), pp. 104–111 (1987)
Deforesting Logical Form

Zhong Chen and John T. Hale
Department of Linguistics, Cornell University
Ithaca, NY 14853-4701, USA
{zc77,jthale}@cornell.edu
Abstract. This paper argues against Logical Form (LF) as an intermediate level of representation in language processing. We apply a program transformation technique called deforestation to demonstrate the inessentiality of LF in a parsing system that builds semantic interpretations. We consider two phenomena, Quantifier Raising in English and Wh-movement in Chinese, which have played key roles in the broader argument for LF. Deforestation derives LF-free versions of these parsing systems. This casts doubt on LF’s relevance for processing models, contrary to suggestions in the literature.
1 Introduction
It is the business of the computational linguist, in his role as a cognitive scientist, to explain how a physically-realizable system could ever give rise to the diversity of language behaviors that ordinary people exhibit. In pursuing this grand aspiration, it makes sense to leverage whatever is known about language itself in the pursuit of computational models of language use. This idea, known as the Competence Hypothesis [5], dates back to the earliest days of generative grammar.

Hypothesis 1 (Competence Hypothesis). A reasonable model of language use will incorporate, as a basic component, the generative grammar that expresses the speaker-hearer's knowledge of the language.

The Competence Hypothesis is the point of departure for the results reported in this paper. Sections 4 and 5 show how a syntactic theory that incorporates a level of Logical Form (LF) can be applied fairly directly in a physically-realizable parser. These demonstrations are positive results about the viability of certain kinds of transformational grammars in models of language use. While not strictly novel, such positive results are important for cognitive scientists and others who wish to maintain the Competence Hypothesis about the grammar-parser relationship. In particular, the parser described in Section 4.2 handles a variety of English quantifier-scope examples that served to motivate LF when that level was first introduced. Section 5.2 extends the same technique to Chinese wh-questions, a case that has been widely taken as evidence in favor of such a level.
Sections 4.3 and 5.3 then apply a general program transformation technique called deforestation to each parser. Deforestation, as championed by Wadler [25] and described in more detail in Section 3, is a general method for getting rid of intermediate data structures in functional programs. In the current application, it is the LF representations that are eliminated. The resultant parsing programs do not construct any such representations, although they do compute the same input-output function.

The outcome of deforestation is a witness to the fact that the parser need not construct LFs in order to do the job the grammar specifies. This inessentiality suggests that representations at the LF level should not be viewed as causally implicated in human sentence comprehension, despite suggestions to the contrary in the literature. For instance, Berwick and Weinberg suggest that:

    There is good evidence that the data structures or units of representation posited by theories of transformational grammar are actually implicated causally in online language processing. [1, 197]

Consider this claim in relation to the positive and negative results. If one were to observe a characteristic pattern of errors or response times across two language understanding tasks that vary only in their LF-handling requirements, then a good theory of these results might involve some kind of LF parser whose computational actions causally depend on LF representations. The positive results of Sections 4.2 and 5.2 show how this could be done. However, the negative results to be presented in Sections 4.3 and 5.3 imply that the very same kind of model can, through deforestation, be reformulated to avoid calculating with precisely the LF representations that Berwick and Weinberg suggest are causally implicated in language understanding. This is a paradox. To escape it, such evidence, if it exists, must instead be viewed as evidence for cognitive processes that calculate the language relationships that LF specifies, rather than as evidence for LF per se.

The conjunction of these negative and positive results holds special significance for linguistics because it casts doubt on two widely held hypotheses.

Hypothesis 2 (No levels are irrelevant). To understand a sentence it is first necessary to reconstruct its analysis on each linguistic level. [4, 87]

Hypothesis 3 (LF Hypothesis). LF is the level of syntactic representation that is interpreted by semantic rules. [22, 248]

We present counter-examples where understanding does not require reconstruction of the LF level, and where semantic rules would not apply at LF. With the broader significance of the question in mind, Section 2 identifies the sense of the term "Logical Form" at issue. A brief introduction to deforestation follows in Section 3. Sections 4.2 and 5.2 then define two parsers whose program text closely mirrors the grammar. Sections 4.3 and 5.3 discuss how deforestation applies to both programs. Section 6 makes some concluding remarks, speculating on relationships between this deforestation result and recent developments in linguistic theory.
2 What Is Meant by LF

2.1 LF Is a Level of Representation
Outside of the transformational generative grammar community, the words "Logical Form" refer to a symbolization of a sentence's meaning in some agreed-upon logic. However, within this community, LF is a technical term that refers to a specific level of representation — an obligatory subpart of well-formed structural descriptions. Hornstein writes,

    LF is the level of representation at which all grammatical structure relevant to semantic interpretation is provided. [15, 3]

Hornstein describes LF as being introduced in response to a realization that "surface structure cannot adequately bear the interpretive load expected of it" [15, 2]. Thus, in the transition from Chomsky's Revised Extended Standard Theory to theories based on Government and Binding, an interpretive semantics is retained, while the precise level of representation being most directly interpreted is altered.

2.2 LF Is an Interface Level
In the course of applying LF to problems of quantifier scope, May [23] makes clear that LF is a kind of interface level by writing,

    We understand Logical Form to be the interface between a highly restricted theory of linguistic form and a more general theory of natural language semantics and pragmatics. [23, 2]

The heavy lifting will be done by this more general theory. May emphasizes the mismatch between the limited capabilities of the rules affecting LF, compared to those that will be required to turn LFs into bona fide semantic representations. The latter occupy a different level, LF′, pronounced "LF prime".

    Representations at LF′ are derived by rules applying to the output of sentence grammar … Since the rules mapping Logical Form to LF′ are not rules of core grammar, they are not constrained by the restrictions limiting the expressive power of rules of core grammar. [23, 27]

Representations at LF′ are subject to "the recursive clauses of a Tarskian truth-condition theory" [23, 26]. This is to be contrasted with representations at LF, which are not: these representations are phrase markers, just like the immediate constituency trees at surface structure. This sense of LF has been repeatedly invoked in Chomsky's recent work [7, fn 20][8, fn 11].

With this LF terminology clarified, subsequent sections apply functional programming techniques to define a surface structure parser and extend it with a transformational rule, Move-α, to build LF representations. We investigate two cases of this rule, Quantifier Raising in Section 4 and Wh-movement in Section 5. The relationship between these levels is depicted in Figure 1.
Sentences —parser→ Surface Structures —move-α (QR, Wh-movement, …)→ Logical Forms —mapping rules (convert)→ LF′s

Fig. 1. The interlevel relationships from a sentence to LF′s
3 Deforestation
Deforestation is a source-to-source program transformation used to derive programs that are more efficient in the sense that they allocate fewer data structures that only exist ephemerally during a program's execution. Wadler provides a small deforestation example [25], which we repeat below as Listing 1 in the interest of self-contained presentation. In this example, the main function sumSquares calculates the value ∑_{x=1}^{n} x².
Listing 1. Classic deforestation example
The manner in which sumSquares actually calculates the sum of squares is typical of functional programs. Intuitively, it seems necessary that the upto function must first create the list [1, 2, . . . , n]. Then the map function maps square over this list, yielding a new list: [1, 4, . . . , n2 ]. These lists are both kinds of intermediate data. They are artifacts of the manner in which the function is calculated. the 3 Indeed, 1 2 sum function transmutes its input list into a single integer, 6 2n + 3n + n . This makes clear that the intermediate lists have a limited lifetime during the execution of the function. The idea of deforestation is to translate programs like the one in Listing 1 into programs like the one in Listing 2. let sumSquaresDeforested x = let rec h a m n = if m > n then a else h (a + (square m)) (m+1) n in h 0 1 x
Listing 2. Deforested program does not allocate intermediate lists
Deforesting Logical Form
17
The program in Listing 2 recurs on an increasing integer m and scrupulously avoids calling the Caml list constructor (::). The name “deforestation” is evocative of the fact that intermediate trees can be eliminated in the same way as lists. Over the years, many deforestation results have been obtained in the programming languages community. The typical paper presents a system of transformation rules that converts classes of program fragments in an input language into related fragments in the output language. An example rule from Wadler’s transformation scheme T is given is given below. T f t1 . . . tk = T t[t1 /v1 , . . . , tk /vk ]
(4)
where f is defined by f v1 . . . vk = t The deforestation rule above essentially says that if the definition of a function f is a term t in the input language, then the deforestation of f applied to some argument terms t1 through tk is simply the replacement of the function by its definition, subject to a substitution of the formal parameters v1 ,. . . ,vk by the actual parameters t1 ,. . . ,tk . This kind of program transformation is also known as the unfold transformation [3]. Wadler provides six other rules that handle other features of his input language; the original paper [25] should be consulted for full details. In the present application, neither automatic application nor heightened efficiency is the goal. Rather, at issue is the linguistic question whether or not a parser that uses LF strictly needs to do so. Sections 4.3 and 5.3 argue for negative answers to this question.
4
Quantifier Raising
Quantifier Raising (QR) is an adjunction transformation1 proposed by May [23] as part of a syntactic account of quantifier scope ambiguities. May discusses the following sentence (5) with two quantifiers every and some. It has only one reading; the LF representation is given in (6). (5) Every body in some Italian city met John. ‘There is an Italian city, such that all the people in it met John.’ (6) ∃x (∀y ((Italian-city (x) & (body-in (y, x) → met-John (y))))) 4.1
Proper Binding
When May’s QR rule freely applies to a sentence with multiple quantifiers, it derives multiple LFs that intuitively correspond to different quantifier scopes. 1
The word “transformation” here is intended solely to mean a function from trees to trees. Adjunction is a particular kind of transformation such that, in the result, there is a branch whose parent label is the same as the label of the sister of the re-arranged subtree.
18
Z. Chen and J.T. Hale
Sometimes this derivational ambiguity correctly reflects scope ambiguity. However, in semantically unambiguous cases like (5), the QR rule overgenerates. A representational constraint, the Proper Binding Condition (PBC) [12] helps to rule out certain bad cases that would otherwise be generated by unfettered application of QR. Principle 1 (The Proper Binding Condition on QR). Every raised quantified phrase must c-command2 its trace. Perhaps the most direct rendering of May’s idea would be a program in which LFs are repeatedly generated but then immediately filtered for Proper Binding. One of the LFs for example (5), shown in Figure 2(b) below, would be ruled out by the PBC because DP1 does not c-command its trace t1 . S
S
DP1 D some
[Figure 2 displays the two phrase markers that QR derives for (5): in (a), DP1 some Italian city adjoins above DP2 every body in t1, so each raised phrase c-commands its trace; in (b), DP2 adjoins above DP1, so DP1 fails to c-command its trace t1.]
Fig. 2. The logical form of the Quantifier Raising example (5)
This kind of “Generate-and-Test” application of Proper Binding is obviously wasteful. There is no need to actually create LF trees that are doomed to failure. Consider the origins of possible Proper Binding violations. Because QR stacks up subtrees at the front of the sentence, the main threat is posed by inopportune choice of subtree to QR. A precedence-ordered list of quantified nodes is destined to fail the PBC just in case that list orders a super-constituent before one of its sub-constituents. This observation paves the way for a change of representation from concrete hierarchical trees to lists of tree-addresses3. Principle 2 below reformulates the PBC in terms of this alternative representation. 2
3
C-command is a relationship defined on tree nodes in a syntactic structure. A node α c-commands another node β if the first node above α contains β. The “Gorn address” is used here. It is a method of addressing an interior node within a tree [14].) Here we illustrate the Gorn address as an integer list. The Gorn address of the tree root is an empty list [] with the first child [0] and the second child [1]. The j-th child of the node with the Gorn address [i] has an address [i, j − 1].
Deforesting Logical Form
19
Principle 2 (Linear Proper Binding Condition on quantified phrases) Let L = n1 , . . . , nm be a list of tree-node addresses of quantified phrases. L will violate the Proper Binding Condition if any address ni is a prefix of nj (i < j). The tree-addresses of every body in some Italian city and some Italian city in the surface structure tree are [0] and [0, 1, 0, 1, 1] respectively. The first address is the prefix of the latter. A list of quantified nodes [[0], [0, 1, 0, 1, 1]], indicating that every body outscopes some Italian city, will violate Principle 2. Recognizing this allows a parser to avoid constructing uninterpretable LFs like the one in Figure 2(b). Note that the effect of this linear PBC is exactly the same as that of its hierarchical cousin, Principle 1. The following sections take up alternative implementations of the function from sentences to LF representations. One of these implementations can be derived from the other. The first implementation obtains LF s after building LFs. The second one does not allocate any intermediate LF but instead calculates LF s directly based on surface structure analyses. 4.2
The Quantifier Raising Implementation
The Appendix presents a small Context Free Grammar used in a standard combinator parser4 that analyzes example (5). The LF-using implementation takes each PBC-respecting quantifier ordering, applies QR, then converts the resulting LF into an LF formula. Illicit LFs, such as the one in Figure 2(b), are never built due to the linear PBC. Raised quantified phrases are translated into pieces of formulas using two key rules: some Italian city . . . ⇒ ∃x (Italian-city(x) & . . . ) every body in x . . . ⇒ ∀y (body-in(y, x) → . . . ) All these procedures can be combined in a LF parser as in Listing 3. let withLF ss = Seq.map convert (Seq.map qr (candidate quantifier orderings ss))
Listing 3. QR analyzer with plug-in that uses LF
The higher-order function Seq.map applies its first argument to every answer in a stream of alternative answers. Our program faithfully calculates interactions between the constraints that May’s competence theory specifies. Its output corresponds to the LF representation in (6). # Seq.iter formula (analyze withLF "every body in some italian city met john");; ‘exists v31[forall v32[(italian-city(v31) & (body-in(v32,v31) -> met-john(v32)))]]’ - : unit = ()
4
A combinator is a function that takes other functions as arguments, yielding more interesting and complicated functions just through function-application rather than any kind of variable-binding [10][11]. Parser combinators are higher-order functions that put together more complex parsers from simpler ones. The parsing method is called combinatory parsing [2][13].
20
Z. Chen and J.T. Hale
4.3
Deforesting LF in Quantifier Raising
Executing analyze withLF allocates an intermediate list of quantifier-raised logical forms. These phrase markers will be used as the basis for LF representations by convert. The existence of an equivalent program, that does not construct them, would demonstrate the inessentiality of LF. let fmay (ss , places) = convert (qr (ss , places)) deforest let withoutLF ss = Seq.map fmay (candidate quantifier orderings ss)
Listing 4. LF-free quantifier analyzer
The key idea is to replace the composition of the two functions convert and qr into one deforested function fmay which does the same thing without constructing any intermediate LFs. Algorithm 1 provides pseudocode for this function, which can be obtained via a sequence of deforestation steps5 , as shown in Table 1. It refers to rules defined by Wadler [25], including the unfold program transformation symbolized in (4). The implementation takes advantage of the fact that the list of quantified nodes, QDPplaces, can be filtered for compliance with the linear PBC ahead of time. Table 1. Steps in the deforestation of QR example (5) Wadler’s Number Action in the program 3 unfold a function application convert with its definition 6 unfold a function application qr with its definition 7 broadcast the inner case statement of a QDP outwards 5 simplify the matching on a constructor. Use fact: NP is the second daughter of DP 5 get rid of the matching on a Some constructor Not relevant knot-tie in the outermost match. Realize we started by translating (convert (qr (ss,places))) == fmay (ss,places)
Since the PBC applies equally well to lists of “places”, it is no longer necessary to actually build tree structures. The deforested procedure recurs down the list of QDPplaces (Line 5); applies QR rules according to their quantifier type (Line 10 or 16) and turns phrases into pieces of formulas using predicateify (Line 7 or 13). No phrase markers are moved and no LF representations are built.
5
Wh-Movement
Apart from quantifier scope, perhaps the most seminal application of LF in syntax is motivated by wh-questions. The same deforestation techniques are equally applicable to this case. They similarly demonstrate that a LF-free semantic 5
The full derivation of the deforestation is omitted in the paper due to space limits. The rule numbers used in Table 1 is consistent with Wadler [25, 238, Figure 4].
Deforesting Logical Form
21
Algorithm 1. Pseudocode for the deforested fmay 1: 2: 3: 4: 5: 6:
function fmay (ss,QDPplaces) if no more QDPplaces then predicateify(ss) else examine the next QDPplace qdp if qdp = DP then D NP
7: 8: 9: 10: 11: 12:
every let restrictor = predicateify(NP) let v be a fresh variable name let body = ss with qdp replaced by an indexed variable ∀ v restrictor(v) → fmay (body, remaining QDPplaces) end if if qdp = DP then D NP
13: 14: 15: 16: 17: 18: 19:
some let restrictor = predicateify(NP) let v be a fresh variable name let body = ss with qdp replaced by an indexed variable ∃ v restrictor(v) ∧ fmay (body, remaining QDPplaces) end if end if end function
interpreter may be obtained by deforesting a straightforward implementation that does use LF. To understand this implementation, a bit of background on wh-questions is in order. The standard treatment of wh-questions reflects languages like English where a wh-word overtly moves to a clause-initial position when forming an interrogative. However, it is well-known that in Chinese, the same sorts of interrogatives leave their wh-elements in clause-internal positions. This has come to be known as “wh-in-situ” [18]. Although the wh-word remains in its surface position, some syntacticians have argued that movement to a clause-initial landing site does occur, just as in English, but that movement is not visible on the surface because it occurs at LF [16][17]. This analysis thus makes crucial use of LF as a level of analysis. 5.1
5.1 The ECP and the Argument/Adjunct Asymmetry
At the heart of the argument for LF is the idea that this "covert" movement in Chinese interrogatives creates ambiguity. Example (7) below is a case in point: if Move-α were allowed to apply freely, then both readings, (7a) and (7b), should be acceptable interpretations of the sentence. In actuality, only (7a) is acceptable.

(7) ni xiangzhidao wo weishenme mai shenme?
    you wonder I why buy what
    a. 'What is the x such that you wonder why I bought x?'
    b. *'What is the reason x such that you wonder what I bought for x?'
Fig. 3. The surface structure of the Chinese Wh-movement example (7)
The verb xiangzhidao 'wonder' in (7) forms a subordinate question. The presence of two alternative landing sites, shown in Figure 3, allows Wh-movement to derive two different LF possibilities. These two LFs, illustrated in Figure 4, correspond to the alternative readings. However, only reading (7a) is acknowledged by Chinese speakers. Sentence (7) does not question the reason for buying, as in (7b). Rather, it is a direct question about the object of mai 'buy'. Huang argues that the Empty Category Principle (ECP) [6] correctly excludes the unattested interpretation shown in Figure 4(b).

Principle 3 (The Empty Category Principle). A non-pronominal empty category (i.e., a trace) is properly governed by either a lexical head or its antecedent.⁶

Figure 4 illustrates the ECP's filtering action on example (7). In this figure, dotted arrows indicate government. In the ECP-respecting LF shown in Figure 4(a), the trace t1 of the moved wh-word shenme is an empty category lexically governed by the verb mai 'buy'. Trace t2 is antecedent-governed. In the ECP-failing LF of Figure 4(b), the trace of shenme, t1, is also lexically governed. However, weishenme's trace, t2, is left ungoverned. As an adjunct, it is not lexically governed. Nor is it antecedent-governed: the nearest binder weishenme lies beyond a clause boundary.
⁶ The antecedent can be a moved category (i.e., a wh-phrase). We follow Huang et al. [18] in using the classical, "disjunctive" version of the ECP.
Fig. 4. The logical forms of the Chinese Wh-movement example (7): (a) LF1, which respects the ECP; (b) *LF2, which violates it
Like the Proper Binding Condition, the ECP can also be reformulated as a constraint on lists, in this case lists of wh-elements. The key requirement is that only wh-arguments may outscope other wh-phrases.

Principle 4 (Linear Empty Category Principle on wh-questions). Let L = n1, . . . , nm be a scope-ordered list of wh-elements, where n1 has the widest scope and nm the narrowest. L violates the Empty Category Principle if any wh-adjunct occupies a position ni with i < m.

The correctness of Principle 4 derives from the fact that moved wh-elements always end up at the edge of a clause. If a wh-adjunct scopes over another wh-phrase, it must have crossed a clause boundary in violation of the ECP. We postulate LF representations as in (8), in which wh-phrases denote sets of propositions [21]. The wh-argument shenme scopes over the wh-adjunct weishenme. The answer to weishenme is a set of propositions which CAUSEs the action wo-mai to happen as part of the state s.

(8) λP∃x (P = ni-xiangzhidao (λQ∃s (Q = CAUSE (s, wo-mai (x)))))
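Principle 4 is simple enough to state as a function. The following encoding is ours, not part of the paper's implementation:

type wh = WhArgument of string | WhAdjunct of string

(* a scope-ordered list satisfies the linear ECP iff no wh-adjunct
   occupies a non-final position, i.e. outscopes another wh-phrase *)
let rec linear_ecp_ok = function
  | [] | [ _ ] -> true
  | WhAdjunct _ :: _ -> false
  | WhArgument _ :: rest -> linear_ecp_ok rest

(* linear_ecp_ok [WhArgument "shenme"; WhAdjunct "weishenme"] = true,
   matching reading (7a); the reversed order is rejected, like (7b) *)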
5.2 The Wh-Movement Implementation
The implementation of Wh-movement is mostly analogous to QR in Section 4.2. We define a combinator parser for the Chinese fragment in the Appendix. A Wh-movement function fills a landing site with a wh-phrase and creates a trace. A filter applies the ECP to the derived LFs.

This Chinese parser differs from the QR example in its ability to handle subordinate clauses. It obtains LF representations for the main clause and the subordinate clause separately and then encapsulates them together, as shown in Figure 5.

Fig. 5. cpconvert translates the LF of example (7) into LF′: (a) the main clause, λP∃x[P = ni-xiangzhidao (Subordinate Clause)]; (b) the subordinate clause, λQ∃s[Q = CAUSE(s, wo-mai (x))]

The cpconvert function applies two mapping rules:

shenme ⇒ λP∃x (P = . . . x . . . )
weishenme ⇒ λQ∃s (Q = CAUSE (s, . . . ))

This function plugs in to yield an LF-using analyzer as shown in Listing 5.

let withCHSLF ss = Seq.map cpconvert (Seq.map mvwh (candidate_whp_orderings ss))
Listing 5. wh-question analyzer with plug-in that uses LF
This analyzer correctly finds the LF shown in (8).

# Seq.iter formula (analyzeCHS withCHSLF "ni xiangzhidao wo weishenme mai shenme");;
'lambda h[exists v6[h=[ni-xiangzhidao(lambda i[exists s[i=[CAUSE(s,wo-mai(v6))]]])]]]'
- : unit = ()
5.3 Deforesting LF in Wh-Movement
Executing the program in Section 5.2 causes intermediate wh-moved LF trees to be created. This is similar to the QR program discussed in Section 4.2, which allocates quantifier-raised structures. Using the same deforestation techniques, an equivalent program that does not allocate these intermediate data structures can be obtained.
let fhuang (ss, places) = cpconvert (mvwh (ss, places))   (* to be deforested *)
let withoutCHSLF ss = Seq.map fhuang (candidate_whp_orderings ss)
Listing 6. LF-free analyzer for Chinese wh-questions
The Wh-movement function mvwh takes a pair consisting of a surface structure and an ECP-compliant list of wh-element addresses. After deforestation, the resulting function fhuang in Algorithm 2 is similar to fmay but also has the ability to handle complex embedded sentences. Apart from deciding the type of wh-phrase (Lines 6 and 20), an additional condition detects whether the current structure is complex (Lines 10 and 24). If it is, the program obtains the LF representation for the first clause and then recursively works on the subordinate clause and the remaining wh-phrases (Lines 17 and 31). Table 2 lists the deforestation steps used in the derivation of this deforested fhuang.

Algorithm 2. Pseudocode for the deforested fhuang
1:  function fhuang(ss, WHPplaces)
2:    if no more WHPplaces then
3:      predicateify(ss)
4:    else
5:      examine the next WHPplace whp
6:      if whp = [DP Nominal-wh shenme] then
7:        let v be a fresh variable name
8:        let P be a fresh variable name
9:        let body = ss with whp replaced by an indexed variable
10:       if body has no subordinate clause then
11:         λP ∃v . P = fhuang(body, remaining WHPplaces)
12:       else
13:         let Q be a fresh variable name
14:         let cp1 = body with the subordinate clause replaced by Q
15:         let cp2 = the subordinate clause of body
16:         let WHPplaces = the WH-phrase places of cp2
17:         λP ∃v . P = (λQ . predicateify(cp1)) fhuang(cp2, WHPplaces)
18:       end if
19:     end if
20:     if whp = [AdvP Adverbial-wh weishenme] then
21:       let s be a variable name for the state/event
22:       let P be a fresh variable name
23:       let body = ss with whp replaced by an indexed variable
24:       if body has no subordinate clause then
25:         λP ∃s . P = CAUSE(s, fhuang(body, remaining WHPplaces))
26:       else
27:         let Q be a fresh variable name
28:         let cp1 = body with the subordinate clause replaced by Q
29:         let cp2 = the subordinate clause of body
30:         let WHPplaces = the WH-phrase places of cp2
31:         λP ∃s . P = (λQ . CAUSE(s, predicateify(cp1))) fhuang(cp2, WHPplaces)
32:       end if
33:     end if
34:   end if
35: end function
Table 2. Steps in the deforestation of Wh-movement example (7)

Wadler's Number   Action in the program
3                 unfold the function application cpconvert with its definition
6                 unfold the function application mvwh with its definition
7                 broadcast the inner case statement of a WHP outwards
5                 simplify the matching on a constructor; set the subordinate clause as the current structure; update WHP addresses accordingly
5                 get rid of the matching on a Some constructor
Not relevant      a knot-tie in the outermost match; we started by translating cpconvert (mvwh (ss, places)) == fhuang (ss, places)
6 Conclusion
Parsing systems initially defined with the help of LF can be reformulated so as to avoid constructing any LFs; so it turns out, at least, in the well-known cases of English quantifier scope and Chinese wh-questions. These findings suggest that LF may not in fact be an essential level of representation in a parser, because it can always be deforested away. They are consistent with the alternative organization depicted in Figure 6.
Fig. 6. The interlevel relationships from a sentence to LF′s: parsers map sentences to surface structures; move-α and the mapping rules take surface structures through logical forms to LF′s; the deforested f maps surface structures to LF′s directly
Of course, it remains to be seen whether some facet of the LF idea will be uncovered that is in principle un-deforestable. The positive results suggest that such an outcome is unlikely, because the central elements of May's and Huang's proposals are so naturally captured by the programs in Listings 3 and 5. In the end, there is no formal criterion for having adequately realized the spirit of a linguistic analysis in an automaton. The most we can ask is that the grammar somehow be recognizable within the parser (Hypothesis 1). Nevertheless, the result harmonizes with a variety of other research. For instance, our method uses lists of quantified phrases or wh-phrases in a way that is reminiscent of Cooper storage [9], which stores the denotations of quantified phrases or wh-phrases and retrieves them in a certain order. The linear versions of the PBC and the ECP could be construed as constraints on the order of retrieving them from the store.
In addition, our method is consistent with the work of Mark Johnson, who uses the fold/unfold transformation to, in a sense, deforest the S-structure and D-structure levels of representation out of a deductive parser [20]. It also harmonizes with the research program that seeks directly compositional competence grammars [19][24]. The present paper argues only that LF is unnecessary in the processor, leaving open the possibility that LF might be advantageous in the grammar for empirical or conceptual reasons. Proponents of directly compositional grammars have argued that the available evidence fails to motivate LF even in competence. If they are right, then a fortiori there is no need for LF in performance models.
Acknowledgements. The authors would like to thank Hongyuan Dong, Scott Grimm, Tim Hunter, Graham Katz, Greg Kobele, Alan Munn, Michael Putnam, and Chung-chieh Shan for helpful feedback on this material. They are not to be held responsible for any errors, misstatements or other shortcomings that persist in the final version.
References

1. Berwick, R.C., Weinberg, A.S.: The Grammatical Basis of Linguistic Performance. MIT Press, Cambridge (1984)
2. Burge, W.H.: Recursive Programming Techniques. Addison-Wesley, Reading (1975)
3. Burstall, R., Darlington, J.: A transformation system for developing recursive programs. Journal of the Association for Computing Machinery 24(1), 44–67 (1977)
4. Chomsky, N.: Syntactic Structures. Mouton de Gruyter, Berlin (1957)
5. Chomsky, N.: Aspects of the Theory of Syntax. MIT Press, Cambridge (1965)
6. Chomsky, N.: Lectures on Government and Binding. Foris, Dordrecht (1981)
7. Chomsky, N.: Approaching UG from below. In: Sauerland, U., Gärtner, H.-M. (eds.) Interfaces + Recursion = Language? Chomsky's Minimalism and the View from Syntax-Semantics, pp. 1–29. Mouton de Gruyter, Berlin (2007)
8. Chomsky, N.: On phases. In: Freidin, R., Otero, C., Zubizarreta, M.-L. (eds.) Foundational Issues in Linguistic Theory: Essays in Honor of Jean-Roger Vergnaud, pp. 133–166. MIT Press, Cambridge (2008)
9. Cooper, R.H.: Montague's Semantic Theory and Transformational Syntax. Ph.D. thesis, University of Massachusetts (1975)
10. Curry, H.B., Feys, R.: Combinatory Logic, vol. 1. North-Holland, Amsterdam (1958)
11. Curry, H.B., Hindley, J.R., Seldin, J.P.: Combinatory Logic, vol. 2. North-Holland, Amsterdam (1972)
12. Fiengo, R.: Semantic Conditions on Surface Structure. Ph.D. thesis, MIT, Cambridge (1974)
13. Frost, R., Launchbury, J.: Constructing natural language interpreters in a lazy functional language. The Computer Journal 32(2), 108–121 (1989)
14. Gorn, S.: Explicit definitions and linguistic dominoes. In: Hart, J., Takasu, S. (eds.) Systems and Computer Science. University of Toronto Press (1967)
15. Hornstein, N.: Logical Form: From GB to Minimalism. Blackwell, Oxford (1995)
16. Huang, C.T.J.: Logical relations in Chinese and the theory of grammar. Ph.D. thesis, MIT, Cambridge (1982); edited version published by Garland, New York (1998)
17. Huang, C.T.J.: Move wh in a language without wh-movement. The Linguistic Review 1, 369–416 (1982)
18. Huang, C.T.J., Li, Y.H.A., Li, Y.: The Syntax of Chinese. Cambridge University Press, Cambridge (2009)
19. Jacobson, P.: Paycheck pronouns, Bach-Peters sentences, and variable-free semantics. Natural Language Semantics 8(2), 77–155 (2000)
20. Johnson, M.: Parsing as deduction: the use of knowledge of language. Journal of Psycholinguistic Research 18(1), 105–128 (1989)
21. Karttunen, L.: Syntax and semantics of questions. Linguistics and Philosophy 1, 3–44 (1977)
22. Larson, R., Segal, G.: Knowledge of Meaning. MIT Press, Cambridge (1995)
23. May, R.: The grammar of quantification. Ph.D. thesis, MIT, Cambridge (1977)
24. Steedman, M.: The Syntactic Process. MIT Press, Cambridge (2000)
25. Wadler, P.: Transforming programs to eliminate trees. Theoretical Computer Science 73, 231–248 (1990)
Appendix: Grammar of Two Examples

Grammar of QR example (5):

S → DP VP          D → every
DP → D NP          D → some
DP → PN            PN → John
NP → Adj N         N → city
NP → Adj NP        N → body
N → N PP           Adj → Italian
VP → V DP          P → in
VP → V DP PP       V → met
PP → P DP

Grammar of Wh-movement example (7):

CP → Spec C′       Pronoun → ni
C′ → C TP          Pronoun → wo
TP → DP T′         NOMwh → shenme
TP → Spec T′       ADVwh → weishenme
T′ → T VP          V → xiangzhidao
DP → Pronoun       V → mai
DP → NOMwh         NP → NOMwh
AdvP → ADVwh
V′ → V CP
V′ → AdvP V′
V′ → V DP
VP → V′
On the Probability Distribution of Typological Frequencies

Michael Cysouw
Max Planck Institute for Evolutionary Anthropology, Leipzig
[email protected]

Abstract. Some language types are more frequent among the world's languages than others, and the field of linguistic typology attempts to elucidate the reasons for such differences in type frequency. However, there is no consensus in that field about the stochastic processes that shape these frequencies, and there is thus likewise no agreement about the expected probability distribution of typological frequencies. This paper explains the problem and presents a first attempt to build a theory of typological probability purely based on processes of language change.
1 Probability Distributions in Typological Research
A central objective of the typological study of linguistic diversity is to explain why certain kinds of linguistic structures are much more frequently attested among the world's languages than others. Unfortunately, such interpretations of empirically attested frequencies often rely on purely non-mathematical intuitions to judge whether observed frequencies are in any sense noteworthy. In the typological literature, this frequently leads to a tacit assumption that typological frequencies are evenly distributed, i.e. that a priori all language types should be equally frequent, and that any observed skewing of frequencies is thus in need of an explanation. Such argumentation can be found, for example, in the widely read typological textbook by Comrie [2]:

"In a representative sample of languages, if no universal were involved, i.e. if the distribution of types along some parameter were purely random, then we would expect each type to have roughly an equal number of representatives. To the extent that the actual distribution departs from this random distribution, the linguist is obliged to state and, if possible, account for this discrepancy." (p. 20)

Various more sophisticated approaches to the interpretation of empirical frequencies assume an underlying normal (or, more precisely, multinomial) distribution, as indicated, for example, by the regular use of χ² statistics or Fisher's exact test (e.g. in Cysouw [3]). Janssen et al. [6] and Maslova [12] explicitly discuss the problem of such tacit assumptions about the underlying probability distributions in linguistic typology. As a practical solution to circumvent this problem for the assessment of statistical significance, Janssen et al. [6] propose randomization-based significance tests. Such tests do not make
any assumptions about the underlying probability distribution. This of course still leaves open the question about the nature of these distributions.

There is a small literature that explicitly deals with the question of the underlying probability distribution of typological variables, but there is no agreement whatsoever. The first paper to make any such proposal was Lehfeldt [9], who proposes a gamma distribution for the size of phoneme inventories. In reaction to this claim, Justeson and Stephens [7] proposed a log-normal distribution for the same data. The size of the phoneme inventory is a clear case of linguistic complexity. More generally, Nichols [14] (and more recently Nichols et al. [15]) proposed a normal distribution for linguistic complexity. Similarly, Maddieson [10] hints at a normal distribution for the size of consonant inventories. Finally, Maslova [12] argues that the frequencies of types in the World Atlas of Language Structures (henceforth WALS, [5]) follow a Pareto distribution. There are thus at least proposals for gamma, log-normal, normal and Pareto distributions of typological variables.

Most of these proposals arrive at their distribution on the basis of the inspection of empirical values. For example, the only argument Lehfeldt [9] offered for the gamma distribution is a (rather speculative) interpretation of some moment-like characteristics of the empirical frequencies. Nichols et al. [15] only observe a bell-shaped distribution, and propose a normal distribution on that meagre basis. Though empirical distributions might suggest a particular underlying probability distribution, they can never be used to argue for a particular distribution. Both the gamma distribution and the log-normal distribution can be fitted easily to the empirically observed phoneme-size distribution. The proper argument for a particular probability distribution is the explication of the stochastic process that causes the distribution to arise. This approach was used by Justeson and Stephens [7] in their plea for a log-normal distribution of phoneme inventory size. Phoneme inventories, they argue, are based on phonological feature inventories. Given n binary features, it is possible to compose 2ⁿ phonemes. Now, assuming that feature inventories are normally distributed (a claim they do not further elucidate), phoneme inventories will thus be log-normally distributed (i.e. the logarithm of the phoneme inventory size will be normally distributed). Irrespective of the correctness of their claim, this argument is a good example of an attempt to find a stochastic reason for a particular distribution. The actual proposal of Justeson and Stephens is not convincing, because they do not substantiate the normal distribution of feature inventories. Still, their approach is an important step in the right direction.
2 The Stochastic Process of Language Change
To investigate the nature of any underlying probability distribution, it is necessary to consider the stochastic process that causes the phenomenon at hand. For typological frequencies, there are at least two (non-exclusive) kinds of processes to consider. The frequencies of linguistic types in the world's languages are partly shaped by cognitive processes, and partly by diachronic
processes. In this paper I will restrict myself to further investigating the latter idea, namely that the process of language change determines the probability distribution of typological frequencies.

The synchronic frequencies of a typological variable (for example, the word order of verb and object, cf. [4]) can be seen as the result of the diachronic processes of language change (cf. Plank and Schellinger [16]). More precisely, the current number of languages of a particular linguistic type can be analyzed as the result of a Markov process in which languages change from one type to another sequentially through time (cf. Maslova [11]). For example, a verb-object language can change into an object-verb language, and vice versa, and this process of change from one type to the other determines the probability distribution of the linguistic type.

As a first (strongly simplified) approach to the stochastic nature of this process of type-change, I will in this paper consider type-change as a simple birth-death process: a verb-object language is "born" when an object-verb language changes to a verb-object language, and a verb-object language "dies" when this language changes to an object-verb language.¹ Also as a first approximation, I will assume that such type-changes take place according to a Poisson process. A Poisson process is the stochastic process in which events occur continuously and independently of one another, which seems to be a suitable assumption for language change. Such a basic birth-death model with events happening according to a Poisson distribution is known in queueing theory as an M/M/1 process (using the notation of Kendall [8], in which M stands for a "Markovian" process).

Normally, queueing models are used to describe the behavior of a queue in a shop. Given a (large) population of potential buyers, some will once in a while come to a cash register to pay for some goods (i.e. a "birth" in the queue) and then, possibly after some waiting time, pay and leave the queue again (i.e. a "death" in the queue). The queueing model also presents a suitable metaphor to illuminate the dynamics of typological variables. Consider all the world's languages throughout the history of homo loquens as the (large) population under investigation. Through time, languages change from one type to another type, and vice versa. Metaphorically, one can then interpret the number of languages of a particular type at a particular point in time as the length of a queue. A central parameter of a queueing model is the traffic rate t, which is defined as the ratio of the arrival rate λ to the departure rate μ: t = λ/μ. The arrival and the departure rate designate the average number of arrivals and departures in the queue per time unit. However, time is factored out in the traffic
¹ Altmann [1] also uses a birth-death model to investigate distributions in a cross-linguistic context. However, he investigates another kind of phenomenon, namely the probability distribution of the number of different types in dialectological maps. This aspect is closely related to the number of types per map in a typological atlas like WALS, though there is more arbitrariness in the number of types in a typological map compared to the number of types in a dialectological map. Altmann convincingly argues that the number of types on dialectological maps should be negatively binomially distributed.
rate t, so this rate is just a general indication of the dynamics of the queue. In a stable queue, the traffic rate must be between 0 and 1 (a traffic rate larger than one would result in indefinite growth of the queue). Now, in an M/M/1 model with traffic rate t, the queue length q is distributed according to (1), which is a slight variation on a regular (negative) exponential distribution (cf. Mitzenmacher and Upfal [13], p. 212). Following this model, typological frequencies should accordingly be roughly (negatively) exponentially distributed.

P(q = n) = (1 − t) · tⁿ    (1)
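To make the queueing metaphor concrete, the following small OCaml simulation (our illustration, not from the paper; the rates λ = 0.85 and μ = 1 are arbitrary choices echoing the traffic rate fitted in Section 3) runs the continuous-time M/M/1 birth-death process and compares the time-averaged queue-length distribution with (1).

let () =
  Random.self_init ();
  let lam = 0.85 and mu = 1.0 in
  let t = lam /. mu in
  let max_n = 10 in
  let time_in = Array.make (max_n + 1) 0.0 in
  let total = ref 0.0 in
  let q = ref 0 in
  (* sample an exponential waiting time with the given rate *)
  let exp_sample rate = -. (log (1.0 -. Random.float 1.0)) /. rate in
  for _ = 1 to 1_000_000 do
    let rate = if !q = 0 then lam else lam +. mu in
    let dt = exp_sample rate in
    if !q <= max_n then time_in.(!q) <- time_in.(!q) +. dt;
    total := !total +. dt;
    (* next event: a birth with probability lam/(lam+mu), else a death *)
    if !q = 0 || Random.float 1.0 < lam /. (lam +. mu) then incr q else decr q
  done;
  for n = 0 to max_n do
    Printf.printf "q = %2d  simulated %.4f  predicted %.4f\n"
      n (time_in.(n) /. !total) ((1.0 -. t) *. (t ** float_of_int n))
  done

Running it shows the simulated time fractions converging to the geometric values (1 − t)tⁿ.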
3 Meta-typological Fitting
To get an impression of how these assumptions fare empirically, I will present a small meta-typological experiment (for an introduction to meta-typology, cf. Maslova [12]). As described earlier, such an experiment is no argument for a particular distribution; it will only show that it is possible to model empirical frequencies by using a negative exponential distribution. Whether this is indeed the right distribution can never be proved by a well-fitted curve. The theoretical derivation of the distribution has to be convincing, not the empirical adequacy. For the meta-typological experiment, I randomly selected one cross-linguistic type from each chapter of WALS [5]. A histogram of the number of languages per type is shown in Figure 1. Based on a similar distribution, Maslova [12] proposed that the size of cross-linguistic types follows a Pareto distribution. However, as described in the previous section, it seems to make more sense to consider this an exponential distribution as defined in (1).

To fit the empirical data to the proposed distribution in (1), I divided the types from WALS into bins of size 10, i.e. all types with 1 to 10 languages were combined into one group, likewise all types with 11 to 20 languages, etc. For each of these bins, I counted the number of types in it. For example, there are 18 types that have between 1 and 10 languages, which is 12.9% of all 140 types. So, the probability of having a "queue" with a length between 1 and 10 languages is 0.129. Fitting these empirical probabilities for type sizes to the proposed distribution in (1) results in a traffic rate t = .85 ± .01.²

² I used the function nls ("non-linear least squares") from the statistics environment R [17] to estimate the traffic rate from the data, given the predicted distribution in (1). Also note that these fitted values represent a random sample of types from WALS, and the results will thus differ slightly depending on the choice of types.

It is important to note that I took the bare number of languages per type as documented in each chapter of WALS. This decision has some complications, because the set of languages considered (i.e. the "sample") differs considerably between the chapters of WALS. The number of languages in a particular type will normally differ depending on whether the researcher considered 100 or 500 languages as a sample. Still, I decided against normalizing the samples,
because that would introduce an artificial upper boundary. This decision implies, however, that the distribution of type size in the current selection of data from WALS is also influenced by yet another random variable, namely the size of the sample in each chapter. From the perspective of typology, this is a rather strange approach, because typological samples attempt to sample the current world's languages. For the current purpose, however, the population to be sampled is not the current world's languages, but all languages that were ever spoken, or will ever be spoken, through time and space. From that perspective, any restriction on sample size will only restrict the number of languages; it will influence neither the traffic rate nor the type distribution.

The relation between the empirically observed probabilities and the fitted probabilities is shown in Figure 2. It is thus easily possible to fit the empirical data nicely to the proposed distribution in (1). However, as argued in the previous section, this is not a proof of the proposal, but only an illustration. The nature of a probability distribution can never be empirically proven, only made more plausible by a solid analysis of the underlying processes.

Fig. 1. Histogram of type sizes from WALS (x-axis: size of types; y-axis: frequency)

Fig. 2. Fit of the empirical distribution to the predicted distribution (x-axis: empirically observed probabilities; y-axis: fitted probabilities)
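The binned fit described above can be reproduced in a few lines. The sketch below is our reconstruction, not the author's R code: bin k covers type sizes 10k+1 to 10k+10, whose probability under (1) is t^(10k+1)(1 − t^10); these bin probabilities sum to t, since types with zero languages fall outside every bin. The "observed" values here are synthetic, generated from t = 0.85, because the paper reports only one bin count.

let bin_prob t k = (t ** float_of_int (10 * k + 1)) *. (1.0 -. (t ** 10.0))

(* grid search over t in (0, 1) minimizing the squared error *)
let fit obs =
  let err t =
    List.fold_left
      (fun acc (k, p) -> let d = bin_prob t k -. p in acc +. (d *. d))
      0.0 obs
  in
  let best = ref (0.001, err 0.001) in
  for i = 2 to 999 do
    let t = float_of_int i /. 1000.0 in
    let e = err t in
    if e < snd !best then best := (t, e)
  done;
  fst !best

let () =
  let synthetic = List.init 14 (fun k -> (k, bin_prob 0.85 k)) in
  Printf.printf "recovered t = %.3f\n" (fit synthetic)   (* prints 0.850 *)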
4 Outlook
The model presented in this paper is restricted to a very simplistic birth-death model of typological change. More complex models will have to be considered to also cover the more interesting cases, like the size of phoneme inventories. Basically, to extend the current approach, Markov models involving multiple states and specified transition probabilities between these states are needed. For example, phoneme inventories can be considered a linearly ordered set of states, and the process of adding or losing one phoneme can also be considered a Poisson process. At this point, I do not know what the resulting probability distribution would be in such a model, but it would not surprise me if Lehfeldt's [9] proposal of a gamma distribution turned out to be in the right direction after all.
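The multi-state extension suggested above is easy to prototype, since a birth-death chain has a closed-form stationary distribution. The code below is our construction with arbitrary illustrative rates, not a model from the paper: lam.(i) is the rate of moving from size i to i+1 and mu.(i) the rate from i+1 back to i, and the stationary distribution satisfies pi(i+1) = pi(i) · lam.(i)/mu.(i).

let stationary lam mu =
  let n = Array.length lam in
  let w = Array.make (n + 1) 1.0 in
  for i = 0 to n - 1 do
    w.(i + 1) <- w.(i) *. lam.(i) /. mu.(i)
  done;
  let z = Array.fold_left ( +. ) 0.0 w in
  Array.map (fun x -> x /. z) w

let () =
  (* constant gain rate, loss rate proportional to inventory size: the
     stationary distribution is then Poisson(8), a unimodal bell shape
     rather than the geometric shape of (1) *)
  let n = 40 in
  let lam = Array.make n 8.0 in
  let mu = Array.init n (fun i -> float_of_int (i + 1)) in
  Array.iteri
    (fun i p -> if p >= 0.01 then Printf.printf "%2d %.3f\n" i p)
    (stationary lam mu)

With constant rates the chain reduces to the geometric distribution of (1); size-dependent rates shift the mass toward a unimodal shape of the kind Lehfeldt's gamma proposal envisages.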
References

1. Altmann, G.: Die Entstehung diatopischer Varianten. Zeitschrift für Sprachwissenschaft 4(2), 139–155 (1985)
2. Comrie, B.: Language Universals and Linguistic Typology. Blackwell, Oxford (1989)
3. Cysouw, M.: Against implicational universals. Linguistic Typology 7(1), 89–101 (2003)
4. Dryer, M.S.: Order of object and verb. In: Haspelmath, M., Dryer, M.S., Gil, D., Comrie, B. (eds.) The World Atlas of Language Structures, pp. 338–341. Oxford University Press, Oxford (2005)
5. Haspelmath, M., Dryer, M.S., Comrie, B., Gil, D. (eds.): The World Atlas of Language Structures. Oxford University Press, Oxford (2005)
6. Janssen, D.P., Bickel, B., Zúñiga, F.: Randomization tests in language typology. Linguistic Typology 10(3), 419–440 (2006)
7. Justeson, J.S., Stephens, L.D.: On the relationship between the numbers of vowels and consonants in phonological systems. Linguistics 22, 531–545 (1984)
8. Kendall, D.G.: Stochastic processes occurring in the theory of queues and their analysis by the method of the imbedded Markov chain. The Annals of Mathematical Statistics 24(3), 338–354 (1953)
9. Lehfeldt, W.: Die Verteilung der Phonemanzahl in den natürlichen Sprachen. Phonetica 31, 274–287 (1975)
10. Maddieson, I.: Consonant inventories. In: Haspelmath, M., Dryer, M.S., Gil, D., Comrie, B. (eds.) The World Atlas of Language Structures, pp. 10–13. Oxford University Press, Oxford (2005)
11. Maslova, E.: A dynamic approach to the verification of distributional universals. Linguistic Typology 4(3), 307–333 (2000)
12. Maslova, E.: Meta-typological distributions. Sprachtypologie und Universalienforschung 61(3), 199–207 (2008)
13. Mitzenmacher, M., Upfal, E.: Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, Cambridge (2005)
14. Nichols, J.: Linguistic Diversity in Space and Time. University of Chicago Press, Chicago (1992)
15. Nichols, J., Barnes, J., Peterson, D.A.: The robust bell curve of morphological complexity. Linguistic Typology 10(1), 96–106 (2006)
16. Plank, F., Schellinger, W.: Dual laws in (no) time. Sprachtypologie und Universalienforschung 53(1), 46–52 (2000)
17. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2008)
A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus

Timothy A.D. Fowler
Department of Computer Science, University of Toronto
10 King's College Rd., Toronto, ON, M5S 3G4, Canada
[email protected]

Abstract. [2] introduced the bounded-order Lambek calculus and provided a polynomial time algorithm for its sequent derivability. However, this result is limited because it requires exponential time in the presence of lexical ambiguity. That is, [2] did not provide a polynomial time parsing algorithm. The purpose of this paper is to provide such an algorithm. We prove an asymptotic bound of O(n⁴) for parsing and improve the bound for sequent derivability from O(n⁵) to O(n³).
1 Introduction
The Lambek calculus (L) [5] is a categorial grammar formalism having a number of attractive properties for modelling natural language. In particular, it is strongly lexicalized, which allows the parsing problem to be a problem over a local set of categories rather than a global set of rules. Furthermore, there is a categorial semantics accompanying every syntactic parse. Despite these attractive properties, [7] proved that L is weakly equivalent to context-free grammars (CFGs), which are widely agreed to be insufficient for modelling all natural language phenomena. In addition, [8] proved that the sequent derivability problem for L is NP-complete. The weak equivalence to CFGs has been addressed by a number of authors. [4] introduced a mildly context-sensitive extension of L, increasing the weak generative capacity of the formalism, and [9] proved that L is more powerful than CFGs in terms of strong generative capacity. [6] proved that restrictions of the multi-modal non-associative Lambek calculus are mildly context-sensitive, which raises interesting questions as to the importance of associativity to parsing.

We address the issue of parsing complexity by extending the results of [2] and providing a parsing algorithm for the Lambek calculus which runs in O(n⁴) time for an input of size n when the order of categories is bounded by a constant. The key to this result is in restricting Fowler's representation of partial proofs to be local to certain categories rather than global. Our results are for the product-free fragment of the Lambek calculus (L) and the variant that allows empty premises (L∗), for simplicity and because the product connective has limited linguistic application. This work can also be seen as a generalization of
[1], which proved that if the order of categories is less than two, then polynomial time sequent derivability is possible.
2 The Lambek Calculus
The set of categories C is built up from a set of atoms (e.g. {S, NP, N, PP}) and the two binary connectives / and \. A Lambek grammar G is a 4-tuple ⟨Σ, A, R, S⟩, where Σ is an alphabet, A is a set of atoms, R is a relation between symbols in Σ and categories in C, and S is the set of sentence categories. A sequent is a sequence of categories (called the antecedents) together with the symbol ⊢ and one more category (called the succedent). The parsing problem for a string of symbols s1 . . . sl can be characterized in terms of the sequent derivability problem as follows: s1 . . . sl ∈ Σ∗ is parseable in a Lambek grammar ⟨Σ, A, R, S⟩ if there exist c1, . . . , cl ∈ C and s ∈ S such that ci ∈ R(si) for 1 ≤ i ≤ l and the sequent c1 . . . cl ⊢ s is derivable in the Lambek calculus.

The sequent derivability problem can be characterized by conditions on term graphs [3]. A term graph for a sequent is formed in a two-step process. The first step is deterministic and begins by associating a polarity with each category in the sequent: negative for the antecedents and positive for the succedent. Next, we consider the polarized categories as vertices and decompose the slashes according to the following vertex rewriting rules, where solid arrows (→) are regular edges and dashed arrows (⇢) are Lambek edges:

(α/β)⁻ ⇒ α⁻ → β⁺
(β\α)⁻ ⇒ β⁺ ← α⁻
(α/β)⁺ ⇒ β⁻ ⇠ α⁺
(β\α)⁺ ⇒ α⁺ ⇢ β⁻

The neighbourhood of the polarized category on the left of each rule is assigned to α. The dashed edges are called Lambek edges and the non-dashed edges are called regular edges. This process translates categories into trees. Then, additional Lambek edges are introduced from the root of the succedent tree to the roots of the antecedent trees; these are referred to as rooted Lambek edges. The result of the graph rewriting is an ordered sequence of polarized atoms.

The second step is non-deterministic and assigns a complete matching of the polarized atoms. The matching must be planar above the atoms and must match occurrences of atoms only with occurrences of the same atom of opposite polarity. These pairings of atom occurrences are called matches.

Fig. 1. An integral term graph for the sequent S/(NP\S), (NP\S)/NP, NP ⊢ S
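As a sanity check on the rewriting rules, the following sketch (our own code, not the paper's; it computes only the ordered polarized atoms and omits the edges) reproduces the atom sequence of the term graph in Fig. 1.

type pol = Pos | Neg

type cat =
  | Atom of string
  | Slash of cat * cat  (* Slash (a, b) encodes a/b *)
  | Back of cat * cat   (* Back (b, a) encodes b\a *)

(* the four rewriting rules, keeping only the ordered atom occurrences *)
let rec atoms p c =
  match c, p with
  | Atom s, _ -> [ (s, p) ]
  | Slash (a, b), Neg -> atoms Neg a @ atoms Pos b   (* (α/β)⁻ ⇒ α⁻ β⁺ *)
  | Slash (a, b), Pos -> atoms Neg b @ atoms Pos a   (* (α/β)⁺ ⇒ β⁻ α⁺ *)
  | Back (b, a), Neg -> atoms Pos b @ atoms Neg a    (* (β\α)⁻ ⇒ β⁺ α⁻ *)
  | Back (b, a), Pos -> atoms Pos a @ atoms Neg b    (* (β\α)⁺ ⇒ α⁺ β⁻ *)

(* antecedents are negative, the succedent positive *)
let sequent_atoms antecedents succedent =
  List.concat_map (atoms Neg) antecedents @ atoms Pos succedent

(* For S/(NP\S), (NP\S)/NP, NP ⊢ S this returns
   S⁻ S⁺ NP⁻ NP⁺ S⁻ NP⁺ NP⁻ S⁺, the axiomatic atoms of Fig. 1. *)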
The edges in the matching are regular edges and are directed from the positive atom to the negative atom. An example of a term graph can be seen in Figure 1. We will prefix the term regular (resp. Lambek) to the usual definitions of graph theory when we mean to refer to the subgraph of a term graph obtained by restricting the edge set to only the regular (resp. Lambek) edges.

A term graph is L∗-integral if it satisfies the following conditions:

1. G is regular acyclic.
2. For every Lambek edge ⟨s, t⟩ in G there is a regular path from s to t.

A term graph is integral if it is L∗-integral and it satisfies the following:

3. For every Lambek edge ⟨s, t⟩ in G, there is a regular path from s to a vertex x in G such that if x has a non-rooted Lambek in-edge ⟨s′, x⟩ then there is no regular path from s to s′.

Theorem 1. A sequent is derivable in L iff it has an integral term graph. A sequent is derivable in L∗ iff it has an L∗-integral term graph.

Proof. See [3].
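Conditions 1 and 2 can be checked directly on an explicit graph. The sketch below is ours, with vertices as integers 0..n−1 and each edge set given as a list of directed pairs; it is meant only to make the conditions concrete.

(* all successors of u in a directed edge list *)
let succs edges u =
  List.filter_map (fun (a, b) -> if a = u then Some b else None) edges

(* is there a (possibly empty) directed path from s to t? *)
let reachable edges s t =
  let seen = Hashtbl.create 16 in
  let rec dfs u =
    u = t
    || (not (Hashtbl.mem seen u)
        && (Hashtbl.add seen u (); List.exists dfs (succs edges u)))
  in
  dfs s

(* condition 1: the regular subgraph is acyclic, i.e. no vertex reaches
   itself through at least one edge *)
let regular_acyclic n edges =
  not
    (List.exists
       (fun u -> List.exists (fun v -> reachable edges v u) (succs edges u))
       (List.init n (fun i -> i)))

(* conditions 1 and 2 together *)
let l_star_integral n regular lambek =
  regular_acyclic n regular
  && List.for_all (fun (s, t) -> reachable regular s t) lambek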
3 Parsing with L∗
[2] presented a chart parsing algorithm that uses an abstraction over term graphs¹ as a representation and considers the exponential number of matchings by building them incrementally from smaller matchings. The indices of the chart are the atoms of the sequent. Applying this algorithm to the parsing problem requires exponential time because, near the bottom of the chart, there would need to be one entry for every possible sequence of categories. We generalize this algorithm to the parsing problem by restricting the domain of the abstractions to the categories of the atoms at its endpoints.
3.1 Abstract Term Graphs
In this section, we introduce versions of the L∗-integrity conditions that can be enforced incrementally during chart parsing, using only the information contained in the abstractions. A partial matching is a contiguous sub-matching of a complete matching. The notion of partial term graph is defined in the obvious way. In a partial term graph, a vertex is open if it does not have a match. A partial term graph is L∗-incrementally integral if it satisfies the following:

1. G is regular acyclic.
2. For every Lambek edge ⟨s, t⟩, either there is a regular path from s to t, or there is an open positive vertex u such that there is a regular path from s to u and an open negative vertex v such that there is a regular path from t to v.
¹ There are some differences between the presentations. For discussion, see [3].
The intuition is that a partial term graph G is L∗-incrementally integral exactly when extending its matching to obtain an L∗-integral term graph has not already been ruled out. It should be easy to see that L∗-incrementally integral term graphs are L∗-integral. We will only insert representations into the chart for partial term graphs that are L∗-incrementally integral.

An abstract term graph A(G) is obtained from an L∗-incrementally integral term graph G by performing the following operations:

1. Deleting Lambek edges ⟨s, t⟩ such that there is a regular path from s to t.
2. Replacing Lambek edges ⟨s, t⟩ where t is not open by ⟨s, r⟩, where r is the open negative vertex from which there is a regular path to t.
3. Replacing Lambek edges ⟨s, t⟩ where s is not open by edges ⟨r, t⟩ for r ∈ R, where R is the set of open positive vertices with regular paths from s.
4. Contracting regular paths between open vertices to a single edge and deleting the non-open vertices on the path.

For the term graph G with the empty matching, A(G) = G.

Planar matchings are built out of smaller partial planar matchings in one of two ways: bracketing and adjoining. Bracketing a partial matching extends it by introducing a match between the two atoms on either side of the matching. Adjoining two partial matchings concatenates them to obtain a single larger matching. [2] details two sub-algorithms for bracketing and adjoining ATGs. Given an ATG A(G), Fowler's bracketing algorithm determines whether extending G by bracketing (for any term graph G with ATG A(G)) results in an incrementally integral term graph H, and returns the ATG A(H) if it does. Given two ATGs A(G1) and A(G2), the adjoining algorithm does the same for adjoining.
3.2 The Parsing Algorithm
Polynomial time parsing depends on the following key insight: the only information in ATGs needed to run the bracketing and adjoining algorithms is the edges between atoms originating from categories at the endpoints of the ATG.

Let G = ⟨Σ, A, R, S⟩ be a Lambek grammar and let s1 . . . sl ∈ Σ∗ be the input string. Because an atom or category can occur more than once in the input, we need the notions of category occurrence and atom occurrence for specific instances of a category and an atom, respectively. Given a category occurrence c, the partial term graph for c, denoted T(c), is the graph obtained from the deterministic step of term graph formation, without the rooted edges. The atom sequence for c, denoted A(c), is the sequence of atom occurrences of T(c). Given an atom occurrence a, C(a) is the category occurrence from which it was obtained, and given a category occurrence c, S(c) is the symbol si from which it was obtained. Our new definition of ATGs is as follows: for a partial term graph G whose matching has endpoints as and ae, its ATG A(G) is obtained as before but with the following final step:
5. Deleting all vertices v such that C(v) ≠ C(as) and C(v) ≠ C(ae).

The bracketing and adjoining algorithms of [2] require knowledge both about the ATGs that they are bracketing and adjoining and about the state of the term graphs before any matchings have been made, which is referred to as the base ATG. In our case, the base ATG is simply the union of the term graphs of all categories of all symbols. In addition, [2] includes labels on the Lambek edges of ATGs that are the same as the target of the Lambek edge in the base ATG. Section 4 details a simpler and more elegant method of representing Lambek edges in ATGs. Finally, since the partial term graphs for categories do not include the rooted Lambek edges, they will need to be inserted whenever an ATG G2 with a right endpoint in a succedent category is combined with an ATG G1 with a left endpoint in a different category than the left endpoint of G2. Then, Lambek edges are inserted from the positive vertices with in-degree 0 in G1 to the negative vertices with in-degree 0 in G2.

Our chart will be indexed by atom occurrences belonging to categories of symbols in the input string and categories in S. Entries consist of sets of ATGs, and each entry ⟨as, ae⟩ is specified by its leftmost matched atom as and its rightmost matched atom ae. A chart indexed in this way can be implemented by arbitrarily ordering the categories for a symbol and keeping track of the locations of word and category boundaries in the sequence of atoms. The next two paragraphs outline the insertion of ATGs into the chart.
C(ae ), as is the rightmost atom in C(as ) and ae is the leftmost atom in C(ae ). The ATG input to the bracketing algorithm is T (C(as )) ∪ T (C(ae )). Inserting ATGs for Non-minimal Matchings. Once we have inserted the ATGs for minimal matchings, we must process each ATG in the chart. That is, we consider all possible ways of bracketing and adjoining each ATG in the chart, which may require insertion of more ATGs that, in turn, need to be processed. An entry as , ae left subsumes an entry at , af if one of the following are satisfied: 1. at is equal to as 2. C(at ) = C(as ) = C and at appears to the right of as in A(C) = C(as ) and S(C(at ) appears to the right of S(C(as )) in the input 3. C(at )
string The notion of right subsumes is defined similarly for af and ae . An entry E subsumes an entry F if it left subsumes and right subsumes F . The intuition is that E subsumes F iff the endpoints of F appear between the endpoints of E.
A PTIME Algorithm for the Bounded Order Lambek Calculus
41
The size of an entry as , ae is the distance between as and ae , if C(as ) = C(ae ) and the distance between S(C(as )) and S(C(ae )), if C(as )
= C(ae ). However, any category with endpoints in the same category is smaller than any category for which they are not. We must take care not to process an entry until all ATGs for that entry have been inserted. All ATGs for an entry E have been inserted only after all entries that E subsumes have been processed. To ensure this, we process entries as follows. First, we process entries whose endpoints are in the same category. We process entries from smallest to largest and among entries of equal size from left to right. Second, we process entries whose endpoints are in different categories in the same order. To process an entry E = as , ae , we must consider all possible bracketings and adjoinings. To do this, we first calculate the set of atoms L(as ) and R(ae ), which will correspond to atoms which could occur to the left of as in a sequent and to the right of ae , respectively. If as is the leftmost atom in its category, then L(as ) is the set of atoms such that they are the rightmost atom in a category occurrence for the symbol occurring to the left of S(C(as )) in the input string. If S(C(as )) is the leftmost symbol in the input string then L(as ) = ∅. If as is not the leftmost atom in its category, then L(as ) is the singleton consisting of the atom to the left of as in A(C(as )). R(ae ) is computed analogously. Then, let G be an ATG in E and let al ∈ L(as ) and ar ∈ R(ae ). First, we run the bracketing algorithm with input G, al and ar . Then, for each ATG H in an entry F where F = a, al or F = ar , a for some atom a and such that F is smaller than E, we run the adjoining algorithm with input G and H. Lastly, for each ATG H in an entry the same size as E with endpoint al , we run the adjoining algorithm with input G and H. When running the bracketing and adjoining algorithms we insert the output into the chart at the appropriate place when they are successful. After processing every entry, we output “YES” if there are any ATGs in any entry as , ae such that as is the leftmost atom in a category in s1 and ae is the rightmost atom in a category in S. Otherwise, we output “NO”. Correctness. It is not hard to prove that the chart iteration process in the preceding sections considers every possible matching over every possible sequent for the sentence. Then, the bracketing and adjoining algorithms of [2] ensure any ATG inserted is the ATG of an L∗ -incrementally integral partial term graph. Finally, any ATG in an entry whose left endpoint is the leftmost atom of a category for s1 and whose right endpoint is the rightmost atom of a category for S is the ATG of an L∗ -incrementally integral term graph over some sequence of categories which must be an L∗ -integral term graph.
4 Running Time
[2] introduced the bounded order Lambek calculus, where the categories in the input have order bounded by a constant k. In the context of term graphs, bounding order by k is equivalent to bounding the length of paths in the base ATG. The key insight for achieving polynomial time is that bounding the lengths of paths in partial term graphs bounds the lengths of paths in ATGs in the chart, which in turn bounds their variation.

To improve the bound and simplify the representation, we introduce a better representation of ATGs. In the original definition of ATGs, the out-neighbourhood of negative vertices is not structured, and the applicability of the integrity conditions during bracketing and adjoining is informed by the labels on Lambek edges. However, from the ATG A(G) on the left of Fig. 2, we can deduce that for Lambek edges ⟨X, H⟩ and ⟨Y, I⟩ in G, there must be a regular path from Y to X. If we maintain this structure explicitly, then we can remove the need for labels on Lambek edges, since the labels are only used to deduce this structure. Furthermore, we can observe that any two positive vertices in an ATG that share an in-neighbour in the base ATG must have identical neighbourhoods in the ATG. Rather than representing each such positive vertex distinctly, we can use a placeholder that points to sets of positive vertices sharing in-neighbours in the base ATG. These two modifications are shown in Fig. 2, where the sets are listed below the placeholders.

Fig. 2. Comparing ATG representations

Finally, observe that any partition of the atoms of a category results in a path in the base ATG. Given an entry ⟨as, ae⟩, the partition of the atoms to the left of as and to the right of ae results in two such paths. Then, any two ATGs in ⟨as, ae⟩ can differ only by the neighbourhoods they assign to vertices incident to edges on one of the two paths. Bounded order gives us bounded paths, and our new representation bounds the branching factor on these paths, which means that the subgraph of an ATG manipulated during bracketing and adjoining is of constant size. Therefore, adjoining and bracketing are constant time operations. Also, the number of ATGs for an entry must be constant, since the variation within them is bounded by a constant.

Let n be the number of atoms in categories of symbols in the input. Then the chart is of size O(n²) and, while processing an entry, we bracket O(n) times and adjoin with the ATGs in up to O(n²) entries. Thus, the running time of our algorithm is O(n⁴). Without lexical ambiguity, we only need to adjoin with ATGs in O(n) entries, yielding a time bound of O(n³) for sequent derivability.
5 Parsing with L
To ensure correctness of the algorithm with respect to L, we need a notion of incremental integrity for partial term graphs. A partial term graph G is incrementally integral if it is L∗ -incrementally integral and satisfies the following:
3. For every Lambek edge ⟨s, t⟩ in G, either there is a regular path from s to a negative vertex x such that if x has a non-rooted Lambek in-edge ⟨s′, x⟩ then there is no regular path from s to s′, or there is an open positive vertex u such that there is a regular path from s to u, and there is a negative vertex v and an open negative vertex w such that there is a regular path from w to v and, if v has a non-rooted Lambek in-edge ⟨s′, v⟩, then there is no regular path from w to s′.

To enforce this condition during chart parsing, we associate an empty premise bit with the positive vertices in ATGs. The bit associated with vp represents whether there is a regular path in the term graph from the source of a Lambek edge that has not yet met condition 3 to vp. The empty premise bits for an ATG can be calculated analogously to the calculation of Lambek edges during adjoining and bracketing, since they represent the same notions of paths in underlying term graphs.
6 Conclusion
We have provided a parsing algorithm for both the Lambek calculus and the Lambek calculus allowing empty premises that runs in time O(n⁴) when the order of categories is bounded by a constant. We have also reduced the asymptotic bound for sequent derivability of [2] from O(n⁵) to O(n³). To the best of our knowledge, no linguistic analysis has called for categories of high order, which means that these results allow for practical parsing with the Lambek calculus.
References

1. Aarts, E.: Proving theorems of the second order Lambek calculus in polynomial time. Studia Logica 53(3), 373–387 (1994)
2. Fowler, T.A.D.: Efficient parsing with the product-free Lambek calculus. In: Proceedings of the 22nd International Conference on Computational Linguistics (2008)
3. Fowler, T.A.D.: Term graphs and the NP-completeness of the product-free Lambek calculus. In: Proceedings of the 14th Conference on Formal Grammar (2009)
4. Kruijff, G.J.M., Baldridge, J.M.: Relating categorial type logics and CCG through simulation. Unpublished manuscript, University of Edinburgh (2000)
5. Lambek, J.: The mathematics of sentence structure. American Mathematical Monthly 65(3), 154–170 (1958)
6. Moot, R.: Lambek grammars, tree adjoining grammars and hyperedge replacement grammars. In: Ninth International Workshop on Tree Adjoining Grammars and Related Formalisms (2008)
7. Pentus, M.: Product-free Lambek calculus and context-free grammars. The Journal of Symbolic Logic 62(2), 648–660 (1997)
8. Savateev, Y.: Product-free Lambek calculus is NP-complete. CUNY Technical Report (September 2008)
9. Tiede, H.J.: Deductive Systems and Grammars: Proofs as Grammatical Structures. Ph.D. thesis, Indiana University (1999)
LC Graphs for the Lambek Calculus with Product

Timothy A.D. Fowler
Department of Computer Science, University of Toronto
10 King's College Rd., Toronto, ON, M5S 3G4, Canada
[email protected]

Abstract. This paper introduces a novel graph representation of proof nets for the Lambek calculus that extends the LC graph representation of [13] to include the product connective. This graph representation specifies the difference between the Lambek calculus with and without product more clearly than other proof net representations, which is important to the search for polynomial time among Lambek calculus fragments. We use LC graphs to further the efforts to characterize the boundary between polynomial time and NP-complete sequent derivability by analyzing the NP-completeness proof of [14] and discussing a sequent derivability algorithm.
1 Introduction
The Lambek calculus [11] is a categorial grammar formalism having four variants which will be considered in this paper: the Lambek calculus with product (L•), the Lambek calculus without product (L), the Lambek calculus with product allowing empty premises (L•∗) and the Lambek calculus without product allowing empty premises (L∗). These four calculi can be characterized by the inference rules shown in figure 1, where lowercase Greek letters represent categories built from a set of atoms and the three connectives /, \ and •, and uppercase Greek letters represent sequences of these categories. L•∗ uses the inference rules as they are given. L∗ is identical except that •L and •R are prohibited. L• and L differ from their counterparts that allow empty premises by prohibiting Γ from being empty in /R and \R.

Axiom: α ⊢ α

\L: from Γ ⊢ α and Δ β Θ ⊢ γ, infer Δ Γ α\β Θ ⊢ γ
\R: from α Γ ⊢ β, infer Γ ⊢ α\β
/L: from Γ ⊢ α and Δ β Θ ⊢ γ, infer Δ β/α Γ Θ ⊢ γ
/R: from Γ α ⊢ β, infer Γ ⊢ β/α
•L: from Γ α β Δ ⊢ γ, infer Γ α•β Δ ⊢ γ
•R: from Γ ⊢ α and Δ ⊢ β, infer Γ Δ ⊢ α•β

Fig. 1. Inference rules of the Lambek calculus

A wide variety of work has contributed to the search for the boundary between polynomial time and NP-completeness for sequent derivability in L and L•. In particular, [9], [6], [16], [15], [13] and [4] have contributed to the search for polynomial time by furthering research into proof nets and chart parsing algorithms for the four variants of the Lambek calculus defined here. As an alternative to this approach, [1] provides a polynomial time algorithm for L when the input is restricted to categories of order less than two. In a similar vein, [2] and [7] provide polynomial time algorithms for the non-associative variants of L and L•. In contrast to this work, [14] proved that sequent derivability in both L• and L•∗ is NP-complete. However, because the necessity of product for modelling natural language has not been firmly established, the computational complexity of sequent derivability in both L and L∗ remains an important open problem.

This paper will continue this research with the intent of discovering the precise computational differences between L and L• and with an eye towards solving the problem of sequent derivability in L. We will introduce a graph formalism for representing proofs in L• and use it to analyze the NP-completeness proof of [14]. An intuitive graphical presentation is made of Pentus' proof, and we also discuss the possibility of transforming that proof into an NP-completeness proof for L. Then, in the conclusion, we will discuss the use of this graph formalism as the basis of a chart parsing algorithm.

Beyond purely theoretical interest, the Lambek calculus can be motivated by the practical success of Combinatory Categorial Grammar (CCG) [18,5,3]. However, despite the similarities between the approaches of the Lambek calculus and CCG, it is well known that CCG recognizes languages that are super-context-free [10], whereas the Lambek calculus recognizes only the context-free languages. This could be seen as problematic, since it is also well known that there are natural languages which are not context-free [17]. Despite this, the Lambek calculus is interesting because it is weakly equivalent to context-free grammars, which are widely used in the computational linguistics community for practical parsing, but not strongly equivalent [19], allowing for valuable comparisons. Furthermore, the Lambek calculus has been the basis of more complex systems that recognize languages that are not context-free [12], and any investigation into these more complex systems must begin with the Lambek calculus.

The proofs in this paper are kept at a high level. More detail can be found in [8].
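Read backwards, the rules of figure 1 yield a naive decision procedure for sequent derivability: every premise has strictly fewer connectives than its conclusion, so the search terminates. The following sketch is ours and is exponential-time; it is meant only to make the rules concrete, not to contradict the complexity results discussed in this paper.

type cat =
  | At of string
  | Over of cat * cat   (* Over (b, a) encodes b/a *)
  | Under of cat * cat  (* Under (a, b) encodes a\b *)
  | Prod of cat * cat   (* Prod (a, b) encodes a•b *)

(* all (prefix, suffix) decompositions of a list *)
let rec splits = function
  | [] -> [ ([], []) ]
  | x :: xs -> ([], x :: xs) :: List.map (fun (p, s) -> (x :: p, s)) (splits xs)

let rec derivable ~allow_empty gamma c =
  let ok g = allow_empty || g <> [] in
  gamma = [ c ]                                        (* axiom *)
  || (match c with                                     (* right rules *)
      | Under (a, b) -> ok gamma && derivable ~allow_empty (a :: gamma) b
      | Over (b, a) -> ok gamma && derivable ~allow_empty (gamma @ [ a ]) b
      | Prod (a, b) ->
          List.exists
            (fun (g1, g2) ->
              derivable ~allow_empty g1 a && derivable ~allow_empty g2 b)
            (splits gamma)
      | At _ -> false)
  || List.exists                                       (* left rules *)
       (fun (pre, rest) ->
         match rest with
         | Under (a, b) :: post ->
             (* pre = Δ Γ; pick the split Δ / Γ *)
             List.exists
               (fun (delta, g) ->
                 derivable ~allow_empty g a
                 && derivable ~allow_empty (delta @ (b :: post)) c)
               (splits pre)
         | Over (b, a) :: post ->
             (* post = Γ Θ; pick the split Γ / Θ *)
             List.exists
               (fun (g, theta) ->
                 derivable ~allow_empty g a
                 && derivable ~allow_empty (pre @ (b :: theta)) c)
               (splits post)
         | Prod (a, b) :: post ->
             derivable ~allow_empty (pre @ (a :: b :: post)) c
         | _ -> false)
       (splits gamma)

let () =
  assert (derivable ~allow_empty:false [ Over (At "S", At "NP"); At "NP" ] (At "S"));
  (* ⊢ S/S needs an empty premise: derivable in L•∗ but not in L• *)
  assert (derivable ~allow_empty:true [] (Over (At "S", At "S")));
  assert (not (derivable ~allow_empty:false [] (Over (At "S", At "S"))))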
2 Proof Nets and Graph Representations
Exploring the sequent derivability problem in the Lambek calculus via the inference rules shown in figure 1 has proven to be quite cumbersome, and as a result most work in this area is done via proof nets.
Proof nets, as originally introduced by [9], are an extra-logical proof system which eliminates spurious ambiguity. A proof structure consists of a deterministic proof frame and a non-deterministic axiomatic linkage. First, all formulae in the sequent are assigned a polarity. Formulae in the antecedent are assigned negative polarity and the formula in the succedent is assigned positive polarity. The proof frame is a proof-like structure built on a sequent using the decomposition rules shown in figure 2.

(α\β)⁻ unfolds to α⁺ and β⁻ (a ⊗-rule)
(α/β)⁻ unfolds to α⁻ and β⁺ (a ⊗-rule)
(α•β)⁻ unfolds to α⁻ and β⁻ (a ℘-rule)
(α\β)⁺ unfolds to α⁻ and β⁺ (a ℘-rule)
(α/β)⁺ unfolds to β⁻ and α⁺ (a ℘-rule)
(α•β)⁺ unfolds to α⁺ and β⁺ (a ⊗-rule)

Fig. 2. Proof frame rules
Each connective-polarity pair has a unique rule, which gives us a unique proof frame for a given sequent. The top of the proof frame consists of atoms with polarities, which are called the axiomatic formulae. An axiomatic linkage is a bijection that matches axiomatic formulae with the same atom but opposite polarities. See figure 4 for an example of a proof structure for a sequent. Some proof structures correspond to proofs in the Lambek calculus, and those which do are called proof nets. It should be noted that all proof nets for the Lambek calculus require a planar axiomatic linkage.
A variety of methods have been introduced to determine whether a proof structure is a proof net, all of which are based on graphs. These methods fall into two major categories, described in sections 2.1 and 2.2. The primary difference between these two traditions is the fact that the Girard-style conditions can also be used for non-intuitionistic variants of the Lambek calculus, whereas the Roorda-style conditions take advantage of the intuitionistic nature of the Lambek calculus.
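To make the combinatorial search space concrete, here is a minimal sketch, not from the paper: a Python enumeration of the planar axiomatic linkages of a sequence of polarized atoms, where planarity is read as well-nestedness of the matching; the list representation and the toy example are our own assumptions.

```python
# A minimal sketch (representation ours): enumerate the planar axiomatic
# linkages of the axiomatic formulae at the top of a proof frame.
# Atoms are (name, polarity) pairs in left-to-right frame order; a
# linkage pairs positive and negative occurrences of the same atom.
def planar_linkages(atoms):
    def match(lo, hi):
        # enumerate planar matchings of the span atoms[lo:hi]
        if lo >= hi:
            yield frozenset()
            return
        name, pol = atoms[lo]
        # link position lo to some k; well-nestedness forces the region
        # strictly between lo and k, and the region after k, to be
        # matched internally, so the inner region must have even length
        for k in range(lo + 1, hi):
            kname, kpol = atoms[k]
            if kname == name and kpol != pol and (k - lo) % 2 == 1:
                for inner in match(lo + 1, k):
                    for outer in match(k + 1, hi):
                        yield inner | outer | {(lo, k)}
    yield from match(0, len(atoms))

# toy atom sequence a-, a+, b-, b+: exactly one planar linkage,
# pairing positions 0-1 and 2-3
print(list(planar_linkages([('a', '-'), ('a', '+'), ('b', '-'), ('b', '+')])))
```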
2.1 Girard Style Correctness Conditions
Presentations in this style are based on the original correctness conditions given in [9]. This style is characterized by building graphs from the proof frame, distinguishing rules only according to whether they are ⊗-rules or ℘-rules. Work in this style includes the graphs of [6], the R&B graphs of [15], the quantum graphs of [16] and the switch graphs of [4]. [6] were the first to formulate graph representations of proof nets in the Girard style. The DR-graph of a proof structure is obtained by translating each formula in the proof structure into a vertex and then inserting edges between each parent-child pair in the proof frame and between each pair of axiomatic formulae in the axiomatic linkage.
Then, a switching of the DR-graph is obtained by finding the set of all ℘-rules in the proof frame and, for each such rule, deleting exactly one of the two edges between its conclusion and its two premises. [6] proved that a proof structure is a proof net if and only if every switching is acyclic.
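For concreteness, the criterion can be checked brute-force, as in the following sketch (ours, and exponential in the number of ℘-rules, unlike the polynomial tests developed in the literature cited above); vertices are numbered 0 to n−1, and each ℘-rule contributes a pair of candidate edges of which every switching keeps exactly one.

```python
# A minimal sketch (encoding ours): the test "every switching of the
# DR-graph is acyclic", checked by enumerating all switchings.
from itertools import product

def acyclic(n, edges):
    parent = list(range(n))                 # union-find cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                    # this edge closes a cycle
        parent[ru] = rv
    return True

def is_proof_net(n, fixed_edges, par_pairs):
    """fixed_edges: edges kept in every switching; par_pairs: one pair
    of premise edges per p-rule, of which a switching keeps one."""
    for choice in product((0, 1), repeat=len(par_pairs)):
        kept = [pair[c] for pair, c in zip(par_pairs, choice)]
        if not acyclic(n, fixed_edges + kept):
            return False
    return True
```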
2.2 Roorda Style Correctness Conditions
[16] introduced a significantly different method for evaluating the correctness of proof structures. This method requires the annotation of the proof frame with lambda calculus terms, as well as the creation of a set of substitutions, as shown in square brackets in figure 3.

(α\β)⁻ : t unfolds to α⁺ : u and β⁻ : tu
(α/β)⁻ : t unfolds to α⁻ : tu and β⁺ : u
(α•β)⁻ : t unfolds to α⁻ : (t)₀ and β⁻ : (t)₁
(α\β)⁺ : v unfolds to α⁻ : u and β⁺ : v′   [v := λu.v′]
(α/β)⁺ : v unfolds to β⁻ : u and α⁺ : v′   [v := λu.v′]
(α•β)⁺ : v unfolds to α⁺ : v′ and β⁺ : v′′   [v := ⟨v′, v′′⟩]

Fig. 3. Annotation of lambda terms and substitutions
In addition to the substitutions specified above, for each pair X⁺ : α, X⁻ : Δ in the axiomatic linkage, we add a substitution of the form [α := Δ]. [16] then provides a method for determining proof structure correctness based on variable substitutions for L, L∗, L• and L•∗.
[13] introduces a graph representation of this method for L called LC graphs. An LC graph is obtained from a proof structure by taking as vertex set the set of lambda variables occurring in the proof frame. Then, for each substitution, directed edges are introduced from the lambda variable on the left of the substitution to the lambda variables on the right. [13] then gives the following correctness conditions for these LC graphs:
– I(1) There is a unique vertex s in G with in-degree 0 such that for all v ∈ V, s ⤳ v.¹
– I(2) G is acyclic.
– I(3) For every substitution of the form [v := λu.w], w ⤳ u.
– I(CT) For every substitution of the form [v := λu.w], there exists a negative vertex x in G such that v ⤳ x and there is no substitution of the form [v′ := λx.w′].
[13] proves that a sequent is derivable in L∗ iff it has an LC graph that satisfies I(1–3) and that a sequent is derivable in L iff it has an LC graph that satisfies I(1–3) and I(CT).
¹ ⤳ denotes path accessibility.
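To fix ideas, the conditions I(1)–I(3) can be checked mechanically; the following is a sketch under our own encoding (an adjacency dict for the edges and one (v, u, w) triple per substitution [v := λu.w]; treating ⤳ as reflexive is our assumption).

```python
# A minimal sketch (encoding ours): checking I(1)-I(3) on an LC graph
# over variables.
def reach(edges, src):
    """Vertices reachable from src by a nonempty path."""
    seen, stack = set(), [src]
    while stack:
        for y in edges.get(stack.pop(), ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def satisfies_I1_to_I3(vertices, edges, subs):
    indeg = {v: 0 for v in vertices}
    for x in edges:
        for y in edges[x]:
            indeg[y] += 1
    sources = [v for v in vertices if indeg[v] == 0]
    # I(1): a unique in-degree-0 vertex from which all others are reachable
    i1 = (len(sources) == 1 and
          reach(edges, sources[0]) >= set(vertices) - {sources[0]})
    # I(2): acyclicity -- no vertex can reach itself
    i2 = all(v not in reach(edges, v) for v in vertices)
    # I(3): for each substitution [v := \u.w], w ~> u
    i3 = all(u == w or u in reach(edges, w) for (v, u, w) in subs)
    return i1 and i2 and i3
```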
Fig. 4. A proof structure for (A/A) ⊢ ((A/(A•A))•A) with annotations. The antecedent category (A/A)⁻ : a unfolds (⊗) to A⁻ : ab and A⁺ : b. The succedent category ((A/(A•A))•A)⁺ : c unfolds (⊗) to (A/(A•A))⁺ : e and A⁺ : d; (A/(A•A))⁺ : e unfolds (℘) to (A•A)⁻ : f and A⁺ : g; and (A•A)⁻ : f unfolds (℘) to A⁻ : (f)₀ and A⁻ : (f)₁.
Substitutions := [c := ⟨e, d⟩], [e := λf.g], [g := ab], [b := (f)₁], [d := (f)₀]
2.3 Evaluation of Girard and Roorda Style Correctness Conditions
Our goal of investigating the boundary of tractability for the variants of the Lambek calculus requires an investigation into the computational differences between L and L•, so we must evaluate the two proof net styles. The Girard style conditions have the advantage that they have been defined for both L and L•, but the significant disadvantage that, by ignoring the differences among ⊗-rules and among ℘-rules, removing product does not simplify these conditions. On the other hand, the Roorda style conditions do become simpler with the removal of product, given that projections and pairings are removed. However, no graph formalism had been introduced for L• in this style until now.
3 LC Graphs for L• and L•∗
We will construct our LC graphs for sequents with products in exactly the same way as for those without products, with the obvious difference that we will have the two • rules and the substitutions associated with the positive • rule. It turns out that this is all that is necessary to accommodate both L• and L•∗. Then, we add the following correctness condition.
– I(4) For every substitution of the form [v := λu.v′] and for every x ∈ V, either every path from x to u contains v or v′ ⤳ x.
We can prove that these correctness conditions are sound and complete relative to the correctness conditions for variable substitutions in [16] in a very similar way to the proofs for LC graphs in [13]. Most proofs follow from the close mirroring between the correctness conditions for LC graphs and the correctness conditions for variable substitutions. To prove that I(1) is necessary requires an application of structural induction and some facts about projections in the lambda calculus. Details of these proofs can be found in [8].
It is important to notice that the only difference between LC graphs for L and those for L• is a single correctness condition. This simple difference does not appear in the treatment of proof nets in [16].
Fig. 5. The LC graph for the proof structure in figure 4: its vertices are the lambda variables a, b, c, d, e, f, g, and its edges, read off the substitutions, are c→e, c→d, e→f, e→g, g→a, g→b, b→f and d→f.
As a result, we now have a new tool for examining how different the two calculi are in terms of their parsing complexity. Figure 4 shows a proof structure for a sequent which is potentially derivable in L•. Figure 5 shows the corresponding LC graph. The path from d to f violates I(4), so this proof structure does not qualify as a proof net.
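Condition I(4) is likewise a reachability condition and can be tested directly; the following sketch (ours, reusing the reach helper above and again treating ⤳ as reflexive) returns False on the graph of figure 5, detecting the I(4) violation.

```python
# A minimal sketch (encoding ours): checking I(4).  lam_subs holds one
# (v, u, vp) triple per substitution [v := \u.v'].
def path_avoiding(edges, src, dst, banned):
    """Is there a path from src to dst none of whose vertices is banned?"""
    if src == banned or dst == banned:
        return False
    seen, stack = {src}, [src]
    while stack:
        x = stack.pop()
        if x == dst:
            return True
        for y in edges.get(x, ()):
            if y != banned and y not in seen:
                seen.add(y)
                stack.append(y)
    return False

def satisfies_I4(vertices, edges, lam_subs):
    for (v, u, vp) in lam_subs:
        for x in vertices:
            # "every path from x to u contains v" fails iff some path
            # from x to u avoids v; in that case we need v' ~> x
            if path_avoiding(edges, x, u, v) and \
               not (vp == x or x in reach(edges, vp)):
                return False
    return True

# the LC graph of figure 5, edges read off the substitutions of figure 4
g = {'c': ['e', 'd'], 'e': ['f', 'g'], 'g': ['a', 'b'],
     'b': ['f'], 'd': ['f']}
print(satisfies_I4('abcdefg', g, [('e', 'f', 'g')]))   # False
```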
4 LC Graphs over Atoms
In this section, we introduce a novel version of LC graphs that uses the atom occurrences across the top of the proof frame as vertices, rather than the lambda variables. This new representation is a step towards a chart parsing algorithm based on LC graphs, because by discarding the lambda terms we obtain a closer connection between the axiomatic linkages and the LC graphs. The structure of these new LC graphs closely mirrors the structure of those in the preceding section.
Definition 1. An LC graph over atoms for a sequent is a directed graph whose vertices are category occurrences and whose edges are introduced in four groups.
We proceed with a deterministic step first and a non-deterministic step second. First, we assign polarities to category occurrences by assigning negative polarity to occurrences in the antecedent and positive polarity to the succedent. Then, the first two groups of edges are introduced by decomposing the category occurrences via the following vertex rewrite rules (→ and ← mark regular edges, ⇢ and ⇠ mark Lambek edges, rendered as dashed edges in the figures):

(1) (α/β)⁻ ⇒ α⁻ → β⁺
(2) (α/β)⁺ ⇒ β⁻ ⇠ α⁺
(3) (β\α)⁻ ⇒ β⁺ ← α⁻
(4) (β\α)⁺ ⇒ α⁺ ⇢ β⁻
(5) (β•α)⁻ ⇒ β⁻  α⁻
(6) (β•α)⁺ ⇒ β⁺  α⁺
Each vertex rewrite rule specifies how to rewrite a single vertex on the left side to two vertices on the right side. For rules (1–4), the neighbourhood of the vertex on the left side of each rule is assigned to α on the right side. For rules (5–6), the neighbourhood of the vertex on the left side is copied to both α and β on the right side.
The dashed edges (⇢, ⇠) are referred to as Lambek edges and the non-dashed edges as regular edges. These two groups of edges will be referred to as rewrite edges. After decomposition via the rewrite rules, we have an ordered set of polarized vertices, with some edges between them. We say that a vertex belongs to a category occurrence in the sequent if there is a chain of rewrites going back from the one that introduced this vertex to the one that rewrote the category occurrence.
A third group of edges is introduced such that there is one Lambek edge from each vertex with in-degree 0 in the succedent to each vertex with in-degree 0 in each of the antecedent category occurrences. These edges are referred to as rooted Lambek edges. This completes the deterministic portion of term graph formation. An axiomatic linkage is defined exactly as in the preceding section. The fourth group of edges is introduced as regular edges from the positive vertices to the negative vertices they are linked to.
This definition of LC graphs is quite similar to the definition based on lambda variables, which we will now refer to as LC graphs over variables. An example of an LC graph over variables is shown in figure 6, with the equivalent LC graph over atoms shown in figure 7. We define the following correctness conditions on an LC graph over atoms G:
– I′(1) G is acyclic.
– I′(2) For each negative vertex t in G and each vertex x in G, either there is no regular path from x to t, or for some Lambek edge ⟨s, t⟩ there is a regular path from s to x or a regular path from x to s.
– I′(CT) For each positive vertex s in G that is the source of a Lambek edge, there is a regular path from s to some vertex x such that either x has an in-edge that is a rooted Lambek edge, or x has an in-edge that is a Lambek edge ⟨s′, x⟩ such that there is no regular path from s to s′.
These conditions correspond to sequent derivability in the Lambek calculus with product in the following way:
Definition 2. An LC graph over atoms is L•∗-integral if it satisfies I′(1) and I′(2). An LC graph over atoms is integral if it is L•∗-integral and also satisfies I′(CT).
Theorem 1. A sequent is derivable in L•∗ iff it has an L•∗-integral LC graph over atoms.
Proof. We prove this result by a mapping between the variables in an LC graph over variables and the atoms in an LC graph over atoms, with a number of additional considerations. The primary mapping maps an atom to the leftmost variable appearing in its lambda term label. This has the effect that negative vertices in an LC graph over variables are reoriented to have regular out-edges to their positive siblings in an LC graph over atoms. There are four additional structural differences:
(1) Positive occurrences of products. In an LC graph over variables, there are a number of lambda variables that do not appear as the leftmost variable for an axiomatic formula, which means that they have no corresponding atom in an LC graph over atoms. However, these variables serve only to allow branches in the paths of an LC graph over variables, which is encoded similarly in an LC graph over atoms. We then need only ensure that for conditions I(3), I(4) and I(CT) this internal structure is preserved, which is done by the rewrite Lambek edges.
(2) Negative occurrences of products. Negative occurrences of products in a proof structure introduce projections and duplicate terms, which has the effect of having multiple atoms labelled by terms with the same leftmost variable. In an LC graph over atoms, these are necessarily represented by different atoms. To ensure that the two types of graphs are structurally equivalent, we duplicate the regular out-edges during LC graph over atoms construction and do the same for Lambek in-edges, ensuring that each of the different atoms in an LC graph over atoms requires the same paths as in an LC graph over variables.
(3) Rewrite Lambek edges. In an LC graph over variables, conditions I(3) and I(4) specify conditions on the substitutions from the proof frame, specifying paths between certain variables in those substitutions. Rather than maintaining these substitutions, LC graphs over atoms simply make these conditions explicit in the form of Lambek edges, requiring Lambek edges to have accompanying regular paths.
(4) Rooted Lambek edges. Since I(1), I(3) and I(4) all specify the existence of certain paths in an LC graph over variables, rooted Lambek edges allow us to combine these three conditions in an LC graph over atoms. This is done by inserting Lambek edges between each source in the succedent category's LC graph and each source in each antecedent category's LC graph. Then, satisfying I′(2) is precisely equivalent to satisfying I(1), I(3) and I(4).
Given these structural correspondences, I′(1) and I′(2) are simply the natural translations of I(1–4).
Theorem 2. A sequent is derivable in L• iff it has an integral LC graph over atoms.
Proof. Given the structural correspondences in the preceding theorem, I′(CT) is the straightforward translation of I(CT) into the language of LC graphs over atoms.
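For concreteness, I′(1) and I′(2) can be checked by reachability over the two edge sorts; the following sketch uses our own encoding (regular edges as an adjacency dict, Lambek edges as a list of pairs, and the reach helper defined earlier), and I′(CT) can be checked analogously.

```python
# A minimal sketch (encoding ours): checking I'(1) and I'(2) on an LC
# graph over atoms.  radj: regular-edge adjacency; lam: Lambek edges
# (rewrite and rooted alike) as (s, t) pairs; neg: negative vertices.
def L_star_integral(vertices, radj, lam, neg):
    # I'(1): the graph with both edge sorts is acyclic
    adj = {v: list(radj.get(v, ())) for v in vertices}
    for s, t in lam:
        adj.setdefault(s, []).append(t)
    if any(v in reach(adj, v) for v in vertices):
        return False
    # I'(2): a regular path from x into a negative vertex t must be
    # licensed by some Lambek edge (s, t) with a regular path between
    # s and x, in either direction
    for t in neg:
        lam_srcs = [s for (s, tt) in lam if tt == t]
        for x in vertices:
            if t in reach(radj, x) and not any(
                    s == x or s in reach(radj, x) or x in reach(radj, s)
                    for s in lam_srcs):
                return False
    return True
```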
5 The NP-Completeness Proof
[14] showed that sequent derivability in both L• and L•∗ is NP-complete, and our purpose in this section is to analyze that proof using LC graphs, to determine whether it can be adapted to an NP-completeness proof for derivability in L and L∗.
Pentus' proof is via a reduction from SAT. Given a SAT instance c₁ ∧ … ∧ cₘ over variables x₁, …, xₙ, [14] introduced the following categories for t ∈ {0, 1}, 1 ≤ i ≤ n and 0 ≤ j ≤ m, where ¬₁v is a shorthand for v and ¬₀v is a shorthand for ¬v.

E_i^0(t) = p_{i-1}^0 \ p_i^0
E_i^j(t) = (p_{i-1}^j \ E_i^{j-1}(t)) • p_i^j    if ¬_t x_i ∈ c_j
E_i^j(t) = p_{i-1}^j \ (E_i^{j-1}(t) • p_i^j)    otherwise
G^0 = p_0^0 \ p_n^0
G^j = (p_0^j \ G^{j-1}) • p_n^j
H_i^0 = p_{i-1}^0 \ p_i^0
H_i^j = p_{i-1}^j \ (H_i^{j-1} • p_i^j)
F_i = (E_i^m(1) / H_i^m) • H_i^m • (H_i^m \ E_i^m(0))

These categories are then used to construct the sequent F₁, …, Fₙ ⊢ G^m. [14] then proved that F₁, …, Fₙ ⊢ G^m is derivable in L• if and only if E₁(t₁), …, Eₙ(tₙ) ⊢ G^m is derivable in L• for some truth assignment t₁, …, tₙ ∈ {0, 1} (writing E_i(t_i) for E_i^m(t_i)).
We now want to consider all possible LC graphs for E₁(t₁), …, Eₙ(tₙ) ⊢ G^m. Because each atom occurs exactly once with positive polarity and once with negative polarity in E₁(t₁), …, Eₙ(tₙ) ⊢ G^m, there is exactly one possibly integral term graph, and the axiomatic linkage for that term graph is planar. Given the similarity of these sequents, we can depict the LC graph over variables for an arbitrary truth assignment t₁, …, tₙ as in figure 6, given the appropriate variable assignments in the proof frame. For the precise details behind the variables in the LC graphs for these sequents, see [8]. In a similar way, the LC graph over atoms can be depicted as in figure 7. The LC graph is independent of t₁, …, tₙ except for an m by n chart of edges (shown as finely dashed edges in the figures). Then, in both figures 6 and 7, the edge in column j, row i is not present if and only if ¬_{t_i} x_i appears in c_j.
Consider the LC graph over variables in figure 6. It is not difficult to see that no proof structure for this sequent can ever violate I(1), I(2) or I(3) by checking its LC graph.
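To make the reduction concrete, here is a sketch (ours, not Pentus's notation) that builds these categories for a given SAT instance; categories are nested tuples, clauses are sets of literals with (i, True) for x_i and (i, False) for ¬x_i, and the bracketing of the triple product in F_i is chosen arbitrarily here.

```python
# A minimal sketch (encoding ours) of the categories in Pentus's
# reduction.  A category is ('atom', name) or (left, op, right) with
# op one of '\\', '/', '*'.
def p(i, j):                       # the atom p_i^j
    return ('atom', 'p_%d^%d' % (i, j))

def E(i, j, t, clauses):
    if j == 0:
        return (p(i - 1, 0), '\\', p(i, 0))
    prev = E(i, j - 1, t, clauses)
    if (i, t == 1) in clauses[j - 1]:      # the literal -_t x_i is in c_j
        return ((p(i - 1, j), '\\', prev), '*', p(i, j))
    return (p(i - 1, j), '\\', (prev, '*', p(i, j)))

def G(j, n):
    if j == 0:
        return (p(0, 0), '\\', p(n, 0))
    return ((p(0, j), '\\', G(j - 1, n)), '*', p(n, j))

def H(i, j):
    if j == 0:
        return (p(i - 1, 0), '\\', p(i, 0))
    return (p(i - 1, j), '\\', (H(i, j - 1), '*', p(i, j)))

def F(i, m, clauses):
    Hm = H(i, m)
    return (((E(i, m, 1, clauses), '/', Hm), '*', Hm), '*',
            (Hm, '\\', E(i, m, 0, clauses)))

# example: the instance (x1 or ~x2) and (x2), i.e. n = 2, m = 2
clauses = [{(1, True), (2, False)}, {(2, True)}]
antecedent = [F(i, 2, clauses) for i in (1, 2)]   # F_1, F_2 |- G^2
succedent = G(2, 2)
```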
Fig. 6. LC graph of E₁(t₁), …, Eₙ(tₙ) ⊢ G
Fig. 7. LC graph over atoms of E₁(t₁), …, Eₙ(tₙ) ⊢ G
Since ¬_{t_i} x_i is present in c_j if and only if the presence of that variable causes c_j to be true for truth assignment t₁, …, tₙ, all of the edges in a column are present if and only if c_j is necessarily false under t₁, …, tₙ. As can be seen, this occurs if and only if an I(4) violation is caused by the path from b^j to d^j. With this result, we can see that not only is I(4) an important part of Pentus' NP-completeness proof, but it is the only correctness condition with any influence on the derivability of F₁, …, Fₙ ⊢ G^m and consequently on the satisfiability of c₁ ∧ … ∧ cₘ.
Now, consider the LC graph over atoms in figure 7. The structure is quite similar to the LC graph over variables. In particular, the SAT instance is satisfiable if and only if, for each 1 ≤ i ≤ m, there is no regular path from p_n^i to p_0^i. But I′(2) is satisfied if and only if no such path exists, due to the location of the Lambek edges.
With this presentation of LC graphs over atoms, we can see that the important structure that cannot be represented in LC graphs for L is the negative atoms with multiple Lambek in-edges and multiple regular in-edges. This is due to the fact that, without the copying effect of the vertex rewriting rules for product, multiple in-edges of each type are impossible. To adapt this proof to L, some way will need to be found to emulate this structure in LC graphs for L.
6 Conclusion
Having introduced LC graphs over variables for L• and L•∗, we compared them with LC graphs for L and found that the difference is only a single path condition on certain vertices in the graph. Furthermore, by applying this observation to the NP-completeness proof of [14], we can see that this path condition is absolutely essential to that proof. This gives us a graphical insight into the precise differences between L and L•. Furthermore, we have gained insight into precisely the kinds of structures in the LC graphs for L• that we will need to construct in the LC graphs for L if we are to prove the NP-completeness of L in a similar way.
In addition to extending the LC graphs of [13], we also introduced a simplified variant which we called LC graphs over atoms. LC graphs over atoms allow a closer tie between the axiomatic linkage and the LC graph, which allows us to define a very natural chart parsing algorithm using this representation. Such an algorithm would incrementally build axiomatic linkages, inserting partially completed LC graphs into the chart. In addition, each axiomatic link corresponds to exactly one edge in an LC graph, and that edge is the only unconstrained part of the neighbourhoods of its endpoints. Thus, we can contract the paths around the vertices to allow for a more compact representation. To manipulate this more compact representation, we would need to define incremental versions of our correctness conditions. Further exploration of this algorithm is likely to give us insight into the precise boundary between polynomial time and NP-completeness for the variants of the Lambek calculus. This boundary concerns not only the variants considered here, but also restrictions of L and L• such as those with bounded order, as in [1].
References

1. Aarts, E.: Proving theorems of the second order Lambek calculus in polynomial time. Studia Logica 53(3), 373–387 (1994)
2. Aarts, E., Trautwein, K.: Non-associative Lambek categorial grammar in polynomial time. Mathematical Logic Quarterly 41(4), 476–484 (1995)
3. Bos, J., Markert, K.: Recognising textual entailment with logical inference. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 628–635 (2005)
4. Carpenter, B., Morrill, G.: Switch graphs for parsing type logical grammars. In: Proceedings of IWPT 2005, Vancouver (2005)
5. Clark, S., Curran, J.R.: Parsing the WSJ using CCG and log-linear models. In: Proceedings of the 42nd Meeting of the ACL, pp. 104–111 (2004)
6. Danos, V., Regnier, L.: The structure of multiplicatives. Archive for Mathematical Logic 28(3), 181–203 (1989)
7. de Groote, P.: The non-associative Lambek calculus with product in polynomial time. In: Murray, N.V. (ed.) TABLEAUX 1999. LNCS (LNAI), vol. 1617, pp. 128–139. Springer, Heidelberg (1999)
8. Fowler, T.A.D.: A Graph Formalism for Proofs in the Lambek Calculus with Product. Master's thesis, University of Toronto (2006)
9. Girard, J.Y.: Linear logic. Theoretical Computer Science 50(1), 1–102 (1987)
10. Joshi, A.K., Vijay-Shanker, K., Weir, D.: The convergence of mildly context-sensitive grammar formalisms. In: Foundational Issues in Natural Language Processing, pp. 31–81 (1991)
11. Lambek, J.: The mathematics of sentence structure. American Mathematical Monthly 65(3), 154–170 (1958)
12. Moot, R., Puite, Q.: Proof nets for the multimodal Lambek calculus. Studia Logica 71(3), 415–442 (2002)
13. Penn, G.: A graph-theoretic approach to sequent derivability in the Lambek calculus. Electronic Notes in Theoretical Computer Science 53, 274–295 (2004)
14. Pentus, M.: Lambek calculus is NP-complete. Theoretical Computer Science 357(1–3), 186–201 (2006)
15. Retore, C.: Perfect matchings and series-parallel graphs: multiplicatives proof nets as R&B-graphs. Electronic Notes in Theoretical Computer Science 3, 167–182 (1996)
16. Roorda, D.: Resource Logics: Proof-Theoretical Investigations. PhD thesis, Universiteit van Amsterdam (1991)
17. Shieber, S.M.: Evidence against the context-freeness of natural language. Linguistics and Philosophy 8(3), 333–343 (1985)
18. Steedman, M.: The Syntactic Process. MIT Press, Cambridge (2000)
19. Tiede, H.J.: Deductive Systems and Grammars: Proofs as Grammatical Structures. PhD thesis, Indiana University (1999)
Proof-Theoretic Semantics for a Natural Language Fragment

Nissim Francez¹ and Roy Dyckhoff²

¹ Computer Science Dept., Technion-IIT, Haifa, Israel
[email protected]
² School of Computer Science, University of St Andrews, Scotland, UK
[email protected]

1 Introduction
We propose a Proof-Theoretic Semantics (PTS) for a (positive) fragment E0+ of Natural Language (NL) (English in this case). The semantics is intended [7] to be incorporated into actual grammars, within the framework of Type-Logical Grammar (TLG) [12]. Thereby, this semantics constitutes an alternative to the traditional model-theoretic semantics (MTS), originating in Montague's seminal work [11], used in TLG. We would like to stress at the outset that this paper is mainly intended to set the stage for the proposed approach, focusing on properties of the system of rules itself. Subsequent work (references provided) focuses on further properties of the semantics itself and expands in depth on several semantic issues. There is no claim that the current paper solves any open semantic problems in MTS. By providing in detail an alternative approach, we prepare the ground for further research that might settle the rivalry between the two approaches. The essence of our proposal is:
– For meanings of sentences, replace truth conditions (in arbitrary models) by canonical derivability conditions (from suitable assumptions). In particular, this involves a "dedicated" proof-system (in natural deduction form), based on which the derivability conditions are defined. In a sense, the proof system should reflect the "use" of the sentences in the fragment, and should allow one to recover pre-theoretic properties of the meanings of these sentences such as entailment and assertability conditions. The system should be harmonious, in that its rules have a certain balance between introduction and elimination, in order to qualify as meaning conferring. Two notions of harmony are shown to be satisfied by the proposed rules (see Section 4).
– For subsentential phrases, down to lexical words, replace their denotations (in arbitrary models) as conferring meaning, by their contributions to the meanings (derivability conditions) of sentences in which they occur. This adheres to Frege's context principle. As mentioned, this is reported elsewhere ([7]).
The following quotation from [22] (p. 525) emphasizes this lack of applicability to NL, the original reason for considering PTS in the first place:

Although the "meaning as use" approach has been quite prominent for half a century now and provided one of the cornerstones of philosophy of language, in particular of ordinary language philosophy, it has never become prevailing in the formal semantics of artificial and natural languages. In formal semantics, the denotational approach which starts with interpretations of singular terms and predicates, then fixes the meaning of sentences in terms of truth conditions, and finally defines logical consequence as truth preservation under all interpretations, has always dominated.

The main motivation for pursuing PTS originates from the criticism, by several philosophers of language and logicians, notably Dummett (e.g., [3]) and Brandom (e.g., [2]), about the adequacy of MTS as a theory of meaning. The most famous criticism is Dummett's manifestation argument, regarding grasping the meaning of a sentence as involving the ability (at least in principle) to verify it, as a condition for its assertability. If, as MTS maintains, meanings are truth-conditions (in arbitrary models), manifestation would require deciding whether the truth-conditions obtain in a given arbitrary model. This is not the case even for the simplest sentences, involving only predication, as set membership is not decidable¹ in general. In devising a PTS and incorporating it into the grammar of NL, we are not necessarily committing ourselves to the accompanying philosophical positions, such as anti-realism. Some of these philosophical principles have been put under scrutiny. Rather, our point of departure is computational linguistics, with its stress on the effectiveness of its methods and theories.
There are several differences in the way the PTS is conceived, owing to the differences between E0+ and the traditional formal calculi for which ND-systems were proposed in logic as a basis for PTS.
– Logical calculi are recursive, in that each operator (connective, quantifier) is applied to (one, two or more) formulas of the calculus, to yield another formula. Thus, there is a natural notion of the main operator, introduced into/eliminated from a formula. In E0+ there is no such notion of a main operator. In this sense, all E0+ sentences are atomic (in not having a sentence as a constituent).
– Formal calculi are semantically unambiguous, while E0+ (and NL in general) is semantically ambiguous. In a PTS, the semantic ambiguity manifests itself via different derivations (from the same assumptions). This will be exemplified below by showing how traditional quantifier scope ambiguity manifests itself (see Section 3.1).
– Formal logical calculi usually have (formal) theorems, having a proof, i.e. a (closed) derivation from no open assumptions. In natural language (and in particular in the fragment we consider here) there are hardly any formal theorems.

¹ There is no precise statement by Dummett as to what is taken as "decidable". It is plausible, at least in a computational linguistics context, to identify this notion with effectiveness (i.e., algorithmic decidability).
Typically, sentences are contingent and their derivations rely on open assumptions (but see Section 4.2). This difference has a direct influence on the conception of PTS-validity (of arguments, or derivations) (see [5]).
2 The Natural Deduction Proof System

2.1 The NL Core Fragment E0+
We start with the core extensional fragment E0+ of English, with sentences headed by intransitive and transitive verbs, and determiner phrases with a (count) noun² and a determiner. In addition, there is the copula. This is a typical fragment of many NLs, syntactically focusing on subcategorization, and semantically focusing on predication and quantification. Some typical sentences are listed³ below.

(1) every/some girl smiles
(2) every/some girl is a student
(3) every/some girl loves every/some boy

Note the absence of proper names, to be added later in the paper, and of negative determiners like no, not included here (hence the superscript '+' in the names of these positive fragments). Expressions such as every girl, some boy are dps (determiner-phrases). Every position that can be filled with a dp is a locus of introduction (of the quantifier corresponding to the determiner of the introduced dp). We note that this fragment contains only two determiners, 'every' and 'some', each treated in a sui generis way. In [1] we present a general treatment of determiners (and dps) in PTS, providing, for example, a proof-theoretic characterization of their monotonicity properties, and capturing proof-theoretically their conservativity, traditionally expressed in model-theoretic terms. A deeper study of negative determiners such as 'no' is also added to the fragment elsewhere.

The Extended Proof-Language L0+. The proof system N0+ is defined over a language L0+, extending E0+, schematizing over it, and disambiguating its sentences. We use X, Y, … to schematize over nouns⁴, P, Q to schematize over intransitive verbs, and R to schematize over transitive verbs. In addition, L0+ incorporates a countable set P of individual parameters, ranged over by metavariables (in boldface font) like j, k, r. Syntactically, parameters are also regarded as dps. For simplicity, we consider is a as a single lexical unit, isa. Schematic sentences containing occurrences of parameters, referred to as pseudo-sentences, have a role only in derivations within the proof system; they are artifacts of inference, not of assertion. In the sequel, unless otherwise stated, we use 'sentence' generically both for sentences and for pseudo-sentences.
² Currently, only singular (and not plural) nouns are considered.
³ Throughout, all NL expressions are displayed in the sans-serif font.
⁴ Here nouns are lexical nouns only; later in the paper the language is augmented with compound nouns, also falling under the X, Y, … schematization.
We use the meta-variable S to range over L0+ sentences. For any dp-expression D having a quantifier, the notation S[(D)ₙ] refers to a sentence S having a designated position filled by D, where n is the scope level (sl) of the quantifier in D. In case D has no quantifier (i.e., it is a parameter), sl = 0. The higher the sl, the higher the scope. For example, S[(some X)₂] has some X in the higher scope, as in (every X)₁ loves (some Y)₂, representing the object wide-scope reading of the E0+ sentence every X loves some Y. We use the conventions that, within a rule, both S[D₁] and S[D₂] refer to the same designated position in S, and when the sl can be unambiguously determined it is omitted. We use r(S) to indicate the rank of S, the highest sl on a dp within S. The notation is extended to more than one dp, where, say, S[(D₁)ᵢ, (D₂)ⱼ] indicates a sentence S with two designated positions filled, respectively, by D₁ and D₂, each with its respective sl. Pseudo-sentences are classified into two groups.
– Ground: Ground pseudo-sentences contain⁵ only parameters in every position that can be filled by a dp. Note that for a ground S, r(S) = 0.
– Non-ground: Non-ground pseudo-sentences contain a dp with a determiner in at least one such position (but not in all).
The ground pseudo-sentences play the role of atomic sentences, and their meaning is assumed given, externally to the ND proof-system. The latter defines the sentential meanings of non-ground pseudo-sentences (and, in particular, E0+ sentences), relative to the given meanings of ground pseudo-sentences.
2.2 The Natural Deduction Proof-System N0+
The presentation is in Gentzen's "logistic"-style ND, with shared contexts and single-succedent, set-antecedent sequents Γ ⊢ S, formed over contexts of L0+ sentences. We enclose discharged assumptions in (indexed) square brackets, using the index to mark the rule-application responsible for the discharge. There are introduction-rules (I-rules) and elimination-rules (E-rules) for each determiner forming a dp, the latter marked for its scope level. The usual notion of (tree-shaped) derivation is assumed. We use D for derivations, where D: Γ ⊢ S is a derivation of sentence S ∈ L0+ from context Γ. We use Γ, S for the context extending Γ with sentence S. F(Γ; j) means j is fresh for Γ. In the rule names, we abbreviate 'every' and 'some' to 'e' and 's', respectively. The meta-rules for N0+ are presented in Figure 1.
A word of explanation about the I-rules is due. The scope level r(S[j]) is the highest scope of a quantifier already present in S[j]. When a new dp is introduced into the position currently filled by j, it obtains the scope level r(S[j]) + 1, its quantifier thereby becoming the one of the highest scope in the resulting sentence. As for the E-rules, they always eliminate the quantifier of the highest scope. Note that the E-rules are of a format known as generalized elimination, relying on drawing arbitrary consequences from the major premiss. This issue is elaborated upon in Section 4.1, and in a more general setting in [6].
⁵ Note that this use of 'ground' is different from the one in logic programming, where it is used for a term without any (free) variables.
(Ax): Γ, S ⊢ S
(eIⁱ): from Γ, [j isa X]ⁱ ⊢ S[j], infer Γ ⊢ S[(every X)_{r(S[j])+1}]
(sI): from Γ ⊢ j isa X and Γ ⊢ S[j], infer Γ ⊢ S[(some X)_{r(S[j])+1}]
(eEⁱ): from Γ ⊢ S[(every X)_{r(S[j])+1}], Γ ⊢ j isa X and Γ, [S[j]]ⁱ ⊢ S′, infer Γ ⊢ S′
(sE^{i,j}): from Γ ⊢ S[(some X)_{r(S[j])+1}] and Γ, [j isa X]ʲ, [S[j]]ⁱ ⊢ S′, infer Γ ⊢ S′

where F(Γ, S[every X]; j) in (eI), and F(Γ, S[some X], S′; j) for (sE).

Fig. 1. The meta-rules for N0+
A convenient derived E-rule, which shortens derivations, is (êE): from Γ ⊢ S[(every X)_{r(S[j])+1}] and Γ ⊢ j isa X, infer Γ ⊢ S[j] (its derivability is shown in the full paper).
Lemma (weakening⁶): If Γ ⊢ S, then Γ, Γ′ ⊢ S. The need for a contraction-rule and the justification of the freshness condition are in the full paper. Below is an example derivation establishing

some U isa X, (every X)₂ R (some Y)₁, every Y isa Z ⊢ (some U)₁ R (some Z)₂
Let D₁ and D₂ be the following two sub-derivations, of (some U)₂ R (some Y)₁ and of (some U)₁ R (some Z)₂ respectively.

D₁: from [r isa X]² and (every X)₂ R (some Y)₁, by (êE), r R some Y; with [r isa U]¹, by (sI), (some U)₂ R (some Y)₁; then, from some U isa X, by (sE^{1,2}), (some U)₂ R (some Y)₁.

D₂: from [j isa Y]⁴ and every Y isa Z, by (êE), j isa Z; with [some U R j]³, by (sI), (some U)₁ R (some Z)₂.

The whole derivation combines the two sub-derivations: from the conclusion (some U)₂ R (some Y)₁ of D₁ and from D₂, by (sE^{3,4}), we obtain (some U)₁ R (some Z)₂.
⁶ Weakening is not really needed, and is introduced here for a technical reason: it yields an easier proof of the termination of the proof-search in the sequent calculus (see below). Ultimately, there might be a need to remove it.
In PTS in logic, there is a notion of a canonical proof, namely a proof the last step of which is an application of an I-rule. In systems where proof-normalization obtains, every proof, i.e., closed derivation (with no open assumptions), can be reduced to a canonical one. Here, in our NL-PTS, we are mainly interested in open derivations, having open assumptions. We extend the notion of canonicity to open derivations, and take them (see Section 3) to contribute to sentential meanings. However, as shown in the full paper, not every open derivation can be reduced to a canonical one. We use ⊢c for canonical derivability, and D: Γ ⊢c S for a canonical derivation of S from (open) assumptions Γ. Furthermore, [[S]]ᶜ_Γ denotes the collection of all (if any) canonical derivations of S from Γ.
3 The Sentential Proof-Theoretic Meaning
In the discussions of PTS in logic, it is usually stated that 'the ND-rules determine the meanings (of the connectives/quantifiers)'. However, there is no explicit denotational meaning⁷ defined (a proof-theoretic, not model-theoretic, denotation). In other words, there is no explicit definition of the result of this determination. Thus, one cannot express claims of the form 'the meaning of S has this or that property', or generalizations about all meanings, involving quantification over meanings. In particular, if one wants to apply Frege's context principle to those PTS-meanings, and derive meanings for subsentential phrases (including lexical words) as contributions to sentential meanings, such an explication is needed (see [4] and [7]).
We take here the PTS-meaning of an E0+ sentence S, and also of an L0+ non-ground pseudo-sentence S, to be the function from contexts Γ returning the collection of all the canonical derivations in N0+ of S from Γ. For a ground L0+ pseudo-sentence S, its meaning is assumed given, and the meaning of E0+ sentences, as well as of non-ground L0+ pseudo-sentences, is defined relative to the given meanings of ground sentences. In accordance with many views in philosophy of language, every derivation in the meaning of a sentence S can be viewed as providing G[[S]], grounds for asserting S (recall that ground pseudo-sentences do not make any assertion). Semantic equivalence of sentences is based on equality of meaning (and not inter-derivability). In addition, a weaker semantic equivalence is based on equality of grounds of assertion.
Definition (PTS-meaning, grounds): For a sentence S, or a non-ground pseudo-sentence S, in L0+: [[S]]_{L0+} =df λΓ.[[S]]ᶜ_Γ and G[[S]] =df {Γ | Γ ⊢c S}, where:
– For S a sentence in E0+, Γ consists of E0+-sentences only. Parameters are not "observable" in grounds for assertion.
– For S a pseudo-sentence in L0+, Γ may also contain pseudo-sentences.
Recall that the meanings of ground pseudo-sentences are given and meaning for E0+ is defined relative to them.
⁷ Known also as the semantic value.
The main formal property of meanings (under this definition) is the overcoming (at least for the fragment considered) of the manifestation argument against MTS: asserting a sentence S is based on (algorithmically) decidable grounds (see Section 4.3). A speaker in possession of Γ can decide whether Γ ⊢c S. Clearly, further research is needed to determine the limits (in terms of the size of the fragment captured) of this effective approach to meaning.
3.1 Interlude: Semantic Ambiguity
Consider one of the well-known features of E0+: quantifier scope ambiguity. The following E0+ sentence is usually attributed two readings, with the following FOL expressions of their respective truth-conditions in model-theoretic semantics.

(4) Every girl loves some boy
Subject wide-scope (sws): ∀x.girl(x) → ∃y.boy(y) ∧ love(x, y)
Subject narrow-scope (sns): ∃y.boy(y) ∧ ∀x.girl(x) → love(x, y)

In our PTS, the difference in meanings reflects itself in different derivations, differing in the order of introduction of the subject and object dps.

Subject wide-scope (sws): from [r isa girl]ⁱ, a derivation D₁ of r loves j and a derivation D₂ of j isa boy, by (sI), r loves (some boy)₁; then, by (eIⁱ), (every girl)₂ loves (some boy)₁.

Subject narrow-scope (sns): from [r isa girl]ⁱ and a derivation D₁ of r loves j, by (eIⁱ), (every girl)₁ loves j; then, with a derivation D₂ of j isa boy, by (sI), (every girl)₁ loves (some boy)₂.

Note that there is no way to introduce a dp with narrow scope where the dp with the wider scope has already been introduced. The central pre-theoretic relationship between the two readings is the entailment⁸, present here in the form

(every girl)₁ loves (some boy)₂ ⊢ (every girl)₂ loves (some boy)₁

as shown by the following derivation.
⁸ A more general treatment of truth and entailment among sentences is deferred to [1], where truth under Γ is captured as non-emptiness of the grounds for assertion (for any given Γ).
63
[(every X)1 R r]2 ˆ (eE) jRr [r isa Y ]3 (sI) j R (some Y )1 (every X)1 R (some Y )2 (sE 2,3 ) j R (some Y )1 1 (eI ) (every X)2 R (some Y )1
Of course, in the inverse entailment does not hold.
4 Properties of N0+

4.1 Harmony
The origin of PTS for logic is already in the work of Gentzen [8], who invented the natural deduction proof-system for FOL. He hinted there that the I-rules could be seen as the definition of the logical constant, while the E-rules are nothing more than consequences of this definition. This was later refined into the Inversion Principle by Prawitz ([16]), which shows how the I-rules determine the E-rules. The I-rules were taken as a determination of the meaning of the logical constant under consideration, instead of the model-theoretic interpretation, which appeals to truth in a model. However, in view of Prior's [17] attack, presenting a connective 'tonk' whose I-rule was that of disjunction while its E-rule was that of conjunction, trivializing the whole deductive theory by rendering every two propositions inter-derivable, it became apparent that not every combination of ND-rules can serve as a basis for PTS. The notion of harmony of the ND-rules [3], taken in a broad sense to express a certain balance between E-rules and I-rules (absent from the tonk rules), became a serious contender for an appropriateness condition for ND-rules to serve as a basis for a PTS. See [18,20] for a critical discussion of tonk's disharmony. We consider two harmony notions, and show that N0+ satisfies both.
– General-Elimination (GE) harmony: In order to be harmonious, an E-rule has to have a specific form, depending on the corresponding I-rules. This form is known as generalized E-rules, and was considered by [15] as having a better relationship to cut-free sequent-calculus derivations. Such an E-rule allows drawing an arbitrary conclusion, provided it is derivable from the premisses of the corresponding I-rule(s). This form guarantees that the inversion principle obtains, and leads to the availability of proof-reduction, the elimination of a detour caused by an introduction immediately followed by an elimination. This underlies proof normalization, and also constitutes a requirement of intrinsic harmony (see below). Proof-normalization (in its strong version) requires that there be no possibility of an infinite sequence of such reductions (see [21] for a general discussion of the role of normalization in PTS). In [6], we show that a rule-form generalizing a proposal by Read [18] guarantees the availability of the required reduction. All the E-rules in N0+ are of this generalized-elimination form, hence N0+ is GE-harmonious.
– Local intrinsic harmony: Here, in order to be harmonious, no constraints on the form of the E-rules are imposed, but they have to stand in a certain relationship to the I-rules, directly reflecting the required balance among them. We consider here a specific proposal by [14], based on two properties known as local soundness and local completeness.
– Local Soundness: Every introduction followed directly by an elimination can be reduced. This shows that the elimination-rules are not too strong w.r.t. the I-rules.
– Local Completeness: There is a way to eliminate⁹ and to reintroduce, recovering the original formula. This process is called expansion. This shows that the E-rules are not too weak w.r.t. the I-rules.
In the case of logic, introduction and elimination are of a top-level operator. Here, they refer to the introduction of a dp into every allowable position (and any scope level), and elimination from the same position. We show local intrinsic harmony (in the above sense) for (eI) ((sI) is similar), even though [6] shows this follows from the form of the rules. We do, however, omit showing the reductions/expansions for the extensions of the fragment presented below.

Local soundness: Consider an application of (eIⁱ) to D₁: Γ, [j isa X]ⁱ ⊢ S[j], concluding S[(every X)], immediately followed by an application of (eEʲ) with minor premises D₂: Γ ⊢ k isa X and D₃: Γ, [S[k]]ʲ ⊢ S′, concluding S′. This reduces (⇝ᵣ) to: the derivation D₁[k isa X/j isa X, k/j] of S[k], followed by D₃, concluding S′. Here D₁[k isa X/j isa X, k/j] denotes a derivation in which every instance of use of the assumption j isa X is replaced by the derivation D₂ of its variant k isa X. Since j is fresh for the assumptions on which D₁ depends, the replacement of j by k is permissible.

Local completeness: A derivation D of S[(every X)] expands (⇝ₑ) to: from D's conclusion S[(every X)], [j isa X]¹ and [S[j]]², by (eE²), S[j]; then, by (eI¹), S[(every X)].
There are also other views of harmony, e.g., based on a conservative extension of the theory of the introduced operator [3].
⁹ 'Eliminate' here means applying an E-rule, not necessarily actually eliminating the operator occurrence at hand.
4.2 Closed Derivations
A derivation of Γ ⊢ S is closed iff Γ = ∅. In logic, closed derivations are a central topic, determining the (formal) theorems of the logic. In particular, for bivalent logics, they induce the (syntactic) notions of tautology and contradiction. In L0+, in the absence of negation and negative determiners (like no), there is no natural notion of a contradiction. Furthermore, the only "positive" closed derivation in N0+ is for sentences of the form every X isa X. The closed derivation is shown by: from [j isa X]¹, which is j isa X itself, (eI¹) yields every X isa X. In particular, note that some X isa X does not hold. We refer to [5] for a discussion of the influence of the (almost) absent closed derivations on the notion of proof-theoretic validity, the PTS counterpart of the model-theoretic view of validity as truth-preservation.
4.3 Decidability of N0+ Derivability
We now attend to the decidability of derivability in N0+. It makes PTS-based meaning effective for L0+. Figure 2 displays a sequent calculus SC0+ for L0+, easily shown equivalent to N0+ (in having the same provable sequents). The rules are arranged in the usual way of L-rules (introduction in the antecedent) and R-rules (introduction in the succedent).

(ID): Γ, S ⊢ S
(Le): from Γ, j isa X, S[(every X)_{r(S[j])+1}], S[j] ⊢ S′, infer Γ, j isa X, S[(every X)_{r(S[j])+1}] ⊢ S′
(Ls): from Γ, j isa X, S[j] ⊢ S′, infer Γ, S[(some X)_{r(S[j])+1}] ⊢ S′
(Re): from Γ, j isa X ⊢ S[j], infer Γ ⊢ S[(every X)_{r(S[j])+1}]
(Rs): from Γ ⊢ j isa X and Γ ⊢ S[j], infer Γ ⊢ S[(some X)_{r(S[j])+1}]

where j is fresh in (Re) and (Ls).

Fig. 2. A sequent calculus SC0+ for L0+
The following claims are routinely established for SC0+.
– The structural rules of weakening (W) and contraction (C) are admissible.
– (Cut) is admissible.
The full paper shows the existence of a terminating proof-search procedure.
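Although the termination argument is deferred to the full paper, the shape of the decision procedure is ordinary backward proof search; the following is a schematic sketch (ours, with an abstract rule interface and a simple loop check, not the actual termination proof).

```python
# A schematic sketch (interface ours): naive backward proof search for a
# cut-free sequent calculus such as SC0+.  Each rule maps a (hashable)
# goal sequent to a list of premise-lists; an axiom instance yields the
# empty premise-list.  Goals already on the current branch are cut off,
# which suffices for termination when backward rule application cannot
# grow goals unboundedly.
def derivable(goal, rules, seen=frozenset()):
    if goal in seen:
        return False                      # loop check
    seen = seen | {goal}
    for rule in rules:
        for premises in rule(goal):
            if all(derivable(p, rules, seen) for p in premises):
                return True
    return False
```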
5 Extending the Fragment
Next, we consider some simple extensions of E0+ (and the induced extensions of L0+). The first one adds¹⁰ proper names, and the other two are related to extending the notion of noun. In E0+, we had only primitive nouns. We now consider two forms of compound noun: one formed by adding adjectives and the other by adding relative clauses. In both cases, in the corresponding extensions of N0+, we let X, Y schematize over compound nouns also in the original rules.
5.1 Adding Proper Names
We extend E0+ with proper names in dp positions. Typical sentences are:

(5) Rachel is a girl
(6) Rachel smiles
(7) Rachel loves every/some boy
(8) every boy loves Rachel

Proper names are strictly distinct from parameters in the way they function in the proof-system, as explained below. We retain the name E0+ for this (minor) extension. In L0+, let proper names be schematized by N, and add pseudo-sentences of the forms

(9) j is N, N is j
(10) j is k, N is M

Note that pseudo-sentences having a proper name in any dp-position are not ground! First, we add I-rules and E-rules for is (a disguised identity). We adopt a version of the rules in [19].

(isI¹): from Γ, [S[j]]¹ ⊢ S[k], infer Γ ⊢ j is k
(isE¹): from Γ ⊢ j is k, Γ ⊢ S[j] and Γ, [S[k]]¹ ⊢ S′, infer Γ ⊢ S′

where S does not occur in Γ. From these, we can derive rules for reflexivity (is-refl), symmetry (is-sym) and transitivity (is-tr). For shortening the presentation of derivations, combinations of these rules are still referred to as applications of (isE). Next, we incorporate I-rules and E-rules of proper names into dp-positions.

(nI): from Γ ⊢ j is N and Γ ⊢ S[j], infer Γ ⊢ S[N]
(nE^{1,2}): from Γ ⊢ S[N] and Γ, [j is N]¹, [S[j]]² ⊢ S′, infer Γ ⊢ S′, with j fresh for Γ, S′
Below are two example derivations.
Rachel isa girl, every girl smiles ⊢ Rachel smiles: Note that Rachel is not a parameter, and (êE) is not directly applicable.
¹⁰ This is different from the role of names in [13]; his names are our parameters. He has no proper names provided by the NL fragment itself.
From [r isa girl]² and every girl smiles, by (êE), r smiles; with [r is Rachel]¹, by (nI), Rachel smiles; finally, from Rachel isa girl, by (nE^{1,2}), Rachel smiles.

Rachel isa girl, Rachel smiles ⊢ some girl smiles: Again, since Rachel is not a parameter, (sI) is not directly applicable. From [r₁ is Rachel]¹ and [r₂ is Rachel]³, by (isE), r₁ is r₂; with [r₁ isa girl]², by (isE), r₂ isa girl; with [r₂ smiles]⁴, by (sI), some girl smiles. From Rachel smiles, by (nE^{3,4}), some girl smiles; finally, from Rachel isa girl, by (nE^{1,2}), some girl smiles.
The corresponding extension to the sequent calculus SC0+ consists of the following rules.

(Ln): from Γ, j is N, S[j] ⊢ S′, infer Γ, S[N] ⊢ S′
(Rn): from Γ ⊢ j is N and Γ ⊢ S[j], infer Γ ⊢ S[N]
5.2 Adding Adjectives
We augment E0+ with sentences containing adjectives, schematized by A. We consider here only what are known in MTS as intersective adjectives. Typical sentences are:

(11) Rachel is a beautiful girl/clever beautiful girl/clever beautiful red-headed girl
(12) Rachel/every girl/some girl is beautiful
(13) Rachel/every beautiful girl/some beautiful girl smiles
(14) Rachel/every beautiful girl/some beautiful girl loves Jacob/every clever boy/some clever boy

A noun preceded by an adjective is again a (compound) noun. Denote this extension by E0,adj+. Recall that, in the N0+ rules, the noun schematization should be taken over compound nouns too. Note that E0,adj+ is no longer finite, as an unbounded number of adjectives may precede a noun. We augment N0+ with the following ND-rules for adjectives.

(adjI): from Γ ⊢ j isa X and Γ ⊢ j is A, infer Γ ⊢ j isa A X
(adjE^{1,2}): from Γ ⊢ j isa A X and Γ, [j isa X]¹, [j is A]² ⊢ S′, infer Γ ⊢ S′

Let the resulting system be N0,adj+. Again, we can obtain the following derived elimination rules, used to shorten presentations of example derivations.
(adjÊ₁): from Γ ⊢ j isa A X, infer Γ ⊢ j isa X
(adjÊ₂): from Γ ⊢ j isa A X, infer Γ ⊢ j is A
Note that the intersectivity here is manifested by the rules themselves (embodying an "invisible" conjunctive operator), at the sentential level. These rules induce intersectivity as a lexical property of (some) adjectives by the way lexical meanings are extracted from sentential meanings, as shown in [7]. The following sequent, the corresponding entailment of which is often taken as the definition of intersective adjectives, is derivable in N0,adj+:

j isa A X, j isa Y ⊢ j isa A Y

as shown by: from j isa A X, by (adjÊ₂), j is A; with j isa Y, by (adjI), j isa A Y.

As an example of derivations using the rules for adjectives, consider the following derivation for

j loves every girl ⊢ j loves every beautiful girl

In MTS terminology, the corresponding entailment is a witness to the downward monotonicity of the meaning of every in its second argument. We use an obvious schematization: from [r isa A Y]¹, by (adjÊ₁), r isa Y; with j R every Y, by (êE), j R r; finally, by (eI¹), j R every A Y.

A proof-theoretic reconstruction of monotonicity is presented in [1]. Under this definition of the meaning of intersective adjectives, such adjectives are also extensional, in the sense of satisfying the following entailment: every X isa Y ⊢ every A X isa A Y, as shown by the following derivation: from [j isa A X]¹, by (adjÊ₁), j isa X; with every X isa Y, by (êE), j isa Y; from [j isa A X]¹, by (adjÊ₂), j is A; by (adjI), j isa A Y; finally, by (eI¹), every A X isa A Y.

Decidability of derivability remains intact, by adding to SC0+ the following two rules, obtaining thereby a sequent calculus SC0,adj+ for L0,adj+.

(Ladj): from Γ, j is A, j isa X ⊢ S′, infer Γ, j isa A X ⊢ S′
(Radj): from Γ ⊢ j is A and Γ ⊢ j isa X, infer Γ ⊢ j isa A X
5.3 Adding Relative Clauses
We next add to the fragment relative clauses (rcs). This fragment transcends the locality of subcategorization in E0+, in having long-distance dependencies. We refer to this (still positive) fragment as E1+. Typical sentences include the following.

(15) Jacob/every boy/some boy loves every/some girl who(m) smiles/loves every flower/Rachel loves
(16) Rachel/every girl/some girl is a girl who loves Jacob/every boy
(17) Jacob loves every girl who loves every boy who smiles (nested relative clause)

So, girl who smiles and girl who loves every boy are compound nouns. We treat somewhat loosely the case of the relative pronoun, in the form of who(m), abbreviating either who or whom, as the case requires. Note that E1+, by its nesting of rcs, expands the stock of available positions for dp-introduction/elimination. Thus, in (17), 'every boy who smiles' is the object of the relative clause modifying the object of the matrix clause. In addition, new scope relationships arise among the multitude of dps present in E1+ sentences. Island conditions, preventing some of the scopal relationships, are ignored here. The corresponding ND-system N1+ extends N0+ by adding the following I-rules and E-rules. For their formulation, we extend the distinguished-position notation with S[−], indicating that the position is unfilled. For example, loves every girl and every girl loves have their subject and object dp positions, respectively, unfilled.

(relI): from Γ ⊢ j isa X and Γ ⊢ S[j], infer Γ ⊢ j isa X who S[−]
(relE^{1,2}): from Γ ⊢ j isa X who S[−] and Γ, [j isa X]¹, [S[j]]² ⊢ S′, infer Γ ⊢ S′, with j fresh

The simplified elimination-rules are:

(relÊ₁): from Γ ⊢ j isa X who S[−], infer Γ ⊢ j isa X
(relÊ₂): from Γ ⊢ j isa X who S[−], infer Γ ⊢ S[j]
As an example of a derivation in this fragment, consider some girl who smiles sings ⊢_{N1+} some girl sings, exhibiting the model-theoretic upward monotonicity of some in its first argument. In an obvious schematization: from [r isa X who P₁]¹, by (relÊ₁), r isa X; with [r P₂]², by (sI), some X P₂; then, from some X who P₁ P₂, by (sE^{1,2}), some X P₂.
Similarly, the following witness of the downward monotonicity of 'every' (in its first argument) can be derived: every girl sings ⊢_{N1+} every girl who smiles sings. From [j isa girl who smiles]¹, by (relÊ₁), j isa girl; with every girl sings, by (êE), j sings; finally, by (eI¹), every girl who smiles sings.

Once again, decidability of derivability is shown by means of the following additional sequent-calculus rules, added to SC0+ to form SC1+.

(Lrel): from Γ, j isa X, S[j] ⊢ S′, infer Γ, j isa X who S[−] ⊢ S′
(Rrel): from Γ ⊢ j isa X and Γ ⊢ S[j], infer Γ ⊢ j isa X who S[−]
6 Conclusions
The assignment of proof-theoretic meanings to NL sentences and to subsentential phrases is, to the best of our knowledge, completely new. There is a vast literature on the use of proof theory in deriving meanings; however, the derived meanings are all model-theoretic. Besides the traditional meaning derivation in TLG, relying on the Curry-Howard correspondence, there is a similar approach in LFG called 'glue', using linear logic for the derivations. There are also approaches like [9] that read off meanings from proof-nets instead of from derivations. In all these approaches, the common theme is that some proof-theoretic object is used for deriving meanings, and does not constitute the meaning. The latter is usually formulated in some (extension of a) λ-calculus, the terms of which are interpreted model-theoretically in Henkin models. There is a body of work, also going under the title of PTS for sentential meanings, based on constructive type-theory (MLTT), which is clearly related but, we believe, different from our approach to PTS. The differences are discussed in the full paper.
Acknowledgements. The work was supported by EPSRC grant EP/D064015/1, and grant 2006938 by the Israeli Academy for Sciences (ISF). We thank the following colleagues and students for various illuminating discussions, and for critical remarks on preliminary drafts: Gilad Ben-Avi, Iddo Ben-Zvi, Ole Hjortland, James Mckinna, Larry Moss, Dag Prawitz, Stephen Read.
References
1. Ben-Avi, G., Francez, N.: A proof-theoretic reconstruction of generalized quantifiers (2009) (submitted for publication)
2. Brandom, R.B.: Articulating Reasons. Harvard University Press, Cambridge (2000)
3. Dummett, M.: The Logical Basis of Metaphysics. Harvard University Press, Cambridge (1991)
4. Francez, N., Ben-Avi, G.: Proof-theoretic semantic values for logical operators. Synthese (2009) (under refereeing)
5. Francez, N., Dyckhoff, R.: A note on proof-theoretic validity (2007) (in preparation)
6. Francez, N., Dyckhoff, R.: A note on harmony. Journal of Philosophical Logic (2007) (submitted)
7. Francez, N., Dyckhoff, R., Ben-Avi, G.: Proof-theoretic semantics for subsentential phrases. Studia Logica 94, 381–401 (2010), doi:10.1007/s11225-010-9241-y
8. Gentzen, G.: Investigations into logical deduction. In: Szabo, M. (ed.) The Collected Papers of Gerhard Gentzen, pp. 68–131. North-Holland, Amsterdam (English translation of the 1935 paper in German)
9. de Groote, P., Retoré, C.: On the semantic readings of proof-nets. In: Kruijff, G.-J., Oehrle, D. (eds.) Formal Grammar, pp. 57–70. FoLLI (1996)
10. Kremer, M.: Read on identity and harmony – a friendly correction and simplification. Analysis 67(2), 157–159 (2007)
11. Montague, R.: The proper treatment of quantification in ordinary English. In: Hintikka, J., Moravcsik, J., Suppes, P. (eds.) Approaches to Natural Language. Reidel, Dordrecht (1973); proceedings of the 1970 Stanford workshop on grammar and semantics
12. Moortgat, M.: Categorial type logics. In: van Benthem, J., ter Meulen, A. (eds.) Handbook of Logic and Language, pp. 93–178. North-Holland, Amsterdam (1997)
13. Moss, L.: Syllogistic logics with verbs. Journal of Logic and Computation (to appear, 2010)
14. Pfenning, F., Davies, R.: A judgmental reconstruction of modal logic. Mathematical Structures in Computer Science 11, 511–540 (2001)
15. von Plato, J.: Natural deduction with general elimination rules. Archive for Mathematical Logic 40, 541–567 (2001)
16. Prawitz, D.: Natural Deduction: A Proof-Theoretical Study. Almqvist & Wiksell, Stockholm (1965)
17. Prior, A.N.: The runabout inference-ticket. Analysis 21, 38–39 (1960)
18. Read, S.: Harmony and autonomy in classical logic. Journal of Philosophical Logic 29, 123–154 (2000)
19. Read, S.: Identity and harmony. Analysis 64(2), 113–119 (2004); see correction in [10]
20. Read, S.: Harmony and modality. In: Dégremont, C., Keiff, L., Rückert, H. (eds.) Dialogues, Logics and Other Strange Things: Essays in Honour of Shahid Rahman, pp. 285–303. College Publications (2008)
21. Restall, G.: Proof theory and meaning: on the context of deducibility. In: Proceedings of Logica 2007, Hejnice, Czech Republic (2007)
22. Schroeder-Heister, P.: Validity concepts in proof-theoretic semantics. In: Kahle, R., Schroeder-Heister, P. (eds.) Proof-Theoretic Semantics. Synthese, vol. 148, pp. 525–571 (February 2006), special issue
Some Interdefinability Results for Syntactic Constraint Classes

Thomas Graf
Department of Linguistics, University of California, Los Angeles
[email protected] http://tgraf.bol.ucla.edu
Abstract. Choosing as my vantage point the linguistically motivated Müller-Sternefeld hierarchy [23], which classifies constraints according to their locality properties, I investigate the interplay of various syntactic constraint classes on a formal level. For non-comparative constraints, I use Rogers's framework of multi-dimensional trees [31] to state Müller and Sternefeld's definitions in general yet rigorous terms that are compatible with a wide range of syntactic theories, and I formulate conditions under which distinct non-comparative constraints are equivalent. Comparative constraints, on the other hand, are shown to be best understood in terms of optimality systems [5]. From this I derive that some of them are reducible to non-comparative constraints. The results jointly vindicate a broadly construed version of the Müller-Sternefeld hierarchy, yet they also support a refined picture of constraint interaction that has profound repercussions for both the study of locality phenomena in natural language and how the complexity of linguistic proposals is to be assessed.

Keywords: Syntactic constraints, Transderivationality, Economy conditions, Model theoretic syntax, Multi-dimensional trees, Optimality systems.
1 Introduction
Constraints are arguably one of the most prominent tools in modern syntactic analysis. Although the dominance of derivational approaches in the linguistic mainstream since the inception of Chomsky's Minimalist Program [2,3] might suggest otherwise, generative frameworks still feature a dazzling diversity of principles and well-formedness conditions. The array of commonly assumed constraints ranges from the well-established Shortest Move Constraint to the fiercely debated principles of binding theory, but we also find slightly more esoteric proposals such as Rule I [26], MaxElide [35], GPSG's Exhaustive Constant Partial Ordering Axiom [6] or the almost forgotten Avoid Pronoun Principle of classic GB. A closer examination of these constraints shows that they differ significantly in the structures they operate on and how they succeed at restricting the set of expressions. A natural question to ask, then, is if we can identify commonalities
between different constraints, and what the formal and linguistic content of these commonalities might be. The Müller-Sternefeld (MS) hierarchy [23,21] is — to my knowledge — the only articulate attempt at a classification of linguistic constraints so far. Basing their analysis on linguistic reasoning grounded in locality considerations, Müller and Sternefeld distinguish several kinds of constraints, which in turn can be grouped into two bigger classes. The first one is the class of non-comparative constraints (NCCs): representational constraints are well-formedness conditions on standard trees (e.g. ECP, government), derivational constraints restrict the shape of trees that are adjacent in a derivation (e.g. Shortest Move), and global constraints apply to derivationally non-adjacent trees (e.g. Projection Principle). The second class is instantiated by comparative constraints (CCs), which operate on sets of structures. Given a set of structures, a CC returns the best member(s) of this set, which is usually called the optimal candidate. Crucially, the optimal candidate does not have to be well-formed — it just has to be better than the competing candidates. Müller slightly revises this picture in [21] and further distinguishes CCs according to the type of structures they operate on. If the structures in question are trees, the constraint is called translocal (e.g. Avoid Pronoun Principle); if they are derivations, it is called transderivational (e.g. Fewest Steps, MaxElide, Rule I). Finally, it is also maintained in [21] that these five subclasses can be partially ordered by their expressivity: representational = derivational < global < translocal < transderivational. A parametric depiction of the constraint classification and the expressivity hierarchy, which jointly make up the MS-hierarchy, is given in Fig. 1. The MS-hierarchy has a strong intuitive appeal, at least insofar as derivations, long-distance restrictions and operations on sets seem more complex than representations, strictly local restrictions and operations on trees, respectively. However, counterexamples are readily at hand.
[Fig. 1. The Müller-Sternefeld hierarchy of constraints: non-comparative constraints over representations (representational, level 1; derivational over adjacent nodes, level 1; global over arbitrary nodes, level 2) and comparative constraints (translocal over representations, level 3; transderivational over derivations, level 4).]
For instance, it is a simple coding exercise to implement any transderivational constraint as a global constraint by concatenating the distinct derivations into one big derivation, provided there are no substantial restrictions on how we may enrich our grammar formalism. As another example, it was shown in [16] that Minimalist Grammars with the Specifier Island Constraint (SPIC) but without the Shortest Move Constraint can generate any type-0 language. But the SPIC is a very simple derivational constraint, so it unequivocally belongs to the weakest class in the hierarchy, which is at odds with its unexpected effects on expressivity. Therefore, the MS-hierarchy makes the wrong predictions in its current form, or rather, it makes no predictions at all, because its notion of complexity and its assumptions concerning the power of the syntactic framework are left unspecified. In this paper, I show how a model-theoretically informed perspective does away with these shortcomings and enables us to refine the MS-hierarchy such that the relations between constraint classes can be studied in a rigorous yet linguistically insightful way. In particular, I adopt Rogers's multi-dimensional trees framework [31] as a restricted metatheory of linguistic proposals in order to ensure that the results hold for a wide range of syntactic theories. We proceed as follows: After a brief discussion of technical preliminaries I move on to the definition of classes of NCCs in Sect. 3 and study their behavior and interrelationship in arbitrary multi-dimensional tree grammars. I show that a proper subclass of the global constraints can be reduced to local constraints. In Sect. 4, I then turn to a discussion of CCs, why they require the model theoretic approach to be supplemented by optimality systems [5], and which CCs can be reduced to NCCs.
2 Preliminaries
Most of my results I couch in terms of the multi-dimensional tree (MDT) framework developed by Rogers [31,32]. The main appeal of MDTs for this endeavor is that they make it possible to abstract away from theory-specific idiosyncrasies. This allows for general characterizations of constraint classes and their reducibility that hold for a diverse range of linguistic theories. MDT renditions of GB [29], GPSG [27] and TAG [30] have already been developed; the translation procedure from HPSG to TAG defined in [14] should allow us to rein in (a fragment of) the former as well. Further, recent results suggest that an approximation of Minimalist Grammars [33] is feasible, too: for every Minimalist Grammar we can construct a strongly equivalent k-MCFG [20], and for each k ≥ 2, the class of 2^(k−1)-MCFLs properly includes the class of level-k control languages [12], which in turn are equivalent to the string yield of the set of (k + 1)-dimensional trees [31]. While initially intimidating due to cumbersome notation, MDTs are fairly easy to grasp at an intuitive level. Looking at familiar cases first, we note that a string can be understood as a unary branching tree, a set of nodes ordered by the precedence relation. But as there is only one axis along which its nodes are ordered, it is reasonable to call a string a one-dimensional tree, rather than a unary branching one.
[Fig. 2. A T³ (with O a foot node), its node addresses and its 2-dimensional yield.]
In a standard tree, on the other hand, the set of nodes is ordered by two relations, usually called dominance and precedence. Suppose s is the mother of two nodes t and u in some standard tree, and also assume that t precedes u. Then we might say that s dominates the string tu. Given our new perspective on strings as one-dimensional trees, this suggests to construe standard trees as relating nodes to one-dimensional trees by immediate dominance. Thus it makes only sense to refer to them as two-dimensional objects. But from here it is only a small step to the concept of MDTs. A three-dimensional tree (see Fig. 2 for an example) relates nodes to two-dimensional, i.e. standard trees (for readers familiar with TAG, it might be helpful to know that three-dimensional trees correspond to TAG derivations). A four-dimensional tree relates nodes to three-dimensional trees, and so on. In general, a d-dimensional tree is a set of nodes ordered by d dominance relations such that the n-th dominance relation relates nodes to (n − 1)-dimensional trees (for d = 1, assume that single nodes are zero-dimensional trees). To make this precise, we define d-dimensional trees as generalizations of Gorn tree domains. First, let a higher-order sequence be defined inductively as follows:

– ⁰1 := {1}
– ⁿ⁺¹1 is the smallest set containing ⟨⟩ and such that, if both ⟨x₁, …, x_l⟩ ∈ ⁿ⁺¹1 and y ∈ ⁿ1, then ⟨x₁, …, x_l, y⟩ ∈ ⁿ⁺¹1.

Concatenation of sequences is denoted by · and defined only for sequences of the same order. A 0-dimensional tree is either ∅ or {1}. For d ≥ 1, a d-dimensional tree T^d is a set of d-th-order sequences satisfying

– T^d ⊆ ᵈ1, and
– ∀s, t ∈ ᵈ1 [s · t ∈ T^d → s ∈ T^d], and
– ∀s ∈ ᵈ1 [{w ∈ ⁽ᵈ⁻¹⁾1 | s · w ∈ T^d} is a (d − 1)-dimensional tree].

The reader might want to take a look at Fig. 2 again for a better understanding of the correspondence between sequences and tree nodes (first-order sequences are represented by numerals to improve readability; e.g. 0 = ⟨⟩ and 2 = ⟨1, 1⟩). Several important notions are straightforwardly defined in terms of higher-order sequences. The leaves of T^d are the nodes at addresses that are not properly extended by any other address in T^d. The depth of T^d is the length of its longest top-level sequence, which in more intuitive terms corresponds to the length of the longest path of successors at dimension d from the root to a leaf. Given a T^d and some node s of T^d, the child structure of s in T^d is the set {t ∈ T^{d−1} | s · t ∈ T^d}. For example, the child structure of B in Fig. 2 is the T² with its root labeled D. For any T^d and 1 ≤ i ≤ d, its branching factor at dimension i is 1 plus the maximum depth of the T^{i−1} child structures contained by T^d. If the branching factor of some T^d is at most n for all dimensions 1 ≤ i ≤ d, we call it n-branching and write T_n^d. For any non-empty alphabet Σ, T_Σ^d := ⟨T, λ⟩, with T a T^d and λ a function from Σ to ℘(T), is a Σ-labeled d-dimensional tree. In general, we require all trees to be labeled and simply write T_Σ^d. The i-dimensional yield of T_Σ^d is obtained by recursively rewriting all nodes at dimension j > i, starting at dimension d, by their (j − 1)-dimensional child structure. Trees with more than two dimensions have some of their leaves at each dimension i > 2 marked as foot nodes, which are the joints where the (i − 1) child structures are merged together. In forming the 2-dimensional yield of our example tree, K is rewritten by the 2-dimensional tree rooted by M. The daughter of K ends up dominated by O rather than N or P because O is marked as the foot node. For a sufficiently rigorous description of how the i-dimensional yield is computed, see [31, p. 281–283] and [32, p. 301–307]. A sequence s₁, …, s_m of nodes of T^d, m ≥ 1, is an i-path iff, with respect to the i-dimensional yield of T^d, s₁ is the root, s_m a leaf, and for all s_j, s_{j+1}, 1 ≤ j < m, it holds that s_j immediately dominates s_{j+1} at dimension i. The set of all i-paths of T^d is its i-path language. A set of T_Σ^d s is also called a T^d language, denoted L_Σ^d. Unless stated otherwise, the branching factor is assumed to be bounded for every L_Σ^d, that is to say, there is some n ∈ N such that each T ∈ L_Σ^d is n-branching. Call T^d local iff its depth is 1. In Fig. 2, the T³ rooted by K and the T² rooted by M are local; the T³ rooted by B is also local, even though its child structure, the T² rooted by D, is not. A T^d grammar G_Σ^d over an alphabet Σ is a finite language of local T_Σ^d s. Let G_Σ^d(Σ₀) denote the set of T_Σ^d s licensed by a grammar G_Σ^d relative to a set of initial symbols Σ₀ ⊆ Σ, which is the set of all T_Σ^d s with their root labeled by a symbol drawn from Σ₀ and each of their local d-dimensional subtrees contained in G_Σ^d. A language L_Σ^d is a local set iff it is G_Σ^d(Σ₀) for some G_Σ^d and some Σ₀ ⊆ Σ. Intuitively, a local set of T^d s is a T^d language where all trees can be built up from local trees. An important fact about local sets is that they are fully characterized by subtree substitution closure.
Theorem 1 (Subtree substitution closure). L_Σ^d is a local set of T_Σ^d s iff for all T, T′ ∈ L_Σ^d, all s ∈ T and all t ∈ T′, if s and t have the same label, then the result of substituting the subtree rooted by s for the subtree rooted by t is in L_Σ^d.

Proof. An easy lift of the proof in [28] to arbitrary dimension d.
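Theorem 1 invites a direct mechanical test for finite candidate languages. The following sketch is ours, not from the paper: standard (two-dimensional) trees are encoded as nested tuples (label, child, …, child), and all helper names are hypothetical.

```python
# Testing subtree substitution closure for a finite set of standard trees.

def subtrees_with_address(t, addr=()):
    """Yield (address, subtree) pairs in Gorn-style addressing."""
    yield addr, t
    for i, child in enumerate(t[1:]):
        yield from subtrees_with_address(child, addr + (i,))

def substitute(t, addr, s):
    """Replace the subtree of t at address addr by s."""
    if not addr:
        return s
    children = list(t[1:])
    children[addr[0]] = substitute(children[addr[0]], addr[1:], s)
    return (t[0],) + tuple(children)

def is_substitution_closed(language):
    """Check the closure property of Theorem 1 for a finite tree language."""
    language = set(language)
    for t1 in language:
        for t2 in language:
            for a1, s1 in subtrees_with_address(t1):
                for _, s2 in subtrees_with_address(t2):
                    if s1[0] == s2[0] and substitute(t1, a1, s2) not in language:
                        return False
    return True

# {f(a, a)} alone is closed; adding f(a, f(a, a)) forces ever deeper trees,
# so the two-element set below is not closed.
print(is_substitution_closed({("f", ("a",), ("a",))}))            # True
print(is_substitution_closed({("f", ("a",), ("a",)),
                              ("f", ("a",), ("f", ("a",), ("a",)))}))  # False
```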
For our logical approach, we interpret a T_{n,Σ}^d as an initial segment of the relational structure 𝕋_n^d := ⟨T_n^d, ◁_i⟩_{1≤i≤d}, where T_n^d is the infinite T^d in which every point has a child structure of depth n − 1 in all its dimensions, and where ◁_i denotes immediate dominance at dimension i, that is, x ◁_i y iff y is the immediate successor of x in the i-th dimension:

    x ◁_d y iff y = x · ⟨s⟩
    x ◁_{d−1} y iff x = p · ⟨s⟩ and y = p · ⟨s · w⟩
    ⋮
    x ◁_1 y iff x = p · ⟨s · ⟨⋯ w ⋯⟩⟩ and y = p · ⟨s · ⟨⋯ w · 1 ⋯⟩⟩

The weak monadic second-order logic for 𝕋_n^d is denoted by mso^d and includes — besides the usual connectives, quantifiers and grouping symbols — constants for each ◁_i, 1 ≤ i ≤ d, and two countably infinite sets of variables ranging over individuals and finite subsets, respectively. As usual, we write 𝕋_n^d ⊨ φ[s] to assert that φ is satisfied in 𝕋_n^d under assignment s. For any T^d, all quantifiers are assumed to be implicitly restricted to the initial segment of 𝕋_n^d corresponding to T^d. The set of models of φ is denoted by Mod(φ). This notation extends to sets of formulas in the obvious way. Note that L_Σ^d is recognizable iff L_Σ^d = Mod(Φ) for some set Φ of mso^d formulas. Let me close this section with several minor remarks. The notation A \ B is used to denote set difference. Regular expressions are employed at certain points in the usual way, with the small addition of x^{≤1} as a stand-in for the empty string and x. Finally, I will liberally drop subscripts and superscripts whenever possible.
3 Non-comparative Constraints

3.1 Logics for Non-comparative Constraints
A short glimpse at the MS-hierarchy in Fig. 1 reveals that NCCs are distinguished by two parameters: the distance between the nodes they restrict (1 versus unbounded) and the type of structure they operate on (representations versus derivations). As I will show now, this categorization can be sharpened by recasting it in logical terms, thereby opening it up to our mathematical explorations in the following section. The distinction between representations and derivations is merely a terminological confusion in our multi-dimensional setup. A two-dimensional tree, for instance, can be interpreted as both a representational tree structure and a string derivation. This ambiguity is particularly salient for higher dimensions, where there are no linguistic preconceptions concerning the type of structure we are operating on. A better solution, then, is to distinguish NCCs according
to the highest dimension they mention in their specification (this will be made precise soon). As for the distance between restricted nodes, it seems to be best captured by the distinction between local and recognizable sets, the latter allowing for unbounded dependencies between nodes while the former are limited to well-formedness conditions that apply within trees of depth 1. As mentioned in Sect. 2, definability in mso is a logical characterization of recognizability, so in conjunction with the MDT framework, this already gives us everything we need to give a theory-neutral definition of global NCCs. For the second, restricted kind of constraints, however, we still need a logical characterization of local sets. Fortunately, this characterization was obtained for two-dimensional trees by Rogers in [28] and can easily be lifted to higher dimensions as follows.

For any D ∈ {◁_i, ▷_i}_{i≥1}, let ⟨D⟩φ(x) abbreviate the mso^k formula ∃y[x D y ∧ φ(y)], where x ▷_i y := y ◁_i x. We require that 𝕋_n^d ⊨ [D]φ(x)[s] iff 𝕋_n^d ⊨ ∀y[x D y → φ(y)][s]. Declaring all other uses of quantification to be illicit yields what may be regarded as a normal modal logic.

Definition 1 (RLOC^k). rloc^k (relaxed loc^k) is the smallest set of mso^k formulas over the boolean operators, individual variables, set variables and all ⟨◁_i⟩, ⟨▷_i⟩, 1 ≤ i ≤ k.

In the next step, we restrict disjunction. Let loc^{k+} be the smallest set of rloc^k formulas such that
– all ⟨◁_i⟩ and ⟨▷_j⟩, i < j ≤ k, are in the scope of exactly one more ⟨◁_k⟩ than ⟨▷_k⟩, and
– all ⟨▷_k⟩ are in the scope of exactly as many ⟨◁_k⟩ as ⟨▷_k⟩.
Similarly, let loc^{k−} be the smallest set of rloc^k formulas such that
– all ⟨◁_i⟩ and ⟨▷_j⟩, i < j ≤ k, are in the scope of exactly as many ⟨◁_k⟩ as ⟨▷_k⟩, and
– all ⟨◁_k⟩ are in the scope of exactly one more ⟨▷_k⟩ than ⟨◁_k⟩.

Definition 2 (LOC^k). The set of loc^k formulas consists of all and only those formulas that are conjunctions of
– disjunctions of formulas in loc^{k+}, and
– disjunctions of formulas in loc^{k−}.

The following lemmata tell us that loc^k restricts only ⟨◁_k⟩ and ⟨▷_k⟩ in a meaningful way. This will also be of use in the next section.

Lemma 1 (RLOC^k and LOC^{k+1}). A formula φ is an rloc^k formula iff it is a loc^{k+1} formula containing no ⟨◁_{k+1}⟩ and no ⟨▷_{k+1}⟩.

Proof. By induction on the complexity of φ. The crucial condition is the first clause in the definition of loc^{k−}.

Lemma 2 (Normal forms). Every loc^{k+} formula is equivalent to a disjunction of conjunctions of loc^{k+} formulas of the form ⟨◁_k⟩{⟨◁_i⟩, ⟨▷_i⟩}*_{1≤i<k} …

Lemma 3 (LOC^d ≮ RLOC^{d−1}). There is a set Φ of loc^d formulas, d > 1, such that the (d − 1)-dimensional yield of Mod(Φ) is not definable in rloc^{d−1}.

Proof. We already know from Lemma 1 that rloc^{d−1} ≤ loc^d. Now consider the language L := ({a, b, d}* (ad*c)* {a, b, d}*)*. So every string in L with a c also has an a preceding it, and no b may intervene between the two. It is easy to write an rloc¹ formula that requires every node in the string to be labeled with exactly one symbol drawn from {a, b, c, d}. Thus it only remains to sufficiently restrict the distribution of c. If a is at most n steps to the left of c, this can be done by the formula φ := c → ⟨▷₁⟩a ∨ (⟨▷₁⟩⟨▷₁⟩a ∧ ⟨▷₁⟩d) ∨ … ∨ (⟨▷₁⟩ⁿa ∧ ⟨▷₁⟩d ∧ … ∧ ⟨▷₁⟩ⁿ⁻¹d), where ⟨▷₁⟩ⁿ is a sequence of n many ⟨▷₁⟩. But by virtue of our formulas being required to be finite, the presence of an a can be enforced only up to n steps to the left of c. So if a is n + 1 steps away from c, then φ will be false. Similar problems arise if one starts with a moving to the right.
Nor is it possible to use d as an intermediary, as in, say, the formula ψ := d → (⟨▷₁⟩a ∨ ⟨▷₁⟩d) ∧ (¬⟨▷₁⟩⟨▷₁⟩(a ∨ ¬a) → ⟨▷₁⟩a), which forces every sequence of ds to be ultimately preceded by an a. The second conjunct is essential, since it rules out strings of the form d*c. But ψ is too strong, because it is not satisfied by any L-strings containing the substrings bd or cd. Note that we cannot limit ψ to ds preceding c, again due to the finite length of rloc¹ formulas, which is also the reason why we cannot write an implicational formula with b as its antecedent that will block bs from occurring between a and c.¹ This exhausts all possibilities, establishing that rloc¹ fails to define L because it is in general incapable of restricting sequences of unbounded size.²

¹ Note that Φ can remain unchanged since the logical perspective allows for a node to be assigned multiple labels l₁, …, l_n instead of the sequence ⟨l₁, …, l_n⟩ (which is the standard procedure in automata theory).

That L is definable in loc² is witnessed by the following grammar of local 2-dimensional trees (with Σ₀ := {b}), which derives L without the use of additional features:
[Grammar display: local 2-dimensional trees over the alphabet {a, b, c, d} with Σ₀ := {b}.]
This case can be lifted to any dimension k by regarding L as the k-path language of some T^d, 1 ≤ k ≤ d.

Lemma 4 (MSO^d ≮ LOC^{d+1}). There is a set Φ of mso^d formulas, d ≥ 1, such that there is no loc^{d+1} definable set L^{d+1} whose d-dimensional yield is identical to Mod(Φ).

Proof. Consider the language L := (aa)*. Clearly, this language is definable in mso¹ but not in first-order logic over strings, since it involves modulo counting. Hence it cannot be defined in loc¹ either. We now show that loc² is also too weak to define L. As Σ := {a}, the grammar for the tree language with L as its string yield can only consist of trees of depth 1 with all nodes labeled a. Clearly, none of the trees may have an odd number of leaf nodes, since this would allow us to derive a language with an odd number of as. So assume that all trees in our grammar have only an even number of leaves. But local tree sets are characterized by subtree substitution closure, whence we could rewrite a single leaf in a tree with an even number of leaves by another tree with an even number of leaves, yielding a complex tree with an odd number of leaf nodes. This proves undefinability of L in loc². We can again lift this example to any dimension d ≥ 2 by viewing L as a path language.

We now have loc^k < rloc^k < loc^{k+1} and rloc^k < mso^k and mso^k ≮ loc^{k+1} with respect to expressivity at dimension k, from which it follows immediately that a proper subset of all global constraints can be replaced by local ones.

Theorem 4 (Reducibility at lower dimensions). Let C be the set of all k-global but not k-local constraints. Then C properly includes the set of all c ∈ C for which there is a set Φ of mso^d formulas, k < d, such that Mod(Φ ∪ {c}) is recognizable and there is a (k + 1)-local constraint c′ with Mod(Φ ∪ {c}) = Mod(Φ ∪ {c′}).
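As a quick plausibility check, membership in the two string languages figuring in the proofs above can be tested with ordinary regular expressions. The snippet below is ours; it assumes the reconstructed reading of L in which only ds may intervene between a licensing a and its c.

```python
import re

# Lemma 3: L = ({a,b,d}* (ad*c)* {a,b,d}*)* -- every c is licensed by a
# preceding a, with no b in between.
L3 = re.compile(r"([abd]*(ad*c)*[abd]*)*\Z")
assert L3.match("addc")        # a ... c with only ds in between
assert L3.match("dadcad")      # surrounding material from {a,b,d} is fine
assert not L3.match("abc")     # a b intervenes between a and c
assert not L3.match("bc")      # c without a licensing a

# Lemma 4: L' = (aa)* -- even-length strings of as (modulo counting).
L4 = re.compile(r"(aa)*\Z")
assert L4.match("aaaa") and not L4.match("aaa")
print("all membership checks passed")
```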
² A relaxed version of L is definable by an infinite set of rloc¹ formulas. Let L′ be the set of all strings over {a, b, c, d} containing no substring of the form ad*b⁺d*c, but where a c does not have to be preceded by an a. Then one may write a formula φ′ that checks two intervals I, I′ of size m and n, respectively. In particular, φ′ enforces that no b occurs in I if a is at the left edge of I and no c is contained in I, and c is at the right edge of I′ and no a is contained in I′. Occurrences of b in I′ are banned in a symmetrical way. Pumping m and n independently gives rise to an infinite set of rloc¹ formulas that defines L′.
Since rloc^k is essentially a modal logic, we can even use model theoretic properties of modal logics, e.g. bisimulation invariance, to exhibit sufficient (but not necessary) conditions for reducibility of global constraints. The results in this section have interesting implications for linguistic theorizing. The neutralizing effects of excessive feature coding with respect to NCCs lend support to recent proposals which try to do away with mechanisms of this kind in the analysis of phenomena such as pied-piping (e.g. [1]). That reducibility is limited to a proper subclass of the global constraints, on the other hand, provides us with a new perspective on approaches which severely constrain the size of representational locality domains by recourse to local constraints on derivations (e.g. [22]). In the light of my results, they seem to be less about reducing the size of locality domains — a quantitative notion — than determining the qualitative power of global constraints in syntax.
4 Comparative Constraints

4.1 Model Theory and Comparative Constraints — A Problem
Our interest in NCCs is almost entirely motivated by linguistic considerations. CCs, on the other hand, are intriguing from a mathematical perspective, too, because they make the well-formedness of a structure depend on the presence or absence of other structures, which is uncommon in model theoretic settings, to say the least. As we will see in a moment when we take a look at the properties of various subclasses of CCs, this peculiar trait forces us to move beyond a purely model theoretic approach, but — unexpectedly — not for all CCs. According to Müller [21], CCs are either translocal or transderivational. Several years earlier, however, it had already been noticed by Potts [24] that the metarules of GPSG instantiate a well-behaved subclass of CCs. By definition, metarules are restrictions on the form of a grammar. They specify a template, and a grammar has to contain all rules that can be generated from said template. Metarules can be fruitfully applied towards several ends, e.g. to extract multiple constituent orders from a single rule or to ensure that the legitimacy of one construction in language L entails that another construction is licit in L, too. From these two examples it should already be clear that metarules, although they are stated as restrictions on grammars, serve in restricting entire languages rather than just the structures contained by them. In particular, metarules are a special case of closure conditions on tree languages. From this perspective, it is not too surprising that GPSG-style metarules are provably mso²-definable [24]. In fact, many closure constraints besides metarules can be expressed in mso² as formula schemes [27]. With respect to the MS-hierarchy, this has several implications. First, there are more subclasses of CCs than predicted. Second, metarules instantiate a subclass that despite initial appearance can be represented by global constraints. Third, not all closure constraints are reducible to global constraints, since a formula scheme might give rise to an infinite set of formulas, which cannot be replaced by a single mso formula of finite length.
Given that closure constraints are already more powerful than global constraints, the high position of translocal and transderivational constraints in the MS-hierarchy would be corroborated if closure constraints could be shown to be too weak to faithfully capture either class. This seems to be the case. Consider the translocal constraint Avoid Pronoun. Upon being handed a 2-dimensional tree T ∈ L, Avoid Pronoun computes T's reference set, which is the set of trees in L that can be obtained from T by replacing overt pronouns by covert ones and vice versa. Out of this set, it then picks the tree containing the fewest occurrences of overt pronouns as the optimal output candidate. One might try to formalize Avoid Pronoun as a closure constraint on L such that for every T ∈ L, no T′ ≠ T in the reference set of T is contained in L. This will run into problems when there are several optimal output candidates, but it is only a minor complication compared to the greater, in fact insurmountable challenge a closure constraint implementation of Avoid Pronoun faces: it permits any output candidate T′, not just optimal ones, to be in L as long as no candidates competing with T′ belong to L. In other words, the closure constraint implementation of Avoid Pronoun allows for the selection of any candidate as the optimal output candidate under the proviso that all other output candidates are discarded. This means complete failure at capturing optimality, the very essence of CCs.
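The optimization step described here is easy to render concretely. The following toy sketch is ours; the candidate strings, the pronoun inventory and the covert element PRO are invented purely for illustration.

```python
# Avoid Pronoun as a comparative constraint: from a reference set of
# competitors, keep the candidate(s) with the fewest overt pronouns.

def overt_pronouns(candidate):
    """Economy measure: number of overt pronoun tokens in the candidate."""
    return candidate.count("him") + candidate.count("her")

def optimal_candidates(reference_set):
    """Return the best member(s); they need not be well-formed, merely
    better than the competition, so several optima may survive."""
    best = min(map(overt_pronouns, reference_set))
    return {c for c in reference_set if overt_pronouns(c) == best}

refset = {"John wants him to leave", "John wants PRO to leave"}
print(optimal_candidates(refset))  # {'John wants PRO to leave'}
```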
4.2 Comparative Constraints as Optimality Systems
From a model theoretic perspective, CCs are a conundrum. In order to verify that a set L of structures satisfies a CC, it does not suffice to look at L in isolation; we also have to consider what L looked like before the CC was applied to it. This kind of temporal reasoning is not readily available in model theory. Admittedly one could meddle with the models to encode such metadata, e.g. by moving to an ordered set of sets, but this is likely to obfuscate rather than illuminate our understanding of CCs in linguistics. Besides, trees are sets of nodes and tree languages are sets of sets of nodes, so our models would be sets of sets of sets of nodes and hence out of the reach of mso, pushing us into the realm of higher-order logics and beyond decidability. For these reasons, then, it seems advisable to approach CCs from a different angle with tools that are already well-adapted to optimality and economy conditions: optimality systems (OSs) [5].

Definition 4 (Optimality system). An optimality system over languages L, L′ is a pair O := ⟨Gen, C⟩ with Gen ⊆ L × L′ and C := ⟨c₁, …, c_n⟩ a linearly ordered sequence of functions c_i : range(Gen) → ℕ. For ⟨i, o⟩, ⟨i, o′⟩ ∈ Gen, …

For k > 2 set

– V_k = V_{k−1} ∪ {k.i | i = 1 … k−2},
– ◁₀₀ᵏ = ◁₀₀ᵏ⁻¹ ∪ {(k.i, k.(i+1)) | i = 1 … k−3} ∪ {(k.(k−2), (k−1).1)},
– ◁₀₁ᵏ = ◁₀₁ᵏ⁻¹ = ∅,
– ◁₁₀ᵏ = ◁₁₀ᵏ⁻¹ = ∅,
– ◁₁₁ᵏ = ◁₁₁ᵏ⁻¹ ∪ {(k.i, i.1) | i = 1 … k−2},
– G_k = (V_k, ◁₀₀ᵏ, ◁₀₁ᵏ, ◁₁₀ᵏ, ◁₁₁ᵏ).
The MDSes G2 , G3 and G4 are depicted in Figure 3. The MDSes G5 and G6 – being rather large – are depicted in Appendix A. It is immediately obvious that Gk−1 is a subgraph of Gk . We quickly check that Gk is indeed an MDS.
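The inductive definition lends itself to a direct implementation. The sketch below is ours; it assumes, for the base case, that G₂ consists of the single 00-edge from 2.1 to 1.1, and it confirms the 2-sparseness bound used later in Corollary 1.

```python
# Building the MDSes G_k; vertices are pairs (m, i) rendering addresses m.i.

def build_G(k):
    assert k >= 2
    V = {(1, 1), (2, 1)}
    e00 = {((2, 1), (1, 1))}   # assumed base case G_2
    e11 = set()
    for n in range(3, k + 1):
        V |= {(n, i) for i in range(1, n - 1)}
        e00 |= {((n, i), (n, i + 1)) for i in range(1, n - 2)}
        e00 |= {((n, n - 2), (n - 1, 1))}
        e11 |= {((n, i), (i, 1)) for i in range(1, n - 1)}
    return V, e00, e11

for k in range(2, 7):
    V, e00, e11 = build_G(k)
    E = e00 | e11
    assert len(E) <= 2 * len(V)          # uniform 2-sparseness
    print(k, "vertices:", len(V), "edges:", len(E))
```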
[Fig. 3. The MDSes G₂, G₃ and G₄.]
Lemma 3. For k > 1 each graph G_k is a simple MDS.

Proof. Conditions (1), (2), and (4) (rootedness) are obviously true for all G_k. None of the graphs contains a loop (5). And simplicity is also observed. What remains to be shown are conditions (6) and (7), stating that the set of parents of each node is linearly ordered and that the lowest element in the order is the only root link. Observe that for each G_k all nodes are linearly ordered by ◁₀₀ᵏ and that by definition of ◁₀₀ᵏ each node different from the root has a single root link. A node (i.j) with j > 1 has a single parent, which is (i.(j−1)). For i > 1 all nodes (i.1) have set M(i.1) = {(l.i) | i + 1 < l ≤ k} ∪ {((i+1).(i−1))}. For node (1.1), M(1.1) = {(l.1) | 2 ≤ l ≤ k}.

Lemma 4. For k > 1 each MDS G_k contains the complete graph K_k as a minor.

Proof. For k = 2, 3 the complete graph K_k is the undirected version of G_k. For k > 3 the lemma is shown by induction on k. The general method is to contract all edges that connect vertices with the same main address. For k = 4 contract 4.1 ◁₀₀⁴ 4.2. As there is an edge from 4.1 to 1.1 and one each from 4.2 to 3.1 and 2.1, the undirected graph after contraction is K₄. For k > 4 contract the set of edges {k.i ◁₀₀ᵏ k.(i+1) | 1 ≤ i ≤ k−3}. As a result the set of vertices {k.i | 1 ≤ i ≤ k−2} is fused to a single vertex k. By induction hypothesis, the subgraph G_{k−1} can be contracted to K_{k−1}. Now, since there is an edge from k.i to i.1 for each 1 ≤ i ≤ k−2 and an edge from k.(k−2) to (k−1).1 (by definition of G_k), each vertex in K_{k−1} is connected to some vertex in {k.i | 1 ≤ i ≤ k−2}. After fusing these into a single vertex k the resulting graph is thus K_k.

Theorem 1. The MS₂-theory of the classes of MDSes and simple MDSes is undecidable.

Proof. As a consequence of the above lemma and Lemma 2, the classes of MDSes and simple MDSes have unbounded tree width.
Seese ([7], Theorem 8) showed that if a class of graphs has a decidable MS₂-theory, it has bounded tree width.
6 Equivalence of MS₁ and MS₂ on MDSes
The aim of this section is to show that MS₁ has the same expressive power over MDSes as MS₂. In other words, the option of edge set quantification does not extend the expressive power of MSO on MDSes. To show this we use a criterion by Courcelle. He showed in [3] that for uniformly k-sparse classes of simple graphs the two logics MS₁ and MS₂ have the same expressive power. A class of graphs is uniformly k-sparse if for some fixed k the number of edges of each subgraph of a graph is at most k times the number of vertices.

Definition 6. A finite multi graph G is k-sparse if there is some natural number k such that Card(E_G) ≤ k · Card(V_G). A finite multi graph G is uniformly k-sparse if each subgraph of G is k-sparse. A class of finite multi graphs is uniformly k-sparse if there is some natural number k such that each multi graph of the class is uniformly k-sparse.

On the basis of the following little lemma it is easy to see that MDSes are uniformly 2-sparse.

Lemma 5. Let G be a multi graph. If the maximal in-degree of G is d then G is uniformly d-sparse. If the maximal out-degree of G is d then G is uniformly d-sparse.

Proof. We can count edges by counting end points or starting points of edges, i.e.,

    Card(E_G) = Σ_{v ∈ V_G} indeg(v) = Σ_{v ∈ V_G} outdeg(v).

If the maximal in-degree (out-degree) is d, the above equation can be estimated by Card(E_G) ≤ d · Card(V_G). See also [3], Lemma 3.1.
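The counting identity in the proof can be illustrated numerically; the toy multigraph below is ours.

```python
# |E| equals the sum of in-degrees and the sum of out-degrees.
edges = [(0, 1), (0, 2), (2, 1)]
nodes = {v for e in edges for v in e}

indeg = {v: sum(1 for _, t in edges if t == v) for v in nodes}
outdeg = {v: sum(1 for s, _ in edges if s == v) for v in nodes}

assert len(edges) == sum(indeg.values()) == sum(outdeg.values())
# With maximal out-degree d, |E| <= d * |V| follows immediately:
assert len(edges) <= max(outdeg.values()) * len(nodes)
print(len(edges), sum(indeg.values()), sum(outdeg.values()))
```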
Corollary 1. The class of MDSes is uniformly 2-sparse.

Proof. MDSes share with binary trees the property of having a maximal out-degree of 2.

Thus simple MDSes fulfil the criterion set out in [3].

Proposition 1. The logics MS₁ and MS₂ have the same expressive power over the class of simple MDSes.

Proof. By Theorem 5.1 of [3], the same properties of multi graphs are expressible by MS₁ and MS₂ formulae for the class of finite simple 2-sparse multi graphs.
Corollary 2. The MS₁-theory of the class of simple MDSes is undecidable.

Proof. Follows immediately from the above proposition and Theorem 1.

The restriction to simple MDSes can be overcome on the basis of the following observation. Since we have only four colours of edges, simplicity can be defined in first-order logic. The following axiom does this:

    ∀x, y ((x ◁₀₀ y ∨ x ◁₀₁ y) ⟺ ¬(x ◁₁₀ y ∨ x ◁₁₁ y)).

Theorem 2. The MS₁-theory of the class of MDSes is undecidable.

Proof. Suppose the MS₁-theory of the class of MDSes were decidable. Add the above axiom of simplicity to gain a decision procedure for the MS₁-theory over simple MDSes. This contradicts Corollary 2.

Theorem 3. The logics MS₁ and MS₂ have the same expressive power over the class of MDSes.

Proof. Both theories have the same degree of undecidability.
7 Conclusion
We showed that both the MS₁-theory and the MS₂-theory over MDSes are undecidable – contrary to what Kracht conjectured. There was a good reason for Kracht's conjecture, namely that MS₁ is not much more powerful than PDL. So how can this result be interpreted? We would like to propose the following view. Courcelle showed that the property of being a minor is definable by an MSO-definable transduction. But this property is not PDL-definable. It is not possible to code grids in a direct way with MDSes, basically because any set of parents is linearly ordered by dominance. But grids can be minors of MDSes. There is the question whether we can find a natural restriction on MDSes to bound their tree width to regain decidability of the MSO-theories. It is of course possible to just demand this or enforce it by, e.g., demanding MDSes to be generable by context-free graph grammars. But these restrictions do not seem to have a motivation different from bounding the tree width and thus seem arbitrary. It would be much nicer if restrictions could be found that relate multidominance to restrictions for movement.
References
1. Bodlaender, H.L.: A partial k-arboretum of graphs with bounded treewidth. Theoretical Computer Science 209, 1–45 (1998)
2. Courcelle, B.: Graph rewriting: An algebraic and logic approach. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, vol. B, ch. 5, pp. 193–242. Elsevier, Amsterdam (1990)
3. Courcelle, B.: The monadic second-order logic of graphs XIV: uniformly sparse graphs and edge set quantifications. Theoretical Computer Science 299(1-3), 1–36 (2003)
4. Kracht, M.: Syntax in chains. Linguistics and Philosophy 24, 467–529 (2001)
5. Kracht, M.: On the logic of LGB type structures. Part I: Multidominance structures. In: Hamm, F., Kepser, S. (eds.) Logics for Linguistic Structures, pp. 105–142. Mouton de Gruyter, Berlin (2008)
6. Robertson, N., Seymour, P.: Graph minors II. Algorithmic aspects of tree-width. Journal of Algorithms 7(3), 309–322 (1986)
7. Seese, D.: The structure of the models of decidable monadic theories of graphs. Annals of Pure and Applied Logic 53, 169–195 (1991)
A MDSes G₅ and G₆

[Fig. 4. MDS G₅.]
[Fig. 5. MDS G₆.]
The Equivalence of Tree Adjoining Grammars and Monadic Linear Context-Free Tree Grammars

Stephan Kepser¹ and James Rogers²

¹ Collaborative Research Centre 441, University of Tübingen, Tübingen, Germany
[email protected]
² Computer Science Department, Earlham College, Richmond, IN, USA
[email protected]

Abstract. It has been observed quite early after the introduction of Tree Adjoining Grammars that the adjoining operation seems to be a special case of the more general deduction step in a context-free tree grammar (CFTG) derivation. TAGs look like special cases of a subclass of CFTGs, namely monadic linear CFTGs. More than a decade ago it was shown that the two grammar formalisms are indeed weakly equivalent, i.e., define the same classes of string languages. This paper now closes the remaining gap, showing the strong equivalence for so-called non-strict TAGs, a variant of TAGs where the restrictions for head and foot nodes are slightly generalised.
1 Introduction
Tree Adjoining Grammars [5,6] (TAGs) are a grammar formalism introduced by Joshi to extend the expressive power of context-free string grammars (alias local tree grammars) in a small and controlled way to render certain known mildly context-sensitive phenomena in natural language. The basic operation in these grammars, the adjunction operation, consists in replacing a node in a tree by a complete tree drawn from a finite collection. Context-free Tree Grammars (CFTGs, see [4] for an overview) have been studied in informatics since the late 1960s. They provide a very powerful mechanism of defining tree languages. Rules of a CFTG define how to replace non-terminal nodes by complete trees. It has been observed quite early after the introduction of TAGs that the adjoining operation seems to be a special case of the more general deduction step in a CFTG-derivation. TAGs look like special cases of subclasses of CFTGs. This intuition was strengthened by showing that the yield languages definable by TAGs are equivalent to the yield languages definable by monadic linear
non-deleting CFTGs, as was shown independently by Mönnich [9] and Fujiyoshi & Kasai [3]. The question of the strong equivalence of the two formalisms remained unanswered. Rogers [10,11] introduced a variant of TAGs called non-strict TAGs. Non-strict TAGs generalise the definition of TAGs by relaxing the conditions that the root node and foot node of an elementary tree must bear equal labels and that the label of the node to be replaced must be equal to the root node of the adjoined tree. The first proposal of such an extension of TAGs was made by Lang [7]. The new variant of TAGs looks even more like a subclass of CFTGs. And indeed, non-strict TAGs and monadic linear CFTGs are strongly equivalent. This is the main result of the present paper. We would like to point out that there is a small technical issue connected with this result. Call a tree ranked iff for every node the number of its children is a function of its label. It is well known that CFTGs define ranked trees. TAGs on the other hand define unranked trees. A tree generated by a TAG may have a leaf node and an internal node labelled with the same label. Taking the definition of ranked trees strictly, this is not possible with CFTG-generated trees. Our view on this issue is the standpoint taken by practical informatics: a function is not just defined by its name, rather by its name and arity. Hence a three-place A can be distinguished from a constant A by their difference in arity, though the function – or label – name is the same. For every label, a TAG only introduces a finite set of arities. Hence we opt for extending the definition of a ranked alphabet to be a function from labels to finite sets of natural numbers. The equivalence result and a previous result by Rogers [11] provide a new characterisation of the class of tree languages definable by monadic linear CFTGs by means of logics. A tree language is definable by an MLCFTG if and only if it is the two-dimensional yield of an MSO-definable three-dimensional tree language. The paper is organised as follows. The next section introduces trees, context-free tree grammars, tree adjoining grammars, and three-dimensional trees. Section 3 introduces a special type of CFTGs called footed CFTGs. These grammars can be seen as the CFTG-counterpart of TAGs. Section 3 contains the equivalence of monadic linear CFTGs and footed CFTGs. Section 4 shows that footed CFTGs are indeed the CFTG-counterpart of TAGs, providing the equivalence of both grammar types. The next section states the aforementioned logical characterisation of the expressive power of MLCFTGs. Due to space restrictions, technical proofs had to be left out, unfortunately.
2 Preliminaries

2.1 Trees
We consider labelled finite ordered ranked trees. A tree is ordered if there is a linear order on the daughters of each node. A tree is ranked if the label of a node implies the number of daughter nodes. A tree domain is a finite subset of the set of strings over natural numbers that is closed under prefixes and left sisters. Formally, let N∗ denote the set of
all finite sequences of natural numbers including the empty sequence and N⁺ be the set of all finite sequences of natural numbers excluding the empty sequence. A set D ⊂fin N* is called a tree domain iff for all u, v ∈ N*: uv ∈ D ⇒ u ∈ D (prefix closure) and for all u ∈ N*, i ∈ N: ui ∈ D ⇒ ∀j < i: uj ∈ D (closure under left sisters). An element of a tree domain is an address of a node in the tree. It is called a position. Let Σ be a set of labels. A tree is a pair (D, λ) where D is a tree domain and λ : D → Σ is a tree labelling function. The set of all trees labelled with symbols from Σ is denoted T_Σ. A tree language L ⊆ T_Σ is just a subset of T_Σ. A set Σ of labels is ranked if there is a function ρ : Σ → ℘fin(N) assigning each symbol an arity. If t = (D, λ) is a tree over a ranked alphabet Σ then for each position p ∈ D: ρ(λ(p)) = n ⇒ p(n − 1) ∈ D and pm ∉ D for every m ≥ n. If X is a set (of symbols) disjoint from Σ, then T_Σ(X) denotes the set of trees T_{Σ∪X} where all elements of X are taken as constants. The elements of X are understood to be "variables". Let X = {x₁, x₂, x₃, …} be a fixed denumerable set of variables. Let X₀ = ∅ and, for k ≥ 1, X_k = {x₁, …, x_k} ⊂ X. For k ≥ 0, m ≥ 0, t ∈ T_Σ(X_k), and t₁, …, t_k ∈ T_Σ(X_m), we denote by t[t₁, …, t_k] the result of substituting t_i for x_i in t. Note that t[t₁, …, t_k] is in T_Σ(X_m). Note also that for k = 0, t[t₁, …, t_k] = t.
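The substitution operation t[t₁, …, t_k] is straightforward to implement. A minimal sketch (ours; trees are nested tuples (label, child, …, child), and the variable labels x1, x2, … are treated as reserved):

```python
def substitute(t, args):
    """Replace each leaf labelled xi by args[i-1], everywhere in t."""
    label, children = t[0], t[1:]
    if not children and label.startswith("x"):
        return args[int(label[1:]) - 1]
    return (label,) + tuple(substitute(c, args) for c in children)

# t = F(g(x1, x2)) with t1 = a and t2 = h(b): t[t1, t2] = F(g(a, h(b)))
t = ("F", ("g", ("x1",), ("x2",)))
print(substitute(t, [("a",), ("h", ("b",))]))
# ('F', ('g', ('a',), ('h', ('b',))))
```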
2.2 Context-Free Tree Grammars
We start with the definition of a context-free tree grammar, quoting [1].

Definition 1. A context-free tree grammar is a quadruple G = (Σ, F, S, P) where Σ is a finite ranked alphabet of terminals, F is a finite ranked alphabet of nonterminals or function symbols, disjoint with Σ, S ∈ F⁰ is the start symbol, and P is a finite set of productions (or rules) of the form F(x₁, …, x_k) → τ, where F ∈ Fᵏ and τ ∈ T_{Σ∪F}(X_k).

We use the convention that for k = 0 an expression of the form F(τ₁, …, τ_k) stands for F. In particular, for F ∈ F⁰, a rule is of the form F → τ with τ ∈ T_{Σ∪F}. We sometimes use little superscripts to indicate the arity of a nonterminal, as in F³. The term context-free tree grammar is abbreviated by CFTG. For a context-free tree grammar G = (Σ, F, S, P) we now define the direct derivation relation. Let n ≥ 0 and let σ₁, σ₂ ∈ T_{Σ∪F}(X_n). We define σ₁ ⇒_G σ₂ if and only if there is a production F(x₁, …, x_k) → τ, a tree η ∈ T_{Σ∪F}(X_{n+1}) containing exactly one occurrence of x_{n+1}, and trees ξ₁, …, ξ_k ∈ T_{Σ∪F}(X_n) such that σ₁ = η[x₁, …, x_n, F(ξ₁, …, ξ_k)] and σ₂ = η[x₁, …, x_n, τ[ξ₁, …, ξ_k]].
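One direct derivation step can be sketched as follows (ours; same nested-tuple encoding as above, and for simplicity the first occurrence of the nonterminal found in a pre-order traversal is rewritten):

```python
def substitute(t, args):
    """Replace each leaf labelled xi by args[i-1], everywhere in t."""
    label, children = t[0], t[1:]
    if not children and label.startswith("x"):
        return args[int(label[1:]) - 1]
    return (label,) + tuple(substitute(c, args) for c in children)

def derive_step(tree, nonterminal, rhs):
    """Replace one subtree F(xi1, ..., xik) by rhs[xi1, ..., xik]."""
    label, children = tree[0], tree[1:]
    if label == nonterminal:
        return True, substitute(rhs, list(children))
    new_children = list(children)
    for i, c in enumerate(children):
        done, c2 = derive_step(c, nonterminal, rhs)
        if done:
            new_children[i] = c2
            return True, (label,) + tuple(new_children)
    return False, tree

# A(x1) -> g(a, x1, d) applied to the tree A(e):
_, result = derive_step(("A", ("e",)), "A", ("g", ("a",), ("x1",), ("d",)))
print(result)  # ('g', ('a',), ('e',), ('d',))
```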
In other words, σ₂ is obtained from σ₁ by replacing an occurrence of a subtree F(ξ₁, …, ξ_k) by the tree τ[ξ₁, …, ξ_k]. As usual, ⇒*_G stands for the reflexive-transitive closure of ⇒_G. For a context-free tree grammar G, we define L(G) = {t ∈ T_Σ | S ⇒*_G t}. L(G) is called the tree language generated by G. Two grammars G and G′ are equivalent if they generate the same tree language, i.e., L(G) = L(G′). We define three subtypes of context-free tree grammars. A production F(x₁, …, x_k) → τ is called linear if each variable x₁, …, x_k occurs at most once in τ. Linear productions do not allow the copying of subtrees. A tree grammar G = (Σ, F, S, P) is called a linear context-free tree grammar if every rule in P is linear. All the CFTGs we consider in this paper are linear. Secondly, a rule F(x₁, …, x_k) → τ is non-deleting if each variable x₁, …, x_k occurs in τ. A CFTG is non-deleting if each rule is non-deleting. Thirdly, a CFTG G = (Σ, F, S, P) is monadic if Fᵏ = ∅ for every k > 1. Non-terminals can only be constants or of rank 1. Monadic linear context-free tree grammars are abbreviated MLCFTGs.

Example 1. Let G₁ = ({g³, a, b, c, d, e}, {S, A¹}, S, P) where P consists of the three rules

S → A(e)
A(x) → g(a, A(g(b, x, c)), d)
A(x) → g(a, g(b, x, c), d)

G₁ is monadic, linear, and non-deleting. The tree language generated by G₁ is not regular. Its yield language is the non-context-free language {aⁿbⁿecⁿdⁿ | n ≥ 1}.
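The claimed yield language is easy to verify mechanically. The following check is ours: it expands A(e) with n − 1 applications of the recursive rule followed by the base rule and compares the resulting yield with aⁿbⁿecⁿdⁿ.

```python
def yield_of(t):
    """String yield of a nested-tuple tree."""
    return t[0] if len(t) == 1 else "".join(yield_of(c) for c in t[1:])

def A(x, n):
    """Expand A(x): n-1 uses of the recursive rule, then the base rule."""
    inner = ("g", ("b",), x, ("c",))
    if n == 1:
        return ("g", ("a",), inner, ("d",))        # A(x) -> g(a, g(b,x,c), d)
    return ("g", ("a",), A(inner, n - 1), ("d",))  # A(x) -> g(a, A(g(b,x,c)), d)

for n in (1, 2, 3):
    assert yield_of(A(("e",), n)) == "a"*n + "b"*n + "e" + "c"*n + "d"*n
print(yield_of(A(("e",), 2)))  # aabbeccdd
```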
2.3 Tree Adjoining Grammars
We consider so-called non-strict Tree Adjoining Grammars. Non-strict TAGs were introduced by Rogers [11] as an extension of TAGs that reflects the fact that adjunction (or substitution) operations are fully controlled by obligatory and selective adjoining constraints. There is hence no need to additionally demand the equality of head and foot node labels or the equality of the labels of the replaced node with the head node of the adjoined tree. Citing [11], a non-strict TAG is a pair (E, I) where E is a finite set of elementary trees in which each node is associated with

– a label – drawn from some alphabet,
– a selective adjunction (SA) constraint – a subset of the set of names of the elementary trees, and
– an obligatory adjunction (OA) constraint – Boolean valued,

and I ⊂ E is a distinguished non-empty set of initial trees. Each elementary tree has a foot node. Formally, let Λ be a set of linguistic labels and Na be a finite set of labels disjoint from Λ (the set of names of trees). A tree is a pair (D, λ) where

– D is a tree domain, and
– λ : D → Λ × ℘(Na) × {true, false} is a labelling function.

Hence a node is labelled by a triple consisting of a linguistic label, an SA constraint, and an OA constraint. We denote by T_{Λ,Na} the set of all trees. An elementary tree is a triple (D, λ, f) where (D, λ) is a tree and f ∈ D is a leaf node, the foot node.

Definition 2. A non-strict TAG is a quintuple G = (Λ, Na, E, I, name) where

– Λ is a set of labels,
– Na is a finite set of tree names,
– E is a finite set of elementary trees,
– I ⊆ E is a finite set of initial trees, and
– name : E → Na is a bijection, the tree naming function.
An adjunction is the operation of replacing a node n with a non-empty SA constraint by an elementary tree t listed in the SA constraint. The daughters of n become daughters of the foot node of t. A substitution is like an adjunction except that n is a leaf and hence there are no daughters to be moved to the foot node of t. Formally, let t, t′ be two trees and G a non-strict TAG. Then t′ is derived from t in a single step (written t ⇒_G t′) iff there is a position p ∈ D_t and an elementary tree s ∈ E with foot node f_s such that

– λ_t(p) = (L, SA, OA) with L ∈ Λ, SA ⊆ Na, OA ∈ {true, false}, and name(s) ∈ SA,
– D_{t′} = {q ∈ D_t | ¬∃v ∈ N⁺ : q = pv} ∪ {pv | v ∈ D_s} ∪ {pf_s v | v ∈ N⁺, pv ∈ D_t},
– λ_{t′}(q) = λ_t(q) if q ∈ D_t and ¬∃v ∈ N* : q = pv; λ_{t′}(q) = λ_s(v) if v ∈ D_s and q = pv; λ_{t′}(q) = λ_t(pv) if v ∈ N⁺, pv ∈ D_t, q = pf_s v.

We write t′ = adj(t, p, s) if t′ is the result of adjoining s in t at position p. As usual, ⇒*_G is the reflexive-transitive closure of ⇒_G. Note that this definition also subsumes substitution. A substitution is just an adjunction at a leaf node. A tree is in the language of a given grammar if every OA constraint on the way is fulfilled, i.e., no node of the tree is labelled with true as OA constraint. SA and OA constraints only play a role in derivations; they should not appear as labels of trees of the tree language generated by a TAG. Let π₁ be the first projection on a triple. It can be extended in a natural way to apply to trees by setting

– D_{π₁(t)} = D_t, and for each p ∈ D_t,
– λ_{π₁(t)}(p) = L if λ_t(p) = (L, SA, OA) for some SA ⊆ Na, OA ∈ {true, false}.

Now

L(G) = { π₁(t) | ∃s ∈ I such that s ⇒*_G t, and there is no p ∈ D_t with λ_t(p) = (L, SA, true) for some L ∈ Λ, SA ⊆ Na }.
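The position-based definition of adjunction translates almost literally into code. A sketch (ours; trees are represented as dicts from Gorn addresses, encoded as tuples of ints, to plain labels, and the SA/OA bookkeeping is omitted):

```python
def adjoin(t, p, s, foot):
    """adj(t, p, s): adjoin elementary tree s with foot address `foot`
    at position p of tree t, following the three clauses above."""
    new = {}
    for q, lab in t.items():
        if q[:len(p)] == p and len(q) > len(p):
            new[p + foot + q[len(p):]] = lab  # daughters of p move below the foot
        elif q != p:
            new[q] = lab                      # material outside p is kept
    for v, lab in s.items():
        new[p + v] = lab                      # s is plugged in at p
    return new

# Adjoin s = A(B, C) with foot C (address (1,)) at the root of t = X(y):
t = {(): "X", (0,): "y"}
s = {(): "A", (0,): "B", (1,): "C"}
print(adjoin(t, (), s, (1,)))
# {(1, 0): 'y', (): 'A', (0,): 'B', (1,): 'C'}
```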
One of the differences between TAGs and CFTGs is that there is no such concept of a non-terminal symbol or node in TAGs. The thing that comes closest is a node labelled with an OA constraint set to true. Such a node must be further expanded. The opposite is a node with an empty SA constraint. Such a node is a terminal node, because it must not be expanded. Nodes labelled with an OA constraint set to false but a non-empty SA constraint may or may not be expanded. They can neither be regarded as terminal nor as non-terminal nodes.

Example 2. Let G₂ = ({g, a, b, c, d, e}, {1, 2}, E, {name⁻¹(1)}, name) where E and name are as follows. (To enhance readability we simplified node labels for all nodes that have an empty SA constraint. For these nodes we only present the linguistic label; the information (∅, false) is omitted.)
1: g(a, (g, {2}, false)(b, e, c), d), with the leaf e as foot node

2: g(a, (g, {2}, false)(b, g, c), d), with the leaf g as foot node
Note that this grammar is even a strict TA grammar. Note also that it generates the same tree language as the MLCFTG from Example 1, i.e., L(G₂) = L(G₁).

2.4 Three-Dimensional Trees
We introduce the concept of three-dimensional trees to provide a logical characterisation of the tree languages generable by a monadic linear CFTG. Multi-dimensional trees, their logics, grammars and automata are thoroughly discussed in [11]. Here, we just quote those technical definitions needed to state our results. The reader who wishes to gain a better understanding of the concepts and formalisms connected with multi-dimensional trees is kindly referred to [11]. Formally, a three-dimensional tree domain T3 ⊂fin (N*)* is a finite set of sequences where each element of a sequence is itself a sequence of natural numbers such that for all u, v ∈ (N*)*, if uv ∈ T3 then u ∈ T3 (prefix closure), and for each u ∈ (N*)* the set {v | v ∈ N*, uv ∈ T3} is a tree domain in the sense of Subsection 2.1. Let Σ be a set of labels. A three-dimensional tree is a pair (T3, λ) where T3 is a three-dimensional tree domain and λ : T3 → Σ is a (node) labelling function. For a node x ∈ T3 we define its immediate successors in three dimensions as follows. x ◁₃ y iff y = x · m for some m ∈ N*, i.e., x is the longest proper prefix of y. x ◁₂ y iff x = u · m and y = u · mj for some u ∈ T3, m ∈ N*, j ∈ N, i.e., x and y are at the same 3rd-dimensional level, but x is the mother of y in a tree at that level. Finally, x ◁₁ y iff x = u · mj and y = u · m(j + 1) for some u ∈ T3, m ∈ N*, j ∈ N, i.e., x and y are at the same 3rd-dimensional level and x is the immediate left sister of y in a tree at that level. We consider the weak monadic second-order logic over the relations ◁₃, ◁₂, ◁₁. Explanations about this logic and its relationship to T3 grammars and automata can be found in [11].
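The three successor relations can be sketched directly on addresses. The following is ours; an address is encoded as a tuple of tuples of natural numbers.

```python
def succ3(x, y):
    """x <|3 y: y extends x by exactly one more 2-dimensional address."""
    return len(y) == len(x) + 1 and y[:len(x)] == x

def succ2(x, y):
    """x <|2 y: same level, x is the mother of y in the tree at that level."""
    return (len(x) == len(y) >= 1 and x[:-1] == y[:-1]
            and len(y[-1]) == len(x[-1]) + 1 and y[-1][:len(x[-1])] == x[-1])

def succ1(x, y):
    """x <|1 y: same level, y is the immediate right sister of x."""
    return (len(x) == len(y) >= 1 and x[:-1] == y[:-1]
            and len(x[-1]) == len(y[-1]) >= 1
            and x[-1][:-1] == y[-1][:-1] and y[-1][-1] == x[-1][-1] + 1)

x = ((0,), (0, 1))
assert succ3(x, x + ((0,),))
assert succ2(x, ((0,), (0, 1, 0)))
assert succ1(x, ((0,), (0, 2)))
print("successor relation checks passed")
```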
3 The Equivalence between MLCFTGs and Footed CFTGs
The equivalence between MLCFTGs and TAGs is proven by showing that both grammar formalisms are equivalent to a third formalism, so-called footed CFTGs.
[Fig. 1. The equivalence of MLCFTGs and footed CFTGs: MLCFTG → non-deleting MLCFTG → non-deleting collapse-free MLCFTG → footed CFTG → spinal-formed CFTG → MLCFTG.]
Definition 3. Let G = (Σ, F, S, P) be a linear CFT grammar. A rule F(x₁, …, x_k) → t is footed if there exists a position p ∈ D_t such that p has exactly k daughters, for 0 ≤ i ≤ k − 1: λ(pi) = x_{i+1}, and no position different from {p0, …, p(k − 1)} is labelled with a variable. The node p is called the foot node and the path from the root of t to p is called the spine of t. A CFTG G is footed if every rule of G is footed.

Footed CFTGs are apparently the counterpart of non-strict TAGs in the world of context-free grammars. Before we show this, we establish in this section that footed CFTGs are actually equivalent to MLCFTGs. This is done in several intermediate steps, which are sketched in Figure 1. Each arrow indicates that a grammar of one type can be equivalently recast as a grammar of the target type. We first show how to convert an MLCFTG into an equivalent footed CFTG in several steps. The start is the observation by Fujiyoshi that deletion rules do not contribute to the expressive power of MLCFTGs.

Proposition 1. [2] For every monadic linear context-free tree grammar there exists an equivalent non-deleting monadic linear context-free tree grammar.

But even in a non-deleting MLCFTG there may still be rules that delete nodes in a derivation step. Let G = (Σ, F, S, P) be a non-deleting MLCFTG. A rule A(x) → x in P is called a collapsing rule. A collapsing rule actually deletes the non-terminal node A in a tree. If a CFTG does not contain a collapsing rule, the grammar is called collapse-free. Note that by definition footed CFTGs are collapse-free, because there is no position p having daughters (cp. Def. 3). Note also that the example MLCFTG in Ex. 1 is non-deleting and collapse-free. The next proposition shows that collapsing rules can be eliminated from MLCFTGs.

Proposition 2. For every non-deleting MLCFTG there exists an equivalent non-deleting collapse-free MLCFTG.

The idea of the proof is to apply the collapsing rule to all right-hand sides. Thus it is no longer needed. Some care has to be taken if there is another way to expand the non-terminal that can collapse.
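The footedness condition of Definition 3 above is decidable by simple inspection of a right-hand side. A sketch (ours; the same nested-tuple tree encoding as in the earlier snippets, with variables labelled x1, …, xk):

```python
def positions(t, addr=()):
    """Yield (address, subtree) pairs of a nested-tuple tree."""
    yield addr, t
    for i, c in enumerate(t[1:]):
        yield from positions(c, addr + (i,))

def is_footed(t, k):
    """Definition 3 for the rhs t of a rule F(x1, ..., xk) -> t."""
    var_pos = [p for p, s in positions(t)
               if not s[1:] and s[0].startswith("x")]
    for p, s in positions(t):
        kids = s[1:]
        if [c[0] for c in kids] == [f"x{i}" for i in range(1, k + 1)]:
            # all variable occurrences must be exactly the daughters of p
            return sorted(var_pos) == [p + (i,) for i in range(k)]
    return False

# g(a, g(x1, x2, x3), d) is footed for k = 3; g(x1, a) is not for k = 1,
# because the variable has a sister (the MLCFTG problem discussed below).
assert is_footed(("g", ("a",), ("g", ("x1",), ("x2",), ("x3",)), ("d",)), 3)
assert not is_footed(("g", ("x1",), ("a",)), 1)
```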
MLCFTGs are not necessarily footed CFTGs, even when they are non-deleting and collapse-free. The reason is the following. Every right-hand side of every rule of a non-deleting MLCFTG has exactly one occurrence of the variable x. But this variable may have sisters, i.e., there may be subtrees in the rhs which have the same mother as x. Such a rule is clearly not footed, and its rhs can hardly be used as a base for an elementary tree in a TAG. Fortunately, though, a non-deleting collapse-free MLCFTG can be transformed into an equivalent footed CFTG. The resulting footed CFTG is usually not monadic any more. But this does not constitute any problem when translating the footed CFTG into a TAG.

Proposition 3. For every non-deleting collapse-free MLCFTG there exists an equivalent footed CFTG.

The main idea of the proof is the following. Let B(x) → t be a non-footed grammar rule containing the subtree g(t1, x, t2). The undesirable sister subtrees t1 and t2 are replaced by variables, yielding a new rule B(x1, x2, x3) → t′ where t′ is the result of replacing the subtree g(t1, x, t2) by g(x1, x2, x3). The new rule is footed, but now ternary, not monadic. So is the non-terminal B. The original sister subtrees t1 and t2 still have to be dealt with. Suppose there is a grammar rule D(x) → τ such that τ contains a subtree B(θ). In this rhs we replace B(θ) by B(t1, θ, t2). Now the non-terminal B is also ternary in the rhs, and the modified grammar rule can be applied to it. And if we apply the modified grammar rule, the trees t1 and t2 are moved back to being sisters of θ and daughters of the node g below which they were originally found.

Example 3. We convert the non-deleting collapse-free MLCFTG G1 from Example 1 into a footed CFTG G2, whose rules are the following (written in term notation).

S → A(B, e, C)
A(x1, x2, x3) → g(A, g(x1, x2, x3), D)
A(x1, x2, x3) → g(A, A(B, g(x1, x2, x3), C), D)
A → a
B → b
C → c
D → d
Having shown by now that there is an equivalent footed CFTG for every MLCFTG, we now turn to the inverse direction. This is also done via intermediate steps. The following definitions are quoted from [3, p. 62]. A ranked alphabet is head-pointing if it is a triple (Σ, ρ, h) such that (Σ, ρ) is a ranked alphabet and h is a function from Σ to N such that, for each A ∈ Σ, if ρ(A) ≥ 1 then 0 ≤ h(A) < ρ(A), otherwise h(A) = 0. The integer h(A) is called the head of A.

Definition 4. Let G = (Σ, F, S, P) be a CFTG such that F is a head-pointing ranked alphabet. For n ≥ 1, a production A(x1, . . . , xn) → t in P is spinal-formed if it satisfies the following conditions:
– There is exactly one leaf in t that is labelled by xh(A). The path from the root to that leaf is called the spine of t, or simply the spine when t is obvious.
– For a node d ∈ Dt, if d is on the spine and λ(d) = B ∈ F with ρ(B) ≥ 1, then d · h(B) is a node on the spine.
– Every node labelled by a variable in Xn \ {xh(A)} is a child of a node on the spine.
A CFTG G = (Σ, F, S, P) is spinal-formed if every production A(x1, . . . , xn) → t in P with n ≥ 1 is spinal-formed.

The intuition behind this definition, as well as illustrating examples, can be found in [3, p. 63]. We do not quote them here, because spinal-formed CFTGs are just an equivalent form of CFTGs on the way to showing that footed CFTGs can be rendered by MLCFTGs.

Proposition 4. For every footed CFTG there exists an equivalent spinal-formed CFTG.

Note that a rule in a footed CFTG already fulfills the first and the third condition of a spinal-formed rule. What has to be shown is that the set F of non-terminals can be made head-pointing. Since the rhs of a footed CFTG rule has a spine, the non-terminals on the spine can be made head-pointing by following the spine. For all other non-terminals we arbitrarily choose the first daughter to be the head daughter.

Proposition 5. [3] For every spinal-formed CFTG there exists an equivalent MLCFTG.

This is a corollary of Theorem 1 (p. 65) of [3]. The authors note this fact themselves, stating on p. 65 immediately above Theorem 1:
“It follows from Theorem 1 that the class of tree languages generated by spine grammars is the same as the class of tree languages generated by linear non-deleting monadic CFTGs, that is, CFTGs with nonterminals of rank 1 and 0 only, and with exactly one occurrence of x in every right-hand side of a production for a nonterminal of rank 1.” We are now done showing that MLCFTGs are equivalent to footed CFTGs. Theorem 1. A tree language is definable by a monadic linear CFTG if and only if it is definable by a footed CFTG.
4 The Equivalence between Footed CFTGs and TAGs
The aim of this section is to show that footed CFTGs are indeed the counterpart of non-strict TAGs. We first translate footed CFTGs into non-strict TAGs.

Proposition 6. For every footed CFTG there exists an equivalent non-strict TAG.

The basic idea here is that every right-hand side of every rule of the CFTG becomes an elementary tree. The new foot node is the node that is the mother of the variables in the rhs of a rule. Of course, the variables and the nodes bearing them have to be removed from the elementary tree. To construct the TAG, every rhs of the CFTG gets a name. Every non-terminal in a rhs receives an obligatory adjunction constraint; the selection adjunction constraint it receives is the set of names of those rhs that are the rhs of the rules that expand this non-terminal. The initial trees are the rhs of the rules that expand the start symbol of the CFTG.

Formally, let G = (Σ, F, S, P) be a footed CFTG. Let Na be a set of labels such that |Na| = |P|. Define a bijection name between rhs(P) and Na, assigning a name to each right-hand side of a rule in P in some arbitrary way. For a non-terminal A ∈ Fk we define the set RhsA = {name(r) | (A(x1, . . . , xk) → r) ∈ P}. We define a function el-tree : rhs(P) → TΣ∪F,Na by considering two cases. For (A(x1, . . . , xk) → t) ∈ P such that f ∈ Dt with λt(f i) = xi+1 (0 ≤ i ≤ k − 1), set

D = Dt \ {f i | 0 ≤ i ≤ k − 1}

and, for each p ∈ D,

λ(p) = (λt(p), ∅, false)  if λt(p) ∈ Σ,
λ(p) = (B, RhsB, true)  if λt(p) = B ∈ F.

Then el-tree(t) = (D, λ, f).
For (A → t) ∈ P set D = Dt and, for each p ∈ D,

λ(p) = (λt(p), ∅, false)  if λt(p) ∈ Σ,
λ(p) = (B, RhsB, true)  if λt(p) = B ∈ F,

and let f = 0^k for the k ∈ N such that 0^k ∈ D and 0^(k+1) ∉ D (i.e., f is the leaf on the leftmost branch). Then el-tree(t) = (D, λ, f).
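The two cases of el-tree can be rendered in one function; the following sketch (ours, not from the paper; `Rhs` is an assumed map from non-terminals to their name sets) uses the same position-set encoding as above.

```python
def el_tree(dom_t, label_t, foot, k, Rhs, sigma):
    """el-tree: turn the rhs t of a footed rule into an elementary tree.

    dom_t:   the tree domain of t (a set of int-tuples)
    label_t: the labelling of t
    foot:    the foot node f (for rank-0 rules: the leftmost-branch leaf)
    k:       the rank of the rule's left-hand side (0 for the second case)
    Rhs:     assumed map from non-terminals B to their name sets RhsB
    sigma:   the terminal alphabet
    """
    dom = set(dom_t) - {foot + (i,) for i in range(k)}   # drop the variables
    label = {}
    for p in dom:
        l = label_t[p]
        if l in sigma:
            label[p] = (l, frozenset(), False)           # no adjunction here
        else:                                            # non-terminal: OA node
            label[p] = (l, frozenset(Rhs[l]), True)
    return dom, label, foot
```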
We let G′ = (Σ, Na, {el-tree(r) | r ∈ rhs(P)}, {el-tree(t) | (S → t) ∈ P}, name) be the non-strict TAG derived from G.

Example 4. To explain the construction we transform the grammar G2 of Example 3. The names are Na = {1, 2, 3, 4, 5, 6, 7}, with name defined as follows:

Na | rhs
1 | A(B, e, C)
2 | g(A, g(x1, x2, x3), D)
3 | g(A, A(B, g(x1, x2, x3), C), D)
4 | a
5 | b
6 | c
7 | d

We obtain the following elementary trees, rendered here in term notation. (Again we simplify node labels of type (L, ∅, false) to just L. Foot nodes are marked with an asterisk.)

1: (A, {2, 3}, true)((B, {5}, true)*, e, (C, {6}, true))
2: g((A, {4}, true), g*, (D, {7}, true))
3: g((A, {4}, true), (A, {2, 3}, true)((B, {5}, true), g*, (C, {6}, true)), (D, {7}, true))
4: a   5: b   6: c   7: d
Tree 1 is the only initial tree. If we substitute the substitution nodes 4 – 7 into the other elementary trees, the grammar bears a remarkable similarity to the TAG G2 of Example 2. Tree 2 of G2 corresponds to tree 3, and tree 1 of G2 to the result of adjoining tree 2 into 1.

We now show the inverse direction. The idea of the construction is to take the elementary trees as right-hand sides of rules in a footed CFTG to be constructed. The non-terminals that are expanded, and hence the left-hand sides of rules, are those nodes that have an SA constraint that contains the name of the elementary tree under consideration. The arity of the non-terminal is just the number of daughters of such a node.

Proposition 7. For every non-strict TAG there exists an equivalent footed CFTG.

Let G = (Σ, Na, E, I, name) be a non-strict TAG. The construction looks as follows. Let S ∉ Σ be a new symbol (the new start symbol). Set Fk = {(L, SA, v) | ∃t ∈ E ∃p ∈ Dt : λt(p) = (L, SA, v), v ∈ {true, false}, SA ≠ ∅, p(k − 1) ∈ Dt, pk ∉ Dt}. Set F = {S} ∪ ⋃k≥0 Fk, the set of non-terminals. For an elementary tree t = (Dt, λt, f) ∈ E we define rhs(t, k) by D = Dt ∪ {f j | 0 ≤ j ≤ k − 1} and, for each p ∈ D:
λ(p) = L  if λt(p) = (L, ∅, false), L ∈ Σ,
λ(p) = (L, SA, v)  if λt(p) = (L, SA, v), L ∈ Σ, SA ≠ ∅, v ∈ {true, false},
λ(p) = xj+1  if p = f j, 0 ≤ j ≤ k − 1
rhs(t, k) = (D, λ). Note that for k = 0 the tree domain D = Dt. Define P1 as

{(L, SA, v)(x1, . . . , xk) → rhs(t, k) | (L, SA, v) ∈ Fk, t ∈ E : name(t) ∈ SA} ∪ {S → rhs(i, 0) | i ∈ I}

and P2 as

{(L, SA, false)(x1, . . . , xk) → L(x1, . . . , xk) | ∃t ∈ E ∃p ∈ Dt : λt(p) = (L, SA, false), p(k − 1) ∈ Dt, pk ∉ Dt}.

The set P of productions is P1 ∪ P2. Let G′ = (F, Σ, S, P) be a CFTG. A simple check of the definition of the productions shows that G′ is footed. Note that the rules in P1 are used for the derivation proper, while those in P2 serve the purpose of stripping off the undesirable SA and OA constraint information coded in the non-terminals.
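A sketch of this rule extraction (our own illustration; right-hand sides are left symbolic, as ('rhs', name, k) placeholders standing for rhs(t, k)):

```python
def tag_to_cftg(elementary, initial_names, START="S"):
    """Sketch of the production sets P1 and P2 of Proposition 7.

    elementary:    dict name -> (dom, label, foot); labels are (L, SA, oa)
    initial_names: the names of the initial trees in I
    """
    nonterms = set()     # non-terminals: decorated nodes with non-empty SA set
    for dom, label, _ in elementary.values():
        for p in dom:
            L, SA, oa = label[p]
            if SA:
                k = 0
                while p + (k,) in dom:   # arity = number of daughters
                    k += 1
                nonterms.add(((L, frozenset(SA), oa), k))
    p1, p2 = [], []
    for nt, k in nonterms:
        L, SA, oa = nt
        for nm in SA:                    # P1: expand by each tree the SA names
            p1.append((nt, k, ("rhs", nm, k)))
        if not oa:                       # P2: strip the constraint decoration
            p2.append((nt, k, ("bare", L, k)))
    for nm in initial_names:             # start rules S -> rhs(i, 0)
        p1.append((START, 0, ("rhs", nm, 0)))
    return p1, p2
```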
Example 5. To illustrate the construction we provide an example of transforming a non-strict TAG into an equivalent footed CFTG. The input TAG is G2 from Example 2. The set of non-terminals is F = {S, (g, {2}, false)³}. Rule set P1 consists of two rules (in term notation):

(g, {2}, false)(x1, x2, x3) → g(a, (g, {2}, false)(b, g(x1, x2, x3), c), d)

S → g(a, (g, {2}, false)(b, e, c), d)
Rule set P2 consists of a single rule:

(g, {2}, false)(x1, x2, x3) → g(x1, x2, x3)

The footed grammar is given by ({S, (g, {2}, false)}, {g, a, b, c, d, e}, S, P1 ∪ P2).

The above results are summed up in the following two theorems.

Theorem 2. A tree language is definable by a footed CFTG if and only if it is definable by a non-strict TAG.

We can now present the main result of this paper; it is an immediate consequence of the theorem above and Theorem 1.

Theorem 3. The class of tree languages definable by non-strict Tree Adjoining Grammars is exactly the class of tree languages definable by monadic linear context-free tree grammars.
5 A Logical Characterisation
The aim of this section is to show that the theorem above and results by Rogers [11] on TAGs can be combined to yield a logical characterisation of tree languages
definable by monadic linear CFTGs: a tree language is generable by an MLCFTG iff it is the two-dimensional yield of an MSO-definable three-dimensional tree language.

Let us briefly sketch the two-dimensional yield of a three-dimensional tree. Let (T³, λ) be a three-dimensional tree. A node p ∈ T³ is an internal node iff p ≠ ε (p is not the root) and there is a p′ with p ◁₃ p′ (p has an immediate successor in the 3rd dimension). For an internal node we define a fold-in operation that replaces the node by the subtree it roots. Consider the set S of immediate successors of p. By definition it is a two-dimensional tree domain. We demand it to have a foot node, i.e., a distinguished node f ∈ S that has no immediate successors in the second dimension. The operation replaces p by S such that the immediate successors of p in the second dimension become the immediate successors of f in the second dimension.

After this short sketch of the two-dimensional yield of a three-dimensional tree we can now state the main theorem of this section. It provides a logical characterisation of the tree languages definable by MLCFTGs.

Theorem 4. A tree language is generable by a monadic linear context-free tree grammar iff it is the two-dimensional yield of an MSO-definable three-dimensional tree language.

Proof. Rogers [11] showed in Theorems 5 and 13 that a tree language is generable by a non-strict TAG iff it is the two-dimensional yield of an MSO-definable three-dimensional tree language. The theorem is an immediate consequence of Rogers' result and our Theorem 3.
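The fold-in operation can be sketched as follows (our own encoding, not from the paper): each node optionally carries the two-dimensional tree formed by its third-dimension successors, and taking the yield splices that tree in, re-attaching the node's second-dimension successors beneath its foot.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A node of a three-dimensional tree, viewed two-dimensionally:
    children are the 2nd-dimension successors; expansion (if any) is
    the 2d tree formed by the 3rd-dimension successors."""
    label: str
    children: List["Node"] = field(default_factory=list)
    is_foot: bool = False
    expansion: Optional["Node"] = None

def foot_of(t):
    """The unique foot node of t (assumed to exist)."""
    if t.is_foot:
        return t
    for c in t.children:
        f = foot_of(c)
        if f:
            return f
    return None

def yield2(t):
    """Two-dimensional yield: fold every internal node into the tree of
    its 3rd-dimension successors, hanging the node's 2nd-dimension
    successors beneath that tree's foot."""
    kids = [yield2(c) for c in t.children]
    if t.expansion is None:
        return Node(t.label, kids, t.is_foot)
    folded = yield2(t.expansion)
    foot_of(folded).children = kids
    return folded
```

This mirrors TAG adjunction: folding a node in behaves like adjoining the tree of its third-dimension successors at that node.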
6 Conclusion
We showed that non-strict TAGs and monadic linear CFTGs are strongly equivalent, thereby showing an old intuition about TAGs to be true (at least for non-strict ones). The strong equivalence result yields a new logical characterisation of the expressive power of monadic linear CFTGs: a tree language is definable by an MLCFTG iff it is the two-dimensional yield of an MSO-definable three-dimensional tree language. It is known that there is a whole family of mildly context-sensitive grammar formalisms that all turned out to be weakly equivalent. It would be interesting to compare their relative expressive powers in terms of tree languages, because, in the end, linguists are interested in linguistic analyses, i.e., tree languages, and not so much in unanalysed utterances. For string-based formalisms, the notion of strong generative capacity has to be extended along the lines proposed by Miller [8]. The current paper is one step in a program of comparing the strong generative capacity of mildly context-sensitive grammar formalisms.
References

1. Engelfriet, J., Schmidt, E.M.: IO and OI. I. Journal of Computer and System Sciences 15(3), 328–353 (1977)
2. Fujiyoshi, A.: Linearity and nondeletion on monadic context-free tree grammars. Information Processing Letters 93(3), 103–107 (2005)
3. Fujiyoshi, A., Kasai, T.: Spinal-formed context-free tree grammars. Theory of Computing Systems 33(1), 59–83 (2000)
4. Gécseg, F., Steinby, M.: Tree languages. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, vol. 3: Beyond Words, pp. 1–68. Springer, Heidelberg (1997)
5. Joshi, A., Levy, L.S., Takahashi, M.: Tree adjunct grammars. Journal of Computer and System Sciences 10(1), 136–163 (1975)
6. Joshi, A., Schabes, Y.: Tree adjoining grammars. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, vol. 3: Beyond Words, pp. 69–123. Springer, Berlin (1997)
7. Lang, B.: Recognition can be harder than parsing. Computational Intelligence 10, 486–494 (1994)
8. Miller, P.H.: Strong Generative Capacity: The Semantics of Linguistic Formalism. CSLI Publications, Stanford (1999)
9. Mönnich, U.: Adjunction as substitution. In: Kruijff, G.J., Morrill, G., Oehrle, R. (eds.) Formal Grammar 1997, pp. 169–178 (1997)
10. Rogers, J.: A Descriptive Approach to Language-Theoretic Complexity. CSLI Publications, Stanford (1998)
11. Rogers, J.: wMSO theories as grammar formalisms. Theoretical Computer Science 293(2), 291–320 (2003)
A Formal Foundation for A and A-bar Movement

Gregory M. Kobele

University of Chicago
[email protected]

Abstract. It seems a fact that movement dependencies come in two flavours: “A” and “A-bar”. Over the years, a number of apparently independent properties have been shown to cluster together around this distinction. However, the basic structural property relating these two kinds of movement, the ban on improper movement (‘once you go bar, you never go back’), has never been given a satisfactory explanation. Here, I propose a timing-based account of the A/A-bar distinction, which derives the ban on improper movement, and allows for a simple and elegant account of some of their differences. In this account, “A” dependencies are those which are entered into before an expression is first merged into a structure, and “A-bar” dependencies are those an expression enters into after having been merged. The resulting system is mildly context-sensitive, providing therefore a restrictive account of possible human grammars, while remaining expressive enough to describe the kinds of dependencies which are thought to be manifest.
It is common to describe the syntax of natural language in terms of expressions being related to multiple others, or moved from one position to another. Since [25], much effort has been put into determining the limitations on possible movements. A descriptively important step was taken by classifying movement dependencies into two basic kinds: those formed by the rule move NP, and those formed by move wh-phrase [7]. This bipartition of movement dependencies is a formal rendering of the observation that wh-movement, topicalization, and comparative constructions seem to have something in common, that they do not share with passive and raising constructions, which in turn have their own particular similarities. Whereas syntactic theories such as Head-Driven Phrase Structure Grammar [22] and Lexical-Functional Grammar [5] have continued to cash out this intuitive distinction between dependency types formally, in terms of a distinction between lexical operations and properly syntactic operations, this distinction has no formal counterpart in theories under the minimalist rubric [9]. This theoretical lacuna has led some [27] to explore the hypothesis that this perceived distinction between movement types is not an actual one; i.e. that the differences between Wh-construction types on the one hand and passive construction types on the other are not due to differences in the kinds of dependencies involved. A problem besetting those minimalists eager to maintain the traditional perspective on the difference between wh- and NP-movement
dependencies, is that there is no principled distinction between long-distance dependency types available in the theory; a theory with one kind of long-distance dependency does not lie well on the procrustean bed of one with two. The contribution of this paper is to provide a non-ad-hoc minimalist theory with two kinds of movement dependencies, which have just the kind of properties that have become standardly associated with move NP- and move wh-phrase-related phenomena, respectively. It is important to note that it is not a particular analysis of a particular language which I will claim has these properties, but rather the theoretical framework itself. Once we are in possession of a theoretical framework in which we have two movement-dependency-forming operations that interact in the appropriate way, we are in a position to determine whether the old intuitions about movement dependencies coming in two types were right; we can compare the relative elegance of analyses written in one framework to those written in the other. In §1 I describe the kinds of properties which are constitutive of the empirical basis for the bifurcation of movement into two types. Recent minimalist accounts of some of these properties [16,4] form the conceptual background of my own proposal, developed in §2. The formal architecture of minimalist grammars [28] is presented in §2.1, and it is extended in §2.2 in accord with my proposal. In §2.3, I present an analysis of passivization in English (drawing on the smuggling account proposed in [10]) written within the framework of §2.2.
1 On A and A-bar Movements
Many differences between NP and wh-phrase movement have been suggested in the literature, such as whether they license parasitic gaps, whether they can move out of tensed clauses, whether they bar the application of certain morpho-phonological processes, and whether they incur crossover violations (see e.g. [18]). (These questions uniformly receive a negative answer with respect to NP movements, and a positive one with respect to wh-phrase movements.) A perusal of these properties makes clear that they are highly construction- and analysis-specific. In other words, a theoretical framework cannot derive these differences between NP and wh-phrase movement simpliciter, but may at most derive them relative to particular analyses of these constructions. The only analysis-independent property of NP and wh-phrase movement types is the so-called ‘ban on improper movement’, which states that NP movement of an expression may not follow its movement as a wh-phrase. This relational property of NP (henceforth: ‘A’) and wh-phrase (henceforth: ‘A-bar’) movements is widely accepted, and was motivated by the desire to rule out sentences such as (1) below.

(1) *[S John seems [S t [S t wanted to sleep ]]]

In (1), the first movement (to SPEC-S) is an A-bar movement, and the second (to the matrix clause subject position) an A movement. The unacceptability of (1) contrasts with the well-formed (2) below, which one can interpret as suggesting that it is the second movement in (1), from SPEC-S to the subject position in the matrix clause, which leads to the deviance of (1).
(2) [S Who does [S Mary believe [S t [S t wanted to sleep ]]]]

In the government and binding (GB) framework (as described in [8]) and the minimalist program (MP) (as in [9]), the ban on improper movement must simply be stated as such; movement from an A position may target an A-bar position, but movement from an A-bar position may only target other A-bar positions (see [21] for a particularly articulated view). In LFG and HPSG, where A movements are taken to be resolved lexically and A-bar movements resolved grammatically, the ban on improper movement follows from the architecture (grammatically complex expressions are simply not the kinds of things that lexical processes apply to). In the grammatical architecture I will develop in §2, A movements are those which occur before, and A-bar movements those which occur after, an expression has been first merged into a structure. The ban on improper movement is then just a simple consequence of the structure of derivations.

Strictly speaking, the ban on improper movement is the only property of movement types which a grammatical framework can be said to derive. However, one of the more analysis-specific properties of movement types listed above will be shown to follow naturally from the architecture of the system in §2: A and A-bar movements differ systematically as to whether they create new binding possibilities.1

1 This is called ‘crossover’ in the literature [23]. Strong crossover is when the bound expression c-commands the source position of the movement, and weak crossover is when the bound expression is properly contained in such a c-commanding phrase. Weak crossover violations have been argued to be ameliorable under certain conditions [17].

Consider sentences (3) and (4) below. In (3), the reflexive pronoun himself cannot be bound by the quantified noun phrase every boy, whereas in (4), after movement, it can.

(3) *It seems to himself that every boy is wonderful.
(4) Every boy seems to himself to be wonderful.

This situation contrasts with the one laid out in (5) below, where we see that a wh-moved expression is not able to bind the reflexive pronoun. Sentence (6) shows that it is indeed the failed attempt at binding that results in the ungrammaticality of (5), as this movement is otherwise fine.

(5) *Which boy does it seem to himself that Mary loves?
(6) Which boy does it seem that Mary loves?

The difference between these movement types can be summed up in the following diagram, with A movement of XP being able, and A-bar movement of XP being unable, to bind the pronoun pro:

XP . . . [ . . . pro . . . t . . . ]

Attempts to account for these phenomena have been numerous in the GB framework, and have continued into the MP (see [26] for an accessible typology). One option is to rule out rebinding by A-bar movements by denying the
ability to bind pronouns from A-bar positions, and another is to require that no closer potential binders may intervene between an A-bar trace and its antecedent. Given the framework developed below, we are in a position to stipulate that an expression may bind only those expressions that it c-commands when first merged (in other words, that binding is determined by c-command in the derivation tree).
2 Trace Deletion, Derivationally
Without a formal apparatus on which to hang an account of differences between movement types, researchers in the GB tradition have attempted to capture the difference between A and A-bar movements in terms of properties of source positions: traces. It was discovered that, under a certain network of assumptions, A-bar traces behaved as R-expressions, and A traces as anaphors. In the MP, it has been suggested for diverse reasons that A-bar traces should be treated formally as copies of the moved expression, while A traces should be treated formally as unstructured objects [11,16]. This is the idea upon which this paper builds. But currently there is nothing more than arbitrary stipulation (why are some traces copies, and others not? why does the ban on improper movement hold?). To excavate the idea, we should get clear on what, exactly, traces are (for). In mainstream minimalism, movement chains are licensed derivationally: only well-formed chains are built in the first place. Therefore, traces are not needed for evaluating the well-formedness of a syntactic representation (their role in government and binding theory). Instead, traces (qua copies) play a role primarily at the interfaces, in particular the syntax-semantics interface, where they determine the positions in which an expression may take scope, as per [9]. The distinction between structured and unstructured traces (i.e. copies versus ‘traditional’ traces) is intended to indicate the possibility or not of reconstruction (with expressions being reconstructible into structured trace positions, but not into unstructured trace positions). A ‘copy’ indicates that an expression is present at a particular location in the structure for the purposes of reconstruction, while (unstructured) traces indicate that it is not. The intuition is simply that an expression may be interpreted in any position in which it is present; it is present in its A-bar positions, but not (necessarily) in its A positions. This is easier to understand if we think not about derived structures, but about the derivation itself: talk of ‘copies’ versus ‘traces’ is recast in terms of whether (copies) or not (traces) the object which is entering into these various dependencies is already present in the derivation at the time the dependency in question is entered into. The basic formal idea behind this intuition is to incorporate both transformations and ‘slash-feature percolation’ [13] into a single formalism. Then we may have derivations involving slash features, in which the object entering into the dependency in question is not present at the time the dependency is established:
1. [V write]
2. [VP/DP was written]

In addition, we may have derivations using transformations, in which the object entering into the dependency in question is present at the time the dependency is established:

1. [S that [S book was written]]
2. [N book [S that [S t was written]]]

The present derivational reconstruction of the representational traces-versus-copies account of the A/A-bar distinction has the distinct advantage of giving a unified and intuitive account of various properties of A and A-bar movement. In particular, the ban on improper movement is forced upon us in this timing-based perspective on long-distance dependency satisfaction. In the next section I show how to incarnate this derivational perspective on A and A-bar movement in a formal system. In so doing we gain a better understanding not only of the mechanisms involved, but also of the various analytical options which the mechanisms put at our disposal.
2.1 Minimalist Grammars
Minimalist grammars [28] provide a formal framework within which the ideas of researchers working within the minimalist program can be rigorously explored. A minimalist grammar is given by a four-tuple ⟨V, Cat, Lex, F⟩, where

– V, the alphabet, is a finite set
– Cat, the set of features, is the union of the following pair of disjoint sets:
  • sel × Bool, where for
    ∗ ⟨x, 0⟩ ∈ sel × Bool, we write =x, and call it a selector feature
    ∗ ⟨x, 1⟩ ∈ sel × Bool, we write x, and call it a selectee feature
  • lic × Bool, where for
    ∗ ⟨y, 0⟩ ∈ lic × Bool, we write +y, and call it a licensor feature
    ∗ ⟨y, 1⟩ ∈ lic × Bool, we write -y, and call it a licensee feature
– Lex, the lexicon, is a finite set of pairs ⟨v, δ⟩, for v ∈ V ∪ {ε} and δ ∈ Cat∗
– F = {merge, move} is the set of structure building operations

Minimalist expressions are traditionally given in terms of leaf-labelled, doubly ordered (projection and precedence) binary trees. The leaves are labelled with pairs of alphabet symbols (V ∪ {ε}) and feature sequences (Cat∗). A typical expression is given in figure 1, where the precedence relation is indicated with the left-right order, and the projection relation is indicated with less-than (<) and greater-than (>) signs. The projection relation allows for the definition of the important concepts ‘head-of’ and ‘maximal projection’. Intuitively, one arrives at the leaf which is the head of a complex expression by always descending into the daughter which is least according to the projection relation. In the tree in figure 1, its head is
[Fig. 1. A minimalist expression: a binary tree with head ⟨will, +k s⟩ and further leaves ⟨Mary, -k⟩, ⟨feed, ε⟩, ⟨the, ε⟩, ⟨dog, ε⟩.]
⟨will, +k s⟩, which is also (trivially) the head of its root's left daughter. The head of the root's right daughter is ⟨feed, ε⟩. Given a tree t with head ⟨v, δ⟩, we write t[δ] to indicate that the head of t has features δ. A proper subtree t′ of a tree t is a maximal projection just in case the sister ts of t′ is such that ts < t′ in t. If t′ is a subtree of a tree t, we may write t as C[t′]; C[t′′] then refers to the tree like t, but with the subtree t′ replaced by the subtree t′′. Work by [19] has shown that the operations of merge and move can be completely supported by data structures far less structured than doubly ordered leaf-labelled binary trees.2 Accordingly, [29] provide a simplified expression type for minimalist grammars; an expression is a sequence φ0, φ1, . . . , φn, where each φi is a pair ⟨ν, δ⟩, for ν ∈ V∗ and δ ∈ Cat+. The intuition is that each φi, 1 ≤ i ≤ n, represents the phonetic yield of a moving subtree, and that φ0 represents the phonetic yield of the rest of the tree. Let t1[=xδ1] and t2[xδ2] be two minimalist trees with head features beginning with =x and x respectively. Then the result of merging together t1[=xδ1] and t2[xδ2] is shown in figure 2.3
[Fig. 2. merge(t1[=xδ1], t2[xδ2]): a tree [< t1[δ1] t2[δ2]].]
2 [15] has shown that these trees are also unnecessary for semantic interpretation.

3 The merge operation presented here is non-standard in that it only allows for merger into a complement position (i.e. the merged expression follows the expression to which it is merged). I adopt this simplification only for expository purposes; nothing important hinges on this.

From the perspective of the more concise chain-based representation, merge is broken up into two subcases, depending on whether or not the second argument
will move (i.e. whether δ2 = ε or not).

merge1(⟨ν1, =xδ1⟩, φ1, . . . , φm; ⟨ν2, x⟩, ψ1, . . . , ψn) = ⟨ν1 ν2, δ1⟩, φ1, . . . , φm, ψ1, . . . , ψn

merge2(⟨ν1, =xδ1⟩, φ1, . . . , φm; ⟨ν2, xδ2⟩, ψ1, . . . , ψn) = ⟨ν1, δ1⟩, φ1, . . . , φm, ⟨ν2, δ2⟩, ψ1, . . . , ψn

Let C[t[-yδ2]][+yδ1] be a minimalist tree with head features beginning with +y which contains a maximal (w.r.t. projection) subtree t[-yδ2] with head features beginning with -y. Then the result of applying the move operation to C[t[-yδ2]][+yδ1] is shown in figure 3 (where λ = ⟨ε, ε⟩).
[Fig. 3. move(C[t[-yδ2]][+yδ1]): a tree [> t[δ2] C[λ][δ1]].]
Turning once more to the more concise chain-based representation, move is broken up into two subcases, depending on whether or not the moving subtree will move again (i.e. whether δ2 = ε or not).

move1(⟨ν1, +yδ1⟩, φ1, . . . , ⟨ν2, -y⟩, . . . , φm) = ⟨ν2 ν1, δ1⟩, φ1, . . . , φm

move2(⟨ν1, +yδ1⟩, φ1, . . . , ⟨ν2, -yδ2⟩, . . . , φm) = ⟨ν1, δ1⟩, φ1, . . . , ⟨ν2, δ2⟩, . . . , φm

Since at least [25] it has been observed that movement cannot relate arbitrary tree positions; rather, there are constraints on which positions a moved item can be construed as originating from. The canonical constraint on movement in minimalist grammars is the SMC [28], intended to be reminiscent of the shortest move constraint of [9].4 Intuitively, the SMC demands that if an expression can move, it must move. This disallows cases in which two or more moving subexpressions ‘compete’ for the same +y feature. The SMC is implemented as a restriction on the domain of move: move(⟨ν, +yδ⟩, φ1, . . . , φm) is defined iff exactly one φi = ⟨νi, δi⟩ is such that δi begins with -y.
4 [12] investigate other constraints on movement in minimalist grammars.
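For concreteness, the chain-based operations can be rendered in Python as follows (our own sketch; expressions are lists of (string, feature-list) pairs, features are strings like "=x", "x", "+y", "-y", and the SMC check is built into move):

```python
# An expression is a list of chains; each chain is (phon, feats).

def cat(a, b):
    return (a + " " + b).strip()

def merge(e1, e2):
    (p1, f1), rest1 = e1[0], e1[1:]
    (p2, f2), rest2 = e2[0], e2[1:]
    assert f1[0].startswith("=") and f1[0][1:] == f2[0]
    if len(f2) == 1:                  # merge1: second argument is finished
        return [(cat(p1, p2), f1[1:])] + rest1 + rest2
    else:                             # merge2: second argument will move on
        return [(p1, f1[1:])] + rest1 + [(p2, f2[1:])] + rest2

def move(e):
    (p0, f0), rest = e[0], e[1:]
    assert f0[0].startswith("+")
    y = "-" + f0[0][1:]
    hits = [i for i, (_, f) in enumerate(rest) if f[0] == y]
    assert len(hits) == 1             # the SMC: exactly one candidate mover
    i = hits[0]
    p, f = rest[i]
    if len(f) == 1:                   # move1: the mover has finished moving
        return [(cat(p, p0), f0[1:])] + rest[:i] + rest[i + 1:]
    else:                             # move2: the mover will move again
        return [(p0, f0[1:])] + rest[:i] + [(p, f[1:])] + rest[i + 1:]
```

For example, move([("will sleep", ["+k", "s"]), ("Mary", ["-k"])]) returns [("Mary will sleep", ["s"])].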
2.2 Trace Deletion in Minimalist Grammars
Minimalist grammars as presented above allow only for ‘derivational copies’; an expression is present in the derivation at every point at which it enters into a syntactic dependency. This is because we first merge an expression into the derivation, and then satisfy further dependencies by moving it around. In order to allow for ‘derivational traces’, we need an expression to start satisfying dependencies before it is part of the derivation. The mechanism adopted to allow this somewhat paradoxical-sounding state of affairs bears strong similarities to ‘slash feature percolation’ in GPSG, as well as to hypothetical reasoning in logic. The intuition is that we will allow ourselves, upon encountering an expression t[⟨z, 0⟩δ1], to assume the existence of an expression with a matching feature ⟨z, 1⟩. This allows us to continue the derivation as if we had successfully checked the first feature of t. However, assumptions, like other forms of credit, must eventually be paid back. This takes here the form of inserting an expression which actually has the features we had theretofore assumed we had, discharging the assumptions. To implement hypothetical reasoning, we introduce another pair of operations, assume and discharge. Informally, assume eliminates features of an expression, and keeps a record of the features so eliminated. An example is given in figure 4, where td represents the information that a d feature was hypothesized.
[Fig. 4. assume(smile, =d v): a tree [< ⟨smile, v⟩ td].]
To eliminate assumptions, we introduce the discharge operation, which ‘merges’ two expressions together, using the second to satisfy en masse some of the features previously eliminated via assume in the first. An example is shown in figure 5, where the dotted lines indicate the checking relationships between the connected features.

[Fig. 5. An application of discharge, involving Mary and ⟨will, +k s⟩; dotted lines indicate the checking relationships.]
5 =>x is a variant of =x, one which in addition triggers movement of the selected phrase's head. For details, see [28,20,15].
ix. ε, =>V =d v
x. book, n

Lexical item viii allows the object to check its case (-k) within the extended projection of the verb (the head movement is to get the word order right). It is optional, as passivization requires the object to check its case outside of the verb phrase in English. Lexical item ix is the head which selects the external argument of the verb phrase, and changes the category of the verbal projection to ‘little v’ (v). The sentence the man will write the book has the following derivation.

1. assume(vii): write, V, d, -k
2. merge(viii,1): write, +k V, d, -k
3. merge(i,x): the book, d -k
4. discharge(2,3): the book write, V
5. merge(ix,4): write the book, =d v
6. assume(5): write the book, v, d, -k
7. merge(v,6): will write the book, +k s, d, -k
8. merge(i,vi): the man, d -k
9. discharge(7,8): the man will write the book, s

To derive passive sentences, we require the following five lexical items.

xi. -en, =>V V -part
xii. ε, =v +k x
xiii. by, =x +part pass
xiv. ε, =v +part pass
xv. be, =pass v
Using these lexical items, we may derive the sentence The book will be written by the man in the following manner.

1. assume(ix): ε, =d v, V, -part
2. assume(1): ε, v, d, -k, V, -part
3. merge(xii,2): ε, +k x, d, -k, V, -part
4. merge(i,vi): the man, d -k
5. discharge(3,4): the man, x, V, -part
6. merge(xiii,5): by the man, +part pass, V, -part
7. assume(vii): write, V, d, -k
8. merge(xi,7): written, V -part, d, -k
9. discharge(6,8): written by the man, pass, d, -k
10. merge(xv,9): be written by the man, v, d, -k
11. merge(v,10): will be written by the man, +k s, d, -k
12. merge(i,x): the book, d -k
13. discharge(11,12): the book will be written by the man, s

Note that if expression 8 were merged directly in step 1 instead of having been assumed in 1 and then discharged in 9, the derivation would have crashed at step 5, as the man and (the hypothesis of) the book would have been competing for the same position. We may derive the relative clause book that will be written by the man by first recycling steps 1–11 of the previous derivation, and then continuing in the following manner.

12. merge(ii,x): book, d -k -rel
13. discharge(11,12): will be written by the man, s, book, -rel
14. merge(iii,13): that will be written by the man, +rel n, book, -rel
15. move(14): book that will be written by the man, n
3 Conclusions
I have demonstrated how a straightforward modification of the minimalist grammar framework yields a formal architecture for the description of language in which there exist two kinds of movement dependencies, which obey the ban on improper movement. It is simple and natural to connect the structure of movement chains to semantic interpretation in a way that derives the crossover differences between A and A-bar movement.7 Further semantic asymmetries, such as the hypothesis that A movement systematically prohibits reconstruction while A-bar movement does not [9,16], are also easily incorporable.8 One major difference between the hybrid minimalist framework presented here and the intuitive conception of A and A-bar movement present in the GB/MP literature is the lack of a decision procedure for determining when an expression stops A-moving and starts A-bar-moving. (This is related to the fact that this decision procedure in the GB/MP relies on a complex network of assumptions which in the MG framework are non-logical, such as universal clausal structure, and cross-linguistic identity of features.) As mentioned in footnote 8, this freedom allows us to pursue hypotheses about language structure that we otherwise could not. It remains to be seen whether these novel hypothesis types prove enlightening, as it does for the broader A/A-bar distinction.
7 An expression may bind all and only those expressions it c-commands in the derivation tree. Note that this is a formal rendering of what [6] calls ‘Reinhart's Generalization’ (after [24]): pronoun binding can only take place from a c-commanding A position.

8 Two possibilities suggest themselves. First, we might allow an expression to reconstruct into any position through which it has moved (i.e. between where it is first merged/discharged and where it is last moved) [16]. Another option is to abandon the idea that there are dedicated A and A-bar positions, and force an expression to be interpreted exactly where it is first merged/discharged.

References

1. Amblard, M.: Calculs de représentations sémantiques et syntaxe générative: les grammaires minimalistes catégorielles. Ph.D. thesis, Université Bordeaux I (2007)
2. Bhatt, R.: The raising analysis of relative clauses: Evidence from adjectival modification. Natural Language Semantics 10(1), 43–90 (2002)
3. Bianchi, V.: The raising analysis of relative clauses: A reply to Borsley. Linguistic Inquiry 31(1), 123–140 (2000)
4. Boeckx, C.A.: A note on contraction. Linguistic Inquiry 31(2), 357–366 (2000)
5. Bresnan, J.: Lexical-Functional Syntax. Blackwell, Oxford (2001)
6. Büring, D.: Crossover situations. Natural Language Semantics 12(1), 23–62 (2004)
7. Chomsky, N.: On Wh-movement. In: Culicover, P.W., Wasow, T., Akmajian, A. (eds.) Formal Syntax, pp. 71–132. Academic Press, New York (1977)
8. Chomsky, N.: Lectures on Government and Binding. Foris, Dordrecht (1981)
9. Chomsky, N.: The Minimalist Program. MIT Press, Cambridge (1995)
10. Collins, C.: A smuggling approach to the passive in English. Syntax 8(2), 81–120 (2005)
11. Fox, D.: Reconstruction, binding theory, and the interpretation of chains. Linguistic Inquiry 30(2), 157–196 (1999)
12. Gärtner, H.M., Michaelis, J.: A note on the complexity of constraint interaction: Locality conditions and minimalist grammars. In: Blache, P., Stabler, E.P., Busquets, J.V., Moot, R. (eds.) LACL 2005. LNCS (LNAI), vol. 3492, pp. 114–130. Springer, Heidelberg (2005)
13. Gazdar, G.: Unbounded dependencies and coordinate structure. Linguistic Inquiry 12(2), 155–184 (1981)
14. Kayne, R.: The Antisymmetry of Syntax. MIT Press, Cambridge (1994)
15. Kobele, G.M.: Generating Copies: An investigation into structural identity in language and grammar. Ph.D. thesis, University of California, Los Angeles (2006)
16. Lasnik, H.: Chains of arguments. In: Epstein, S.D., Hornstein, N. (eds.) Working Minimalism, Current Studies in Linguistics, vol. 32, pp. 189–215. MIT Press, Cambridge (1999)
17. Lasnik, H., Stowell, T.: Weakest crossover. Linguistic Inquiry 22, 687–720 (1991)
18. Mahajan, A.: The A/A-bar Distinction and Movement Theory. Ph.D. thesis, Massachusetts Institute of Technology (1990)
19. Michaelis, J.: On Formal Properties of Minimalist Grammars. Ph.D. thesis, Universität Potsdam (2001)
20. Michaelis, J.: Notes on the complexity of complex heads in a minimalist grammar. In: Proceedings of the Sixth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+6), Venezia (2002)
21. Müller, G., Sternefeld, W.: Improper movement and unambiguous binding. Linguistic Inquiry 24(3), 461–507 (1993)
22. Pollard, C.J., Sag, I.A.: Head-Driven Phrase Structure Grammar. University of Chicago Press, Chicago (1994)
23. Postal, P.M.: Cross-over phenomena. Holt, Rinehart & Winston, New York (1971)
24. Reinhart, T.: Anaphora and Semantic Interpretation. University of Chicago Press, Chicago (1983)
25. Ross, J.R.: Constraints on Variables in Syntax. Ph.D. thesis, Massachusetts Institute of Technology (1967); published in 1986 as Infinite Syntax, Ablex
26. Ruys, E.G.: Weak crossover as a scope phenomenon. Linguistic Inquiry 31(3), 513–539 (2000)
27. Sportiche, D.: Reconstruction, binding and scope. Ms., UCLA (2005)
28. Stabler, E.P.: Derivational minimalism. In: Retoré, C. (ed.) LACL 1996. LNCS (LNAI), vol. 1328, pp. 68–95. Springer, Heidelberg (1997)
29. Stabler, E.P., Keenan, E.L.: Structural similarity within and among languages. Theoretical Computer Science 293, 345–363 (2003)
30. de Vries, M.: The Syntax of Relativization. Ph.D. thesis, Universiteit van Amsterdam (2002)
Without Remnant Movement, MGs Are Context-Free

Gregory M. Kobele

University of Chicago
[email protected]

Abstract. Minimalist grammars offer a formal perspective on a popular linguistic theory, and are comparable in weak generative capacity to other mildly context-sensitive formalisms. Minimalist grammars allow for the straightforward definition of so-called remnant movement constructions, which have found use in many linguistic analyses. It has been conjectured that the ability to generate this kind of configuration is crucial to the super-context-free expressivity of minimalist grammars. This conjecture is here proven.
In the minimalist program of [2], the well-formedness conditions on movement-type dependencies of the previous GB theory [1] are reimplemented derivationally, so as to render ill-formed movement chains impossible to assemble. For example, the c-command restriction on adjacent chain links is enforced by making movement always target the root of the current subtree, a position c-commanding any other. One advantage of this derivational reformulation of chain well-formedness conditions is that so-called ‘remnant movement’ configurations, as depicted on the left in figure 1, are easy to generate. Remnant movement occurs when, due to previous movement operations, a moving expression does not itself have a grammatical description. Here we imagine that the objects derivable by the grammar in figure 1 include the black triangle and the complex of white and black triangles, but not the white triangle to the exclusion of the black triangle. From an incremental bottom-up perspective, the structure on the left in figure 1 first involves moving the grammatically specifiable black triangle, but then the non-directly grammatically describable white triangle moves. This is to be contrasted with the superficially similar configuration on the right in figure 1, in which, again from an incremental bottom-up perspective, both
[Fig. 1. Remnant Movement (left) vs Non-Remnant Movement (right).]
movement steps are of grammatically specifiable objects (the first step (here, the dotted line) involves movement of the complex of white and black triangles, and the second step (the solid line) involves movement of the black triangle). In particular, the dependencies generated by remnant movement are ‘crossing’, while those of the permissible type are nested (in the intuitive sense made evident in the figure). The formalism of Minimalist Grammars (MGs) [20] was shown in [15] to be mildly context-sensitive (see also [10]). The MGs constructed in the proof use massive remnant movement to derive the non-context-free patterns, inviting the question as to whether this is necessary. Here we show that it is. MGs without remnant movement derive all and only the context-free languages. This result holds even when the SMC (a canonical constraint on movement, see [7]) is relaxed in such a way as to render the set of well-formed derivation trees non-regular. In this case, the standard proof [15] that MGs are mildly context-sensitive no longer goes through.
1 Mathematical Preliminaries
We assume familiarity with basic concepts of formal language theory. We write 2^A for the power set of a set A, and, for f : A → B a partial function, dom(f) denotes the subset of A on which f is defined. Given a set Σ, Σ∗ denotes the set of all finite sequences of elements from Σ, including the empty sequence ε. Σ+ is the set of all finite sequences over Σ of length greater than 0. For u, v ∈ Σ∗, u · v is their concatenation; often we will simply indicate concatenation via juxtaposition.

A ranked alphabet is a set Σ together with a function rank : Σ → N assigning to each ‘function symbol’ in Σ a natural number indicating the arity of the function it denotes. If Σ is a ranked alphabet, we write Σi for the set {σ ∈ Σ : rank(σ) = i}. If σ ∈ Σi, we write σ(i) to indicate this fact. Let Σ be a ranked alphabet; the set of terms over Σ is written TΣ, and is defined to be the smallest set containing each σ ∈ Σ0 and, for each σ ∈ Σn and t1, . . . , tn ∈ TΣ, the term σ(t1, . . . , tn). For X any set and Σ a ranked alphabet, Σ ∪ X is also a ranked alphabet, where (Σ ∪ X)0 = Σ0 ∪ X, and (Σ ∪ X)i = Σi for all i > 0. We write TΣ(X) for TΣ∪X. A unary context over Σ is C ∈ TΣ({x}) such that x occurs exactly once in C. Given a unary context C and a term t, we write C[t] to denote the result of substituting t for x in C (x[t] = t, σ(t1, . . . , tn)[t] = σ(t1[t], . . . , tn[t])).

A bottom-up tree automaton is given by a quadruple ⟨Q, Σ, →, QF⟩, where Q is a finite set of states, QF ⊆ Q is the set of final states, Σ is a ranked alphabet, and → ⊂fin Σ × Q∗ × Q is the transition relation, written σ, q1 · · · qn → q. A bottom-up tree automaton defines a relation ⇒ ⊆ TΣ(Q) × TΣ(Q): if C is a unary context over Σ ∪ Q, and σ(n), q1, . . . , qn → q, then C[σ(q1, . . . , qn)] ⇒ C[q]. The tree language accepted by a bottom-up tree automaton A is defined as L(A) = {t ∈ TΣ : ∃q ∈ QF. t ⇒∗ q}. A set of trees is regular iff it is the language accepted by some bottom-up tree automaton.
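As an illustration (ours, not the paper's), acceptance by a bottom-up tree automaton can be computed by collecting, bottom-up, the set of states reachable at each node:

```python
from itertools import product

def reachable(tree, trans):
    """States reachable at the root of tree = (symbol, [subtrees]).

    trans: dict mapping (symbol, tuple_of_states) -> set of states
    """
    sym, kids = tree
    kid_state_sets = [reachable(k, trans) for k in kids]
    states = set()
    for qs in product(*kid_state_sets):
        states |= trans.get((sym, qs), set())
    return states

def accepts(tree, trans, finals):
    return bool(reachable(tree, trans) & set(finals))

# Example: trees over f(2), a(0) with an even number of a-leaves.
trans = {
    ("a", ()): {"odd"},
    ("f", ("odd", "odd")): {"even"},
    ("f", ("even", "even")): {"even"},
    ("f", ("odd", "even")): {"odd"},
    ("f", ("even", "odd")): {"odd"},
}
assert accepts(("f", [("a", []), ("a", [])]), trans, {"even"})
```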
2 Minimalist Grammars
We use the notation of [22]. An MG over an alphabet Σ is a triple G = ⟨Lex, sel, lic⟩, where sel and lic are finite non-empty sets (of ‘selection’ and ‘licensing’ feature types), and, for F = {=s, s : s ∈ sel} ∪ {+l, -l : l ∈ lic}, Lex ⊂fin Σ∗ × F∗. Given binary function symbols Δ2 := {mrg1, mrg2, mrg3} and unary Δ1 := {mv1, mv2}, a derivation is a term in der(G) = TΔ2∪Δ1∪Lex, where elements of Lex are treated as nullary symbols. An expression is a finite sequence φ0, φ1, . . . , φn of pairs over Σ∗ × F∗; the first component φ0 represents the yield and features of the expression (qua tree) minus any moving parts, and the remaining components represent the yields and features of the moving parts of the expression. An expression consisting of a single pair φ0 = ⟨σ, γ⟩ thus represents a tree with no moving pieces; such an expression is called a complete expression of category γ. Eval : der(G) → 2^((Σ∗×F∗)+) is a partial function mapping derivations to the sets of expressions they are derivations of. Given ℓ ∈ Lex, Eval(ℓ) = {ℓ}, and Eval(mrgi(d1, d2)) and Eval(mvi(d)) are defined as {mergei(e1, e2) : ej ∈ Eval(dj)} and {movei(e) : e ∈ Eval(d)} respectively, where the operations mergei and movei are defined below. In the following, σ, τ ∈ Σ∗, γ, δ ∈ F∗, and φi, ψj ∈ Σ∗ × F∗.

merge1(⟨σ, =cγ⟩; ⟨τ, c⟩, ψ1, . . . , ψn) = ⟨σ τ, γ⟩, ψ1, . . . , ψn   (⟨σ, =cγ⟩ ∈ Lex)

merge2(⟨σ, =cγ⟩, φ1, . . . , φm; ⟨τ, c⟩, ψ1, . . . , ψn) = ⟨τ σ, γ⟩, φ1, . . . , φm, ψ1, . . . , ψn

merge3(⟨σ, =cγ⟩, φ1, . . . , φm; ⟨τ, cδ⟩, ψ1, . . . , ψn) = ⟨σ, γ⟩, φ1, . . . , φm, ⟨τ, δ⟩, ψ1, . . . , ψn

move1(⟨σ, +cγ⟩, φ1, . . . , φi−1, ⟨τ, -c⟩, φi+1, . . . , φm) = ⟨τ σ, γ⟩, φ1, . . . , φi−1, φi+1, . . . , φm

move2(⟨σ, +cγ⟩, φ1, . . . , φi−1, ⟨τ, -cδ⟩, φi+1, . . . , φm) = ⟨σ, γ⟩, φ1, . . . , φi−1, ⟨τ, δ⟩, φi+1, . . . , φm

The SMC is a restriction on the domains of move1 and move2 which renders these relations functional:

(SMC) no φj = ⟨σj, γj⟩ is such that γj = -cγ′j unless j = i
The (string) language generated at a category c (for c ∈ sel) by an MG G is defined to be the yields of the complete expressions of category c:1 Lc(G) := {σ : ∃d ∈ der(G). ⟨σ, c⟩ ∈ Eval(d)}.
1 Implicit in [15] is the fact that for any c, domc(Eval) = {d : ∃σ. ⟨σ, c⟩ ∈ Eval(d)} is a regular tree language. This is explicitly shown in [13].
3 A Ban on Remnant Movement
In order to implement a ban on remnant movement, we want to give moving expressions a temporary island status: nothing can move out of a moving expression until it has settled down (‘please wait until the train has come to a complete stop before exiting’). Currently, an expression e = φ0, φ1, . . . , φk has the form just given, where φ0 is the ‘head’ of the expression, and the other φi are ‘moving parts’. Importantly, although we view such an expression as a compressed representation of a tree, there is no hierarchical relation among the φi. In order to implement a ban against remnant movement, we need to indicate which of the moving parts are contained in which others. We represent this information by retaining some of the relative dominance relations of the represented tree: e = φ0, T1, . . . , Tn, where each tree Ti pairs a moving part with a (possibly empty) sequence of trees (the set of trees T is the smallest set X such that X = (Σ∗ × F∗) × X∗). We interpret a tree ⟨φi, T1, . . . , Tm⟩ as a moving part (the features of which are represented by φi) which itself may contain moving subparts (T1, . . . , Tm). By allowing these moving subparts to become accessible for movement only after the features of φi have been exhausted, we rule out the crossing, remnant-movement-type dependencies.

The revised cases of the operations merge and move, PBC-merge and PBC-move,2 are given below. The function PBC-Eval interprets derivations d ∈ der(G) in ‘PBC-mode’, such that PBC-Eval(ℓ) = {ℓ} for ℓ ∈ Lex, PBC-Eval(mvi(d)) = {PBC-movei(e) : e ∈ PBC-Eval(d)}, and PBC-Eval(mrgi(d1, d2)) = {PBC-mergei(e1, e2) : ej ∈ PBC-Eval(dj)}. In the below, σ, τ are strings, γ, δ are finite sequences of syntactic features, and Si, Tj are trees of the form ⟨⟨σ, γ⟩, S1, . . . , Sn⟩.

PBC-merge1(⟨σ, =cγ⟩; ⟨τ, c⟩, T1, . . . , Tn) = ⟨σ τ, γ⟩, T1, . . . , Tn   (⟨σ, =cγ⟩ ∈ Lex)

PBC-merge2(⟨σ, =cγ⟩, S1, . . . , Sm; ⟨τ, c⟩, T1, . . . , Tn) = ⟨τ σ, γ⟩, S1, . . . , Sm, T1, . . . , Tn

PBC-merge3(⟨σ, =cγ⟩, S1, . . . , Sm; ⟨τ, cδ⟩, T1, . . . , Tn) = ⟨σ, γ⟩, S1, . . . , Sm, ⟨⟨τ, δ⟩, T1, . . . , Tn⟩

PBC-move1(⟨σ, +cγ⟩, S1, . . . , Si−1, ⟨⟨τ, -c⟩, T1, . . . , Tn⟩, Si+1, . . . , Sm) = ⟨τ σ, γ⟩, S1, . . . , Si−1, T1, . . . , Tn, Si+1, . . . , Sm

PBC-move2(⟨σ, +cγ⟩, S1, . . . , Si−1, ⟨⟨τ, -cδ⟩, T1, . . . , Tn⟩, Si+1, . . . , Sm) = ⟨σ, γ⟩, S1, . . . , Si−1, ⟨⟨τ, δ⟩, T1, . . . , Tn⟩, Si+1, . . . , Sm
2 The ‘PBC’ is named after the proper binding condition of [6], which filters out surface structures in which a trace linearly precedes its antecedent. If the antecedent of a trace left behind by a particular movement step is defined to be the element (trace or otherwise) in the target position of that movement, the present modification to the rules merge and move exactly implements the PBC in the minimalist grammar framework.
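The extra bookkeeping can be sketched as follows (our own illustration, continuing the list-of-chains encoding): moving parts are now trees, PBC-merge3 freezes the movers of its second argument beneath it, and PBC-move1 releases them.

```python
# A moving part is now a tree (phon, feats, subtrees); an expression is
# [ (phon, feats) ] + a list of such trees.

def pbc_merge3(e1, e2):
    """PBC-merge3: select a moving argument, freezing its movers under it."""
    (p1, f1), movers1 = e1[0], e1[1:]
    (p2, f2), movers2 = e2[0], e2[1:]
    assert f1[0] == "=" + f2[0] and len(f2) > 1
    return [(p1, f1[1:])] + movers1 + [(p2, f2[1:], list(movers2))]

def pbc_move1(e):
    """PBC-move1: re-integrate a finished mover; only now do its frozen
    daughter subtrees become accessible to further movement."""
    (p0, f0), movers = e[0], e[1:]
    assert f0[0].startswith("+")
    c = "-" + f0[0][1:]
    hits = [i for i, t in enumerate(movers) if t[1][0] == c]
    assert len(hits) == 1          # the PBC-SMC, checked at tree roots only
    i = hits[0]
    p, f, subs = movers[i]
    assert f == [c]                # the mover has settled down
    return ([((p + " " + p0).strip(), f0[1:])]
            + movers[:i] + list(subs) + movers[i + 1:])
```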
We will continue to require that these rules satisfy (a version of) the SMC.3 Following [12], we define the SMC over PBC-move as follows:

(PBC-SMC) no Tj = ⟨⟨σj, γj⟩, T1, . . . , Tn⟩ is such that γj = -cγ′j unless j = i

The string language generated in PBC-mode at a category c is defined as usual: L^PBC_c(G) := {σ : ∃d ∈ der(G). ⟨σ, c⟩ ∈ PBC-Eval(d)}.

Observe that the rule PBC-merge3 introduces new tree structure, temporarily freezing the moving pieces within its second argument. The rules PBC-move1 and PBC-move2 enforce that only the root of a tree is accessible to movement operations, and that its daughter subtrees become accessible to movement only once the root has finished moving. Note also that the set of well-formed derivation trees in PBC-mode (the set dom(PBC-Eval)) is not a regular tree language (this is due to the laxness of the PBC-SMC). To see this, consider the MG G1 = ⟨Lex, {x, y}, {A}⟩, where Lex contains the four lexical items below.

a::=x x -A
f::x
c::=y +A y
e::=x y

Derivations of complete expressions of category y begin by starting with f and repeatedly merging tokens of a. Then e is merged, and for each a, a c is merged and a move step occurs. In particular, although the yields of these trees form the context-free language cⁿeaⁿf, the number of mrg3 nodes must equal the number of mv1 nodes. It is straightforward to show that no finite-state tree automaton can enforce this invariant.

Our main result is that minimalist grammars under the PBC mode of derivation (i.e. using the rules just given above) generate exactly the class of context-free languages.
4 MGs with Hypotheses
Because the elimination of remnant movement guarantees that, viewed from a bottom-up perspective, we will finish moving a containing expression before we need to deal with any of its subparts, we can re-represent expressions using ‘slash features’, as familiar from GPSG [8].

3 There are two natural interpretations of the SMC on expressions e = ⟨Φ, T1, . . . , Tn⟩. First, one might require that no two φi and φj share the same first feature, regardless of how deeply embedded within trees they may be. This perspective views the tree structure as irrelevant for the statement of the SMC. Another reasonable option is to require only that no two φi and φj share the same first feature, where φi and φj are the roots of trees Ti and Tj respectively. This perspective views the tree structure of moving parts as relevant to the SMC, and allows for a kind of ‘smuggling’ [3], as described in [12]. The results of this paper are independent of which of these two interpretations of the SMC we adopt. We adopt the second, because it is more interesting (the derivation tree sets no longer constitute regular tree languages).

Accordingly, we replace (PBC-)merge3,
which introduces a to-be-moved expression, with a new (non-functional) operation, assume, which introduces a ‘slash feature’, or hypothesis. A hypothesis takes the form of a pair of feature strings ⟨δ, γ⟩. The interpretation of a hypothesis ⟨δ, γ⟩ is such that δ records the originally postulated ‘missing’ feature sequence (and thus is unchanging over the lifetime of the hypothesis), whereas γ represents the remaining features of the hypothesis, which are checked off as the derivation progresses. Move1, which re-integrates a moving part into the main expression, is replaced with another new operation, discharge. Discharge replaces ‘used up’ hypotheses with the expressions that they ‘could have been’. These expressions may themselves contain hypothesized moving pieces. Derivations of minimalist grammars with hypothetical reasoning in this sense are terms d ∈ Hyp-der(G) over the signature {mrg1(2), mrg2(2), assm(1), mv2(1), dschrg(2)} ∪ Lex, and Hyp-Eval partially maps such terms to expressions in the by now familiar manner. In the below, σ, τ are strings over Σ; γ, δ, ζ are finite sequences of syntactic features; and φi, ψj are pairs of the form ⟨δ, γ⟩.

merge1(⟨σ, =cγ⟩; ⟨τ, c⟩, ψ1, . . . , ψn) = ⟨σ τ, γ⟩, ψ1, . . . , ψn   (⟨σ, =cγ⟩ ∈ Lex)

merge2(⟨σ, =cγ⟩, φ1, . . . , φm; ⟨τ, c⟩, ψ1, . . . , ψn) = ⟨τ σ, γ⟩, φ1, . . . , φm, ψ1, . . . , ψn

assume(⟨σ, =cγ⟩, φ1, . . . , φm) = ⟨σ, γ⟩, φ1, . . . , φm, ⟨cδ, δ⟩   (for some δ ∈ F∗)

discharge(⟨σ, +cγ⟩, φ1, . . . , φi−1, ⟨δ, -c⟩, φi+1, . . . , φm; ⟨τ, δ⟩, ψ1, . . . , ψn) = ⟨τ σ, γ⟩, φ1, . . . , φi−1, ψ1, . . . , ψn, φi+1, . . . , φm

move2(⟨σ, +cγ⟩, φ1, . . . , φi−1, ⟨ζ, -cδ⟩, φi+1, . . . , φm) = ⟨σ, γ⟩, φ1, . . . , φi−1, ⟨ζ, δ⟩, φi+1, . . . , φm

We subject the operations move2 and discharge to a version of the SMC:

(Hyp-SMC) no φj = ⟨ζj, γj⟩ is such that γj = -cγ′j unless j = i
The language of a minimalist grammar G at category c using hypothetical reasoning is defined to be: $L_c^{Hyp}(G) := \{\sigma : \exists d \in Hyp\text{-}der(G).\ \langle\sigma, c\rangle \in Hyp\text{-}Eval(d)\}$. The operation discharge constrains the kinds of assumptions introduced by assume which can be part of a well-formed derivation to be those which are of the form ⟨cδ, δ⟩, where there is some lexical item ⟨σ, γcδ⟩. As there are finitely many lexical items, there are thus only finitely many useful assumptions given a particular lexicon. It will be implicitly assumed in the remainder of this paper that assume is restricted so as to generate only useful assumptions. We henceforth index assm nodes with the features of the hypotheses introduced (writing thus $assm_{c\gamma}$ for an assume operation introducing the hypothesis ⟨cγ, γ⟩).
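The five operations can be prototyped directly. The following sketch uses an illustrative encoding of my own (an expression is a triple of string, feature list, and hypothesis list, with hypotheses as pairs of feature lists); the Hyp-SMC check is omitted for brevity.

```python
# An expression is (sigma, features, hyps); a hypothesis is a pair
# (delta, remaining) of feature lists, as in the rules above.

def merge1(lex, arg):
    s, f, _ = lex                        # a lexical item <sigma, =c gamma>
    t, g, psis = arg                     # <tau, c>, psi_1 ... psi_n
    assert f[0] == '=' + g[0] and len(g) == 1
    return (s + ' ' + t, f[1:], psis)

def merge2(fun, arg):
    s, f, phis = fun                     # <sigma, =c gamma>, phi_1 ... phi_m
    t, g, psis = arg                     # <tau, c>, psi_1 ... psi_n
    assert f[0] == '=' + g[0] and len(g) == 1
    return (t + ' ' + s, f[1:], phis + psis)

def assume(fun, delta):
    s, f, phis = fun                     # <sigma, =c gamma>, phi_1 ... phi_m
    assert f[0].startswith('=')
    c = f[0][1:]
    # introduce the hypothesis <c delta, delta> in place of a merged mover
    return (s, f[1:], phis + [([c] + delta, delta)])

def discharge(fun, arg, i):
    s, f, phis = fun                     # <sigma, +c gamma>, ..., <delta, -c>, ...
    t, g, psis = arg                     # <tau, delta>, psi_1 ... psi_n
    c = f[0][1:]
    delta, rest = phis[i]
    assert f[0].startswith('+') and rest == ['-' + c] and g == delta
    return (t + ' ' + s, f[1:], phis[:i] + psis + phis[i+1:])

def move2(fun, i):
    s, f, phis = fun                     # <sigma, +c gamma>, ..., <zeta, -c delta>, ...
    c = f[0][1:]
    zeta, rest = phis[i]
    assert f[0].startswith('+') and rest[0] == '-' + c and len(rest) > 1
    return (s, f[1:], phis[:i] + [(zeta, rest[1:])] + phis[i+1:])
```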
Theorem 1. For any G, and any c ∈ sel_G, the set $dom_c(Hyp\text{-}Eval) = \{d : \exists\sigma.\ \langle\sigma, c\rangle \in Hyp\text{-}Eval(d)\}$ is a regular tree language.

Proof. Construct a nondeterministic bottom-up tree automaton whose states are (|lic| + 1)-tuples of pairs of suffixes of lexical feature sequences. The Hyp-SMC allows us to devote each component of such a sequence beyond the first to the (if it exists, unique) hypothesis beginning with a particular -c feature, and thus we assume to be given a fixed enumeration of lic. The remarks above guarantee that there are only a finite number of such states needed. Given an expression φ₀, φ₁, …, φₙ, the state representing it has as its ith component the pair ⟨ε, ε⟩ if there is no φⱼ beginning with the ith -c feature, and the unique φⱼ beginning with the ith -c feature otherwise. The 0th component of a state is always of the form ⟨ε, γ⟩, where γ is the feature sequence of φ₀. As we are interested in derivations of complete expressions of category c, the final state is ⟨ε, c⟩, ⟨ε, ε⟩, …, ⟨ε, ε⟩. The transitions of the automaton are defined so as to preserve this invariant: at a lexical item ℓ = ⟨σ, γ⟩, the automaton enters the state ⟨ε, γ⟩, ⟨ε, ε⟩, …, ⟨ε, ε⟩, and at an internal node $\sigma^{(n)}(q_1, \ldots, q_n)$, the automaton enters the state q just in case there are expressions e₁, …, eₙ represented by states q₁, …, qₙ which are mapped by the operation denoted by σ to an expression e represented by state q.

We use the facts that linear homomorphisms preserve recognizability and that the yield of a recognizable set of trees is context-free [4] in conjunction with theorem 1 to show that minimalist grammars using hypothetical reasoning define exactly the context-free languages.

Theorem 2. For any G, and any c ∈ sel_G, $L_c^{Hyp}(G)$ is context-free.

Proof. Let G and c be given. By theorem 1, $D = dom_c(Hyp\text{-}Eval)$ is recognizable. Let $E = f[D]$, where f is the homomorphism defined as follows (f maps nullary symbols to themselves):

$$f(\sigma(e_1, \ldots, e_n)) = \begin{cases} \sigma(f(e_2), f(e_1)) & \text{if } \sigma \in \{mrg_2, dschrg\} \\ \sigma(f(e_1), \ldots, f(e_n)) & \text{otherwise} \end{cases}$$

Inspection of f reveals that it is merely putting sister subtrees in the order in which they are pronounced (à la Hyp-Eval) and thus, for any d ∈ D, Hyp-Eval(d) contains ⟨σ, c⟩ iff yield(f(d)) = σ. As f is linear, E is recognizable, and thus $yield(E) = L_c^{Hyp}(G)$ is context-free.
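The homomorphism f and the yield function admit a very direct rendering; in the sketch below, derivation trees are encoded as (label, children) pairs and leaf labels stand for lexical printnames (this encoding is illustrative, not the paper's).

```python
def f(tree):
    """The homomorphism of theorem 2: swap the two daughters at mrg2 and
    dschrg nodes, since there the non-head is pronounced first."""
    label, children = tree
    if label in ('mrg2', 'dschrg'):
        e1, e2 = children
        return (label, [f(e2), f(e1)])
    return (label, [f(c) for c in children])

def tree_yield(tree):
    """Concatenate leaf labels left to right; after applying f, this
    spells out the derived string."""
    label, children = tree
    if not children:                      # a lexical item: its printname
        return [label]
    return [w for c in children for w in tree_yield(c)]
```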
5 Relating the PBC to Hypothetical Reasoning
To show that minimalist grammars in PBC mode are equivalent to minimalist grammars with hypothetical reasoning we will exhibit an Eval-preserving bijection between complete derivation trees of both formalisms.⁴ The gist of the
⁴ A complete derivation tree is just one which is the derivation of a complete expression. I will in the following use the term in conjunction with derivations in der(G) to refer exclusively to expressions derived in PBC-mode.
[Figure omitted: two derivation trees over G1, one in PBC mode (built from mrg1, mrg3, and mv1 nodes) and one using hypothetical reasoning (built from mrg1, assm_{x-A}, and dschrg nodes).]

Fig. 2. Derivations in G1 of afcace
transformation is best conveyed via an example. Consider the trees in figure 2, which are derivations over the MG G1, in PBC mode and using hypothetical reasoning respectively, of the string afcace. The dotted lines in the derivation trees in figure 2 indicate the implicit dependencies between the unary operations and other expressions in the derivation. For example, mv nodes are connected via a dotted line to the subtree which 'moves'. Similarly, assm_γ nodes are connected via a dotted line to the expression which ultimately discharges the assumption they introduced. The subtree connected via a dotted line to a mv1 node I will call the subtree targeted by that move operation; it is the subtree whose leftmost leaf introduces the -c feature checked by this move step, and it is the right child of a mrg3 node. Note that if the derivation is well-formed (i.e. is in the domain of PBC-Eval) there is a unique subtree targeted by every mv1 node. The right daughter of a dschrg node connected via a dotted line to an assm_γ node is called the hypothesis discharged by that discharge operation, and is connected to the assm_γ node which introduces the hypothesis which is discharged at its parent node. Again, if the derivation is well-formed, the assm_γ node in question is unique. Note, however, that it is only in the case of complete derivation trees that to every assm_γ node there corresponds a hypothesis-discharging dschrg node. The major difference between PBC and Hyp MG derivations is that expressions entering into multiple feature checking relationships during the course of the derivation are introduced into the derivation at the point their first feature checking relationship takes place in the case of PBC (and MG derivations more generally), and at the point their last feature checking relationship obtains in the case of Hyp MGs. The relation between the two trees in figure 2, and more generally between PBC derivations and hypothetical derivations, is that the subtree connected via dotted line to mv1 becomes the second argument of dschrg, and the second argument of mrg3 becomes the subtree connected to assm via a dotted line. This is shown in figure 3.
[Figure omitted: mv1 ≈ dschrg and mrg3 ≈ assm.]

Fig. 3. Relating PBC derivations and hypothetical derivations
We define a relation Trans ⊂ der(G) × Hyp-der(G) in the following manner:

1. Trans(mv₁(d), dschrg(d₁, d₂)), where, d′ being the (unique) subtree targeted by this instance of mv₁, Trans(d, d₁) and Trans(d′, d₂)
2. Trans(mrg₃(d₁, d₂), assm_γ(d)), where PBC-Eval(d₂) = {φ₀, φ₁, …, φₙ}, φ₀ = ⟨σ, γ⟩, and Trans(d₁, d)
3. Trans(σ(d₁, …, dₙ), σ(d₁′, …, dₙ′)), where Trans(dᵢ, dᵢ′) for all 1 ≤ i ≤ n

By inspection of the above case-wise definition, it is easy to see that

Theorem 3. Trans is a function.

The point of defining Trans as per the above is to use it to show that the structural 'equivalence' sketched in figure 3 preserves relevant aspects of weak generative capacity. Expressions denoted by derivations in both 'formalisms' have been represented here as sequences. However, only the type of the first element of such sequences (a pair ⟨σ, γ⟩ ∈ Σ* × F*) is identical across formalisms (the other elements are trees whose nodes are pairs of the same type in the PBC MGs, but are pairs of feature sequences in Hyp MGs). Accordingly, the relation I will show Trans to preserve is the identity of the first element of the yield of the source and target derivation trees.

Theorem 4. For d ∈ der(G) such that {φ₀, T₁, …, Tₙ} = PBC-Eval(d), if ψ₀, ψ₁, …, ψₖ ∈ Hyp-Eval(Trans(d)), then φ₀ = ψ₀, n = k, and for 1 ≤ i ≤ n, Tᵢ = ⟨σᵢ, γᵢ⟩, T₁ⁱ, …, T_{mᵢ}ⁱ and ψᵢ = ⟨ζᵢ, γᵢ⟩.

Proof. By induction. For the base case, let d be a lexical item. PBC-Eval(d) and Hyp-Eval(d) are both equal to {d}, which by case 3 of the definition of Trans, is equal to Hyp-Eval(Trans(d)). Now let d₁ and d₂ be appropriately related to Trans(d₁) and Trans(d₂) respectively. There are five cases to consider (mrgᵢ, mvⱼ, for 1 ≤ i ≤ 3 and 1 ≤ j ≤ 2).

1. Let d = mrg₁(d₁, d₂). Then by case 3 of the definition of Trans, Trans(d) = mrg₁(Trans(d₁), Trans(d₂)). PBC-Eval(d) is defined if and only if both PBC-Eval(d₁) = {⟨σ, =cγ⟩} and PBC-Eval(d₂) = {⟨τ, c⟩, T₁, …, Tₙ}, in which case it is {⟨στ, γ⟩, T₁, …, Tₙ}. By the induction hypothesis, we conclude that Hyp-Eval(Trans(d₁)) = {⟨σ, =cγ⟩} and Hyp-Eval(Trans(d₂)) = {⟨τ, c⟩, ψ₁, …, ψₙ}, which are in the domain of merge₁, and thus, by inspection of the definition of this latter, that d and Trans(d) are appropriately related as well.
2. The case where d = mrg₂(d₁, d₂) is not interestingly different from the above.
3. Let d = mrg₃(d₁, d₂). Then by case 2 of the definition of Trans, Trans(d) = assm_γ(Trans(d₁)). As d₁ and Trans(d₁) are appropriately related (by the induction hypothesis), and as both merge₃ and assume_γ define the first component of their result to be the same as the first component of their leftmost argument minus the first feature, d and Trans(d) are appropriately related as well.
4. Let d = mv₁(d₁), and let d₂ be the unique subtree targeted by this instance of mv₁. For PBC-Eval(d) to be defined, PBC-Eval(d₁) must be equal to {⟨σ, +cγ⟩, S₁, …, S_{i−1}, ⟨τ, -c⟩, T₁, …, Tₙ, S_{i+1}, …, Sₘ}. In this case, PBC-Eval(d₂) = {⟨τ, δ-c⟩, T₁, …, Tₙ}. By the induction hypothesis, Hyp-Eval(Trans(d₁)) = {⟨σ, +cγ⟩, φ₁, …, φ_{i−1}, ⟨δ-c, -c⟩, φ_{i+1}, …, φₘ}, and in addition Hyp-Eval(Trans(d₂)) = {⟨τ, δ-c⟩, ψ₁, …, ψₙ}. Thus, we can see that the discharge operation is defined on these arguments, and is equal to {⟨τσ, γ⟩, φ₁, …, φ_{i−1}, ψ₁, …, ψₙ, φ_{i+1}, …, φₘ}. Applying the operation move₁ to PBC-Eval(d₁) we obtain the unit set consisting of the single element ⟨τσ, γ⟩, S₁, …, S_{i−1}, T₁, …, Tₙ, S_{i+1}, …, Sₘ, and thus establish that d is appropriately related to Trans(d).
5. Finally, let d = mv₂(d₁). By case 3, Trans(d) = mv₂(Trans(d₁)). If PBC-Eval(d) is defined, then there is a unique moving component Tᵢ of PBC-Eval(d₁) which has an appropriate first feature. By the induction hypothesis, there is a unique corresponding ψᵢ in Hyp-Eval(Trans(d₁)), allowing move₂(Hyp-Eval(Trans(d₁))) to be defined, and us to see that d and Trans(d) are appropriately related in this case too.

Note that whenever d ∈ der(G) is complete, so too is Trans(d) ∈ Hyp-der(G).

Corollary 1. Trans preserves completeness.

Furthermore, by inspecting the cases of the proof above, we see that the hypothesis introduced by a particular assm_γ node which is the translation of a mrg₃ node is discharged at the dschrg node which is the translation of the mv₁ node which targets the right daughter of that mrg₃ node. From theorem 4 follows

Corollary 2. For every G, and any feature sequence γ, $L_\gamma^{PBC}(G) \subseteq L_\gamma^{Hyp}(G)$.
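A direct recursive rendering of Trans follows; this is an illustrative sketch only, in which targeted_subtree and head_features are hypothetical helpers standing in for the dotted-line bookkeeping of figure 2 (the text specifies how to compute them, but they are elided here).

```python
def trans(d):
    """Translate a PBC derivation tree (label, children) into a
    hypothetical-reasoning derivation tree, per the three cases above."""
    label, children = d
    if label == 'mv1':
        (d1,) = children
        dp = targeted_subtree(d)     # hypothetical helper: the unique targeted subtree
        return ('dschrg', [trans(d1), trans(dp)])
    if label == 'mrg3':
        d1, d2 = children
        gamma = head_features(d2)    # hypothetical helper: features of PBC-Eval(d2)'s head
        return ('assm_' + gamma, [trans(d1)])
    return (label, [trans(c) for c in children])
```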
To prove the inclusion in the reverse direction, I will show that for every complete d′ ∈ Hyp-der(G) there is a d ∈ der(G) such that Trans(d) = d′. I define a function snarT which takes a pair consisting of a derivation tree d ∈ Hyp-der(G) and a set M of pairs of strings over {0, 1}* and derivation trees in der(G). We interpret a pair ⟨p, d⟩ ∈ M as stating that we are to insert tree d as a daughter of the node at address p. (Recall that in translating from Hyp MG derivation trees to PBC trees we need to 'lower' expressions introduced at a dschrg node into the position in which they were assumed.) Given a set of such pairs M, I denote by ⁽ⁱ⁾M (for i ∈ {0, 1}) the set {⟨p, d⟩ : ⟨ip, d⟩ ∈ M}. I will use this notation to keep track of where the trees in M should be inserted into the translated structure. Basically, when an item ⟨ε, d⟩ ∈ M, it indicates that it should be inserted
as a daughter of the current root. I will use the notation M(ε) to denote the unique d such that ⟨ε, d⟩ ∈ M, if one exists.

1. for d = dschrg(d₁, d₂), and p the address in d of the assm_γ node whose hypothesis is discharged at this dschrg node, we define snarT(d, M) = mv₁(snarT(d₁, ⁽⁰⁾(M ∪ {⟨p, snarT(d₂, ⁽¹⁾M)⟩})))
2. for d = assm_γ(d₁), snarT(d, M) = mrg₃(snarT(d₁, ⁽⁰⁾M), M(ε))
3. snarT(σ(d₁, …, dₙ), M) = σ(snarT(d₁, ⁽⁰⁾M), …, snarT(dₙ, ⁽ⁿ⁻¹⁾M))

Note that, although snarT is not defined on all trees in Hyp-der(G) (case 2 is undefined whenever there is no (unique) ⟨ε, d⟩ ∈ M), it is defined on all complete d ∈ Hyp-der(G).

Theorem 5. For all complete d ∈ Hyp-der(G), snarT(d, ∅) ∈ der(G).

Proof. Case 2 is the only potential problem (as it is undefined whenever M(ε) is). However, in a complete derivation tree, every assm_γ node is dominated by a dschrg node, at which is discharged the hypothesis introduced by the former. Moreover, no dschrg node discharges the hypothesis of more than one assm_γ node. Thus, we are guaranteed in a complete derivation tree that at each occurrence of an assm_γ node M(ε) is defined. That the range of snarT is contained in der(G) is verified by simple inspection of its definition.

Of course, we want not just that snarT map derivations in Hyp-der(G) to ones in der(G), but also that a derivation d in der(G) to which a complete derivation d′ in Hyp-der(G) is mapped by snarT maps back to d′ via Trans. This will allow us to conclude the converse of corollary 2.

Theorem 6. For all complete d ∈ Hyp-der(G), d = Trans(snarT(d, ∅)).

Proof. In order to have a strong enough inductive hypothesis, we need to prove something stronger than what is stated in the theorem. Let d ∈ Hyp-der(G), and M be a partial function with domain {0, 1}* and range der(G), such that p is the address of an assm_γ node in d without a corresponding dschrg node iff there is some d′ such that M(p) = d′. (In plain English, M tells us how to translate 'unbound' assm_γ nodes in d.) Then d = Trans(snarT(d, M)). Note that the statement of the theorem is a special case, as for d complete there are no unbound assm_γ nodes, and thus M can be ∅. For the base case, let d be a lexical item (and thus complete). Then by case 3 of the definition of snarT, snarT(d, ∅) = d, and by case 3 of the definition of Trans, Trans(d) = Trans(snarT(d, ∅)) = d. Now let d₁, d₂ be as per the above such that for appropriate M₁, M₂, Trans(snarT(d₁, M₁)) = d₁ and Trans(snarT(d₂, M₂)) = d₂. There are again five cases to consider.

1. Let d = mrg₁(d₁, d₂), and M an assignment of trees in der(G) to unbound assm_γ nodes in d. Then Trans(snarT(d, M)) is, by case 3 of the definition of snarT, Trans(mrg₁(snarT(d₁, ⁽⁰⁾M), snarT(d₂, ⁽¹⁾M))). By case 3 of the definition of Trans, this is easily seen to be identical to mrg₁(Trans(snarT(d₁, ⁽⁰⁾M)), Trans(snarT(d₂, ⁽¹⁾M))). As, for any i, ⁽ⁱ⁾M is an assignment of trees to unbound assm_γ nodes in dᵢ, the inductive hypothesis applies, and thus Trans(snarT(d, M)) = d, as desired.
2. The case where d = mrg₂(d₁, d₂) is not interestingly different from the above.
3. Let d = assm_γ(d₁), and let M assign trees to its unbound assm_γ nodes (in particular, M(ε) is defined). Then by case 2 of the definition of snarT, Trans(snarT(d, M)) = Trans(mrg₃(snarT(d₁, ⁽⁰⁾M), M(ε))). Now, according to case 2 of the definition of Trans, this is seen to be identical to assm_γ(Trans(snarT(d₁, ⁽⁰⁾M))), which according to our inductive hypothesis is simply assm_γ(d₁) = d.
4. Let d = dschrg(d₁, d₂), and let M assign trees in der(G) to all and only unbound assm_γ nodes in d. By case 1 of the definition of snarT, we have that Trans(snarT(d, M)) is equal to Trans(mv₁(snarT(d₁, ⁽⁰⁾(M ∪ {⟨0p, snarT(d₂, ⁽¹⁾M)⟩})))), where 0p is the address of the assm_γ node in d bound by the dschrg node at its root. Next, we apply the first case of the definition of Trans. This gives us dschrg(Trans(snarT(d₁, ⁽⁰⁾(M ∪ {⟨0p, snarT(d₂, ⁽¹⁾M)⟩}))), Trans(d′)), where d′ is the unique subtree targeted by the mv₁ node at the root of the translated expression. This is the right daughter of the mrg₃ node which the assm_γ node at position p in d₁ translates as, namely, snarT(d₂, ⁽¹⁾M). As ⁽⁰⁾M assigns trees to all unbound assm_γ nodes in d₁ except for the one at location p, ⁽⁰⁾(M ∪ {⟨0p, snarT(d₂, ⁽¹⁾M)⟩}) assigns trees to all of d₁'s unbound assm_γ nodes. Therefore, the induction hypothesis applies, and Trans(snarT(d, M)) is seen to be identical to dschrg(d₁, d₂).
5. Finally, let d = mv₂(d₁), and M an assignment of trees to unbound assm_γ nodes in d. By case 3, Trans(snarT(d, M)) = mv₂(Trans(snarT(d₁, M))), which, by the induction hypothesis, is equal to d.
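The address bookkeeping of snarT can also be made concrete. In the sketch below, M is a Python dict from address strings to already-translated trees, shift implements the ⁽ⁱ⁾M operation, and bound_assm_address is a hypothetical helper computing the address p of case 1 (the text determines it uniquely, but it is elided here).

```python
def shift(M, i):
    """The (i)M of the text: keep the pairs whose address starts with
    digit i, stripping that digit."""
    return {p[1:]: t for p, t in M.items() if p.startswith(str(i))}

def snart(d, M):
    label, children = d
    if label == 'dschrg':
        d1, d2 = children
        p = bound_assm_address(d)        # hypothetical helper: address of the
                                         # assm node discharged at this node
        M2 = dict(M)
        M2[p] = snart(d2, shift(M, 1))   # M u {<p, snarT(d2, (1)M)>}
        return ('mv1', [snart(d1, shift(M2, 0))])
    if label.startswith('assm'):
        (d1,) = children
        return ('mrg3', [snart(d1, shift(M, 0)), M['']])   # M(eps) must exist
    return (label, [snart(c, shift(M, i)) for i, c in enumerate(children)])
```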
Theorem 4 allowed us to conclude that for every d ∈ der(G) deriving a complete expression, there was a complete d′ ∈ Hyp-der(G) deriving the same complete expression (whence corollary 2). From theorem 6 we are able to conclude the reverse as well.

Corollary 3. For every G, and any feature sequence γ, $L_\gamma^{PBC}(G) = L_\gamma^{Hyp}(G)$.
As Hyp MGs were shown in theorem 2 to be context-free, we conclude that MGs subject to the proper binding constraint are as well.
6 Conclusion
We have demonstrated that movement by itself is not enough to describe non-context-free languages; the super-CFness of the MG formalism is essentially tied
to remnant movement. This result confirms the intuition of several (cf. [15,21]), and seems related to the results of [18] in the context of the formalism of the GB theory presented therein. Stabler [21] conjectures that:

Grammars in MG can define languages with more than 2 counting dependencies only when some sentences in those languages are derived with remnant movements.

As we have shown here that MGs without remnant movement can only define context-free languages, we have proven Stabler's conjecture. However, we can in fact show a strengthened version of this conjecture to be true. Beyond the mere existence of remnant movement, where an item moves from which another has already been moved, we can identify hierarchies of such movement, depending on whether the item moved out of the 'remnant mover' is itself a remnant, and if so, whether the item moved out of that item is a remnant, and so on. We could place an upper bound of k on the so-defined degree of remnant movement we were willing to allow by, using the tree-structured representation of moving subpieces from our definition of PBC-MGs, allowing the move operations to target -c features at depth up to k in the tree. In this case, however, we could simply enrich the complexity of our hypotheses in the corresponding Hyp-MGs by a finite amount, which would not change their generative capacity. Thus, in order to derive non-context-free languages, MGs must allow for movement of remnants of remnants of remnants...; in other words, an MG can define languages with more than two counting dependencies only when there is no bound k such that every sentence in the language is assigned a structure with remnant degree less than k. Given that MGs can analyze non-CF patterns only in terms of unbounded remnant movement, one question these results make accessible is which such patterns in human languages are naturally so analyzed. Perhaps the most famous of the supra-CF constructions in natural language is given by the relation between embedded verb clusters and their arguments in Swiss German [19]. Koopman and Szabolcsi [14] have provided an elegant analysis of verbal clusters in Germanic and Hungarian using remnant movement.⁵ Patterns of copying in natural language [5,17,11], on the other hand, do not seem particularly naturally treated in terms of unbounded remnant movement. [11] shows how the addition of 'copy movement' (non-linear string manipulation operations) to the MG formalism allows for a natural treatment of these patterns, one that is orthogonal to the question of whether our grammars for natural language should use bounded or unbounded remnant movement.
⁵ Not every linguist working in this tradition agrees that verb clusters are best treated in terms of remnant movement. [9] argues that remnant movement approaches to verb clustering are inferior to one using head movement. Adding head movement to MGs without remnant movement allows the generation of non-context-free languages [16].
References

1. Chomsky, N.: Lectures on Government and Binding. Foris, Dordrecht (1981)
2. Chomsky, N.: The Minimalist Program. MIT Press, Cambridge (1995)
3. Collins, C.: A smuggling approach to the passive in English. Syntax 8(2), 81–120 (2005)
4. Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison, S., Tommasi, M.: Tree automata techniques and applications (2002), http://www.grappa.univ-lille3.fr/tata
5. Culy, C.: The complexity of the vocabulary of Bambara. Linguistics and Philosophy 8(3), 345–352 (1985)
6. Fiengo, R.: On trace theory. Linguistic Inquiry 8(1), 35–61 (1977)
7. Gärtner, H.M., Michaelis, J.: Some remarks on locality conditions and minimalist grammars. In: Sauerland, U., Gärtner, H.M. (eds.) Interfaces + Recursion = Language?, Studies in Generative Grammar, vol. 89, pp. 161–195. Mouton de Gruyter, Berlin (2007)
8. Gazdar, G., Klein, E., Pullum, G., Sag, I.: Generalized Phrase Structure Grammar. Harvard University Press, Cambridge (1985)
9. Haider, H.: V-clustering and clause union – causes and effects. In: Seuren, P., Kempen, G. (eds.) Verb Constructions in German and Dutch, pp. 91–126. John Benjamins, Amsterdam (2003)
10. Harkema, H.: Parsing Minimalist Languages. Ph.D. thesis, University of California, Los Angeles (2001)
11. Kobele, G.M.: Generating Copies: An investigation into structural identity in language and grammar. Ph.D. thesis, University of California, Los Angeles (2006)
12. Kobele, G.M.: A formal foundation for A and A-bar movement in the minimalist program. In: Kracht, M., Penn, G., Stabler, E.P. (eds.) Mathematics of Language, vol. 10. UCLA (2007)
13. Kobele, G.M., Retoré, C., Salvati, S.: An automata theoretic approach to minimalism. In: Rogers, J., Kepser, S. (eds.) Proceedings of the Workshop Model-Theoretic Syntax at 10; ESSLLI 2007, Dublin (2007)
14. Koopman, H., Szabolcsi, A.: Verbal Complexes. MIT Press, Cambridge (2000)
15. Michaelis, J.: On Formal Properties of Minimalist Grammars. Ph.D. thesis, Universität Potsdam (2001)
16. Michaelis, J.: Notes on the complexity of complex heads in a minimalist grammar. In: Proceedings of the Sixth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+6), Venezia (2002)
17. Michaelis, J., Kracht, M.: Semilinearity as a syntactic invariant. In: Retoré, C. (ed.) LACL 1996. LNCS (LNAI), vol. 1328, pp. 37–40. Springer, Heidelberg (1997)
18. Rogers, J.: A Descriptive Approach to Language-Theoretic Complexity. CSLI Publications, Stanford (1998)
19. Shieber, S.M.: Evidence against the context-freeness of natural language. Linguistics and Philosophy 8, 333–343 (1985)
20. Stabler, E.P.: Derivational minimalism. In: Retoré, C. (ed.) LACL 1996. LNCS (LNAI), vol. 1328, pp. 68–95. Springer, Heidelberg (1997)
21. Stabler, E.P.: Remnant movement and complexity. In: Bouma, G., Hinrichs, E., Kruijff, G.J.M., Oehrle, R. (eds.) Constraints and Resources in Natural Language Syntax and Semantics, ch. 16, pp. 299–326. CSLI Publications, Stanford (1999)
22. Stabler, E.P., Keenan, E.L.: Structural similarity within and among languages. Theoretical Computer Science 293, 345–363 (2003)
The Algebra of Lexical Semantics

András Kornai

Institute for Quantitative Social Science, Harvard University, 1737 Cambridge St, Cambridge MA 02138
Hungarian Academy of Sciences, Computer and Automation Research Institute, 13-17 Kende u, H-1111 Budapest
[email protected] http://kornai.com
Abstract. The current generative theory of the lexicon relies primarily on tools from formal language theory and mathematical logic. Here we describe how a different formal apparatus, taken from algebra and automata theory, resolves many of the known problems with the generative lexicon. We develop a finite state theory of word meaning based on machines in the sense of Eilenberg [11], a formalism capable of describing discrepancies between syntactic type (lexical category) and semantic type (number of arguments). This mechanism is compared both to the standard linguistic approaches and to the formalisms developed in AI/KR.
1 Problem Statement
In developing a formal theory of lexicography our starting point will be the informal practice of lexicography, rather than the more immediately related formal theories of Artificial Intelligence (AI) and Knowledge Representation (KR). Lexicography is a relatively mature field, with centuries of work experience and thousands of eminently usable work products in the form of both mono- and multilingual dictionaries. In contrast to this, KR is a rather immature field, with only a few decades of work experience, and few, if any, usable products. In fact, our work continues the trend toward more formalized lexicon-building that started around the Longman Dictionary (Boguraev and Briscoe [6]) and the Collins-COBUILD dictionary (Fillmore and Atkins [14]), but takes it further in that our focus is with the mathematical foundations rather than the domain-specific algorithms. An entry in a standard monolingual dictionary will have several components, such as the etymology of the word in question; part of speech/grammatical category information; pronunciation guidelines in the form of phonetic/phonological transcription; paradigmatic forms, especially if irregular; stylistic guidance and examples; a definition, or several, for different senses of the word; and perhaps even a picture, particularly for plants, animals, and artifacts. It is evident from
the typeset page that the bulk of the information is in the definitions, and this is easily verified by estimating the number of bits required to encode the various components. Also, definitions are the only truly obligatory component, because a definition will be needed even for words lacking in exceptional forms (these are the majority) or an interesting etymology, with a neutral stylistic value, predictable part of speech (most words are nouns), and an orthography sufficiently indicative of pronunciation. There is little doubt that definitions are central to the description of words, yet we have far richer and better formalized theories of etymology, grammatical category, morphological structure, and phonological transcription than we have theories of word meaning. Of necessity, work such as Dowty [8] concentrates on elucidating the semantic analysis of those terms for which the logic has the resources: since Montague's intensional logic IL includes a time parameter, in-depth analysis of temporal markers (tense, aspect, time adverbials) becomes possible. But as long as the logic lacks analogous resources for space, kinship terms, sensory inputs, or obligations, this approach has no traction, and heaping all these issues on top of what was already a computationally intractable logic calculus has not proven fruitful. First Order Logic (FOL) is a continental divide in this regard. From a mathematical perspective, FOL is a small system, considering that the language of set theory requires only one binary relation, ∈, and it is evident both from the Peano and the ZF axioms that you will need all well-formed formulas (or at least the fragment that has no atomic sentence lying in the scope of more than three quantifiers, see Tarski and Givant [41]) to do arithmetic. Therefore, those who believe that mathematics is but a small, clean, well-organized segment of natural language will search for the appropriate semantics somewhere upwards of FOL – this is the Montague Grammar (MG) tradition, where higher order intensional logic is viewed as essential. There is already significant work in trying to restrict the power of the Turing-complete higher order intensional apparatus to FOL (Blackburn and Bos [5]) and here we take this further, moving to formalisms that fall at the low end of the complexity scale, well below FOL. At that point, much of what mathematical logic offers is not applicable, and methods of algebra have more traction, as will be discussed in Section 2 in more detail. It is widely accepted that "people who put knowledge into computers need mathematical logic, including quantifiers, as much as engineers need calculus" (McCarthy [32]) but we claim that these tools are neither available in natural language (as noted repeatedly by the inventors of modern mathematical logic from Frege and Russell to Tarski) nor are they required for the analysis of natural language text – in the MG-style analysis it is the needs of the computer programmer that are being catered to at the expense of modeling the actual cognitive capabilities of the native speaker. This is not to say that such needs, especially for the engineer building knowledge-based systems, are not real, but our thesis is that the formalism appropriate for natural language semantics is too weak to supply this, being capable of natively supporting only a far weaker form of analogical reasoning discussed in Section 4.
In this paper we offer a formal theory of lexical definitions. A word that is to be defined will be given in italics; its definition will use for the most part unary atoms, given in typewriter font and to a lesser extent binary atoms, given in small caps; its phonological representation (which we will also call its printname) will be marked by underscoring. Aside from the fancy typography, this is very much in keeping with linguistic tradition where a sign is conceived of as an ordered pair of meaning and form. (The typographical distinctions will pay off only in making the formal parts easier to parse visually – in running text, we will also use italics for emphasis and for the introduction of technical terms.) While we will have little to say about pronunciation, paradigmatic forms, style, or etymology here, the fact that these are important to the practice of lexicography is always kept in mind, and we will make an effort to indicate, however programmatically, how these are to be subsumed under the overall theory presented here. Given the widely accepted role of the lexicon in grammatical theory as the storage place of last resort, containing all that is idiosyncratic, arbitrary, and language-particular, the question must be asked: why should anyone want to dive in this trashcan? First, we need to see clearly that the lexicon is not trash, but rather it is the essential fuel of all communicative effort. As anyone trying to communicate in a language they mastered only at a tourist level will know, lack of crisp grammar is rarely a huge barrier to understanding. If you can produce the words, native speakers will generally be forgiving if the conjugation is shaky or the proper auxiliary is missing. But if you don't have the words for beef stew or watch repairman, knowing that the analytic present perfect combines stage-level and individual-level predication and thus gives rise to an inchoative meaning will get you nowhere. A more rigorous estimate of the information content of sentences confirms our everyday experience. The word entropy of natural language is about 12-16 bits/word (see Kornai [26]:7.1 for how this depends on the language in question). The number of binary parse trees over n nodes is $C_n \sim 4^n/(\sqrt{\pi}\,n^{1.5})$, or less than 2 bits per word. Aronoff [4] describes in some detail how the Masoretes used only 2 bits (four levels of symbols) to provide a binary parse tree for nearly every Biblical verse – what we learned of coding since would now enable us to create an equally sparse system that is sufficiently detailed to cover every possible branching structure with slightly less than two bits on the average. Definitions of logical structure other than by parse tree are possible, but they do not alter the picture significantly: logical structure accounts for no more than 12-16% of the information conveyed by a sentence, a number that actually goes down with increased sentence length. Another equally important reason why we need to develop a formal theory of word meaning is that without such a theory it is impossible to treat logical arguments like God cannot create a mountain without creating a valley which are based on the meaning of the predicates rather than on the meaning of the logical connectives. Why is this argument correct, even if we assume an omnipotent God? Because mountain means something like land higher than surrounding
land, so for there to be a mountain there needs to be a lower reference land; if there were no such reference 'valley', the purported mountain wouldn't actually be a mountain. For St. Thomas Aquinas the argument serves to demonstrate that even God is bound by the laws of logic, and for us it serves as a reminder that the entire Western philosophical tradition from Aristotle to the Schoolmen considered word meaning an essential part of logic. We should add here that the same is true of the Eastern tradition, starting with Confucius' theory of cheng ming (rectification of names) – for example, one who rules by force, rather than by the decree of heaven, is a tyrant, not a king (see Graham [16]:29). Modern mathematical logic, starting with De Morgan, could succeed in identifying a formal framework that can serve as a foundation of mathematics without taking the meaning of the basic elements into account because mathematical content differs from natural language content precisely in being lodged in the axioms entirely. However, for machine understanding of natural language text, lacking a proper theory of the meaning of words is far more of a bottleneck than the lack of compositional semantics, as McCarthy [31] and the closely related work on naive physics (Hayes [18]) already made clear. What does a theory of the lexicon have to provide? First, adequate support for the traditional lexicographic tasks such as distinguishing word senses, deciding whether two words/senses are synonymous or perhaps antonymous, whether one expression can be said to be a paraphrase of another, etc. Second, it needs to connect to a theory of the meaning of larger (non-lexicalized) constructions including, but not necessarily limited to, sentential syntax and semantics. Third, it should provide a means of linking up meanings across languages, serving as a translation pivot. Fourth, it should be coupled to some theory of inference that enables, at the very least, common sense reasoning about objects, people, and natural phenomena. Finally, the theory should offer learning algorithms whereby the representation of meanings can be acquired by the language learner. In this paper we disown the problem of learning, how an English-speaking child associates water with the sensory input (see Keller [21]), as it belongs more in cognitive science and experimental psychology than in mathematical linguistics, and the problem of pattern recognition: when is a person fat? It is possible to define this as the outcome of some physical measurements such as the Body Mass Index, but we will argue at some length that this is quite misguided. This is not to say that there is no learning problem or pattern recognition problem, but before we can get to these we first need a theory of what to learn and recognize. This is not the place to survey the history of lexical semantics, and we confine ourselves to numerical estimates of coverage on the core vocabulary. The large body of analytic work on function words such as connectives, modals, temporals, numerals, and quantifiers covers less than 5% of core vocabulary, where 90% are content words. Erring on the side of optimism and assuming that categories of space, case in particular, can be treated similarly, would bring this number up to 6%, but not further, since the remaining large classes of function words, in particular gender and class markers, are clearly non-logical. Another large body of research approaches natural kinds by means of species and genera. But in
spite of its venerable roots, starting with Aristotle's work on eidopoios diaphora, and its current popularity, including WordNet, EuroWordNet, and AsiaWordNet on the one hand and Semantic Web description logic (OWL) on the other, this method covers less than 10% of core vocabulary. This is still a big step forward in that it is imposing a formal theory on some content words, by means of a technique, default inheritance along is a links, that is missing from standard logic, including the high-powered modal intensional logics commonly used in sentential semantics. Perhaps surprisingly, the modern work on verb classification including Gruber [17], Dowty [9], Levin [29], FrameNet (Fillmore [12]), and VerbNet (Kipper et al [24]) has far broader scope, covering about 25% of core vocabulary. Taking all these together, and assuming rather generously that all formal problems concerning these systems have been resolved, this is considerably less than half of the core vocabulary, and when it comes to the operations on these elements, all the classical and modern work on the semantics associated with morphological operations (Pāṇini, Jakobson, Kiparsky) covers numerically no more than 5-10% of the core operations. That the pickings of the formal theory are rather slim is especially clear if we compare its coverage to that of the less formally stated, but often strikingly insightful work in linguistic semantics, in particular to the work of Wierzbicka, Lakoff, Fauconnier, Langacker, Talmy, Jackendoff, and others often broadly grouped together as 'cognitively inspired'. We believe that part of the reason why the formal theory has so little traction is that it aims too high, largely in response to the well-articulated needs of AI and KR.
2 The Basic Elements
In creating a formal model of the lexicon the key difficulty is the circularity of traditional dictionary definitions – the first English dictionary, Cawdrey [7] already defines heathen as gentile and gentile as heathen. The problem has already been noted by Leibniz (quoted in Wierzbicka [45]): Suppose I make you a gift of a large sum of money saying you can collect it from Titius; Titius sends you to Caius; and Caius, to Maevius; if you continue to be sent like this from one person to another you will never receive anything. One way out of this problem is to come up with a small list of primitives, and define everything else in terms of these. There are many efforts in this direction (the early history of the subject is discussed in depth in Eco [10]) but the modern efforts begin with Ogden’s [35] Basic English. The KR tradition begins with the list of primitives introduced by Schank [40], and a more linguistically inspired list is developed by Wierzbicka and the NSM school. But it is not at all clear how Schank or Wierzbicka would set about defining new words based on their lists (the reader familiar with their systems should try to apply them to any term that is not on their lists such as liver). As a result, in cognitive science many
have practically given up on meaning decomposition as hopeless. For example Mitchell et al. [33] distinguish words from one another by measuring correlation with their core words in the Google 5-gram data. Such correlations certainly do not constitute a semantic representation in the deductive sense we are interested in, but it requires no artful analysis, indeed, it requires no human labor at all, to come up with numerical values for any new word. Here we sketch a more systematic approach that exploits preexisting lexicographic work, in particular dictionary definitions that are already restricted to a smaller wordlist such as the Longman Defining Vocabulary (LDV) or Ogden's Basic English (BE). These already have the proven capability to define all other words in the Longman Dictionary of Contemporary English (LDOCE) or the Simple English wikipedia at least for human readers, though not necessarily in sufficient detail and precision for reasoning by a machine. Any defining vocabulary D subdivides the problem of defining the meaning of (English) words in two. First, the definition of other vocabulary elements in terms of D, which is our focus of interest, and second, defining D itself, based perhaps on primary (sensory) data or perhaps on some deeper scientific understanding of the primitives. A complete solution to the dictionary definition problem must go beyond a mere listing D of the defining vocabulary elements: we need both a formal model of each element and a specification of lexical syntax, which regulates how elements of D combine with each other (and possibly with other, already defined, elements) in the definition of new words. We emphasize that our goal is to provide an algebra of lexicography rather than a generative lexicon (Flickinger [15], Pustejovsky [36]) of the sort familiar from generative morphology. A purely generative approach would start from some primitives and some rules or constraints which, when applied recursively, provide an algorithm that enumerates the lexicon. The algebraic approach is more modest in that it largely leaves open the actual contents of the lexicon. Consider the semantics of noun-noun compounds. As Kiparsky [22] notes, ropeladder is 'ladder made of rope'; manslaughter is 'slaughter undergone by man'; and testtube is 'tube used for test', so the overall semantics can only specify that N₁N₂ is 'N₂ that is V-ed by N₁', i.e. the decomposition is subdirect (yields a superset of the target) rather than direct, as it would be in a fully compositional generative system. Another difference between the generative and the algebraic approach is that only the former implies commitment to a specific set of primitives. To the extent that work on lexical semantics often gets bogged down in a quest for the ultimate primitives, this point is worth a small illustrative example. Consider the

Table 1. Multiplication in Z₃

      e  a  b
   e  e  a  b
   a  a  b  e
   b  b  e  a
cyclic group Z₃ on three points given by the elements e, a, b and the preceding multiplication table. The unit element e is unique (being the one and only y satisfying yx = xy = x for all x) but not necessarily irreducible in that if a and b are given, both ab and ba could be used to define it. Furthermore, if a is given, there is no need for b in that aa already defines this element, so the group can be presented simply as a, aa, aaa = e, i.e. a is the 'generator' and a³ = e is the 'defining relation' (as these terms are used in group theory). Note, however, that the exact same group is equally well presented by using b as the generator and b³ = e as the defining relation – there is no unique/distinguished primitive as such. This non-uniqueness is worth keeping in mind when we discuss possible defining vocabularies. In algebra, similar examples abound: for example in a linear space any basis is just as good as any other to define all vectors in the space. For a lexical example, consider the Hungarian verbal stem toj and the derived tojó 'hen', tojás 'egg', and tojni 'to lay egg'. It is evident that eggs are what hens lay, hens are what lay eggs, and laying of eggs is what hens do. In Hungarian, the interdependence of the definitions is made clear by the fact that all three forms are derived from the same stem by productive processes, -ó is a noun-forming deverbal suffix denoting the agent, -ás denotes the action or the result, and -ni is the infinitival suffix. But the same arbitrariness in the choice of primitives can be just as evident in less transparent examples, where the common stem is lacking: for example in English hen and egg it is quite unclear which one is logically prior. Consider prison 'place where inmates are kept by guards', guard 'person who keeps inmates in prison', and inmate 'person who is kept in prison by guards'. One could easily imagine a language where prison guards are called keepers, inmates keepees, and the prison itself a keep. The mere fact that in English the semantic relationship is not signaled by the morphology does not mean that it's not there – to the contrary, we consider it an accident of history, beyond the reach of explanatory theory, that the current nominal sense of keep, 'fortress', is fortified place to keep the enemy out rather than to keep prisoners in. What is, then, a reasonable defining vocabulary D? We propose to define one from the outside in, by analyzing the LDV or BE rather than building from the inside out from the putative core lists of Schank or Wierzbicka. This method guarantees that at any given point of reducing D to some smaller D' we remain capable of defining all other words, not just those listed in LDOCE (some 90k items) or the Simple English wikipedia (over 30k entries) but also those that are definable in terms of these larger lists (really, the entire unabridged vocabulary of English). In the computational work that fuels the theoretical analysis presented here we begin with our own version of the LDV, called 4lang, which includes Latin, Hungarian, and Polish translations in the intended senses, both because we do not wish to lose sight of the longer term goal of translation and as a clear means of disambiguation for concepts whose common semantic root, if there ever was one, is no longer transparent, e.g. interest 'usura' v. interest 'studium'.
Clearly, a similarly disambiguated version of the BE vocabulary, or any other reasonable starting point, could just as well be used. We perform the analysis of the starting D in several chunks, many corresponding to what old-fashioned lexicographers would call a semantic field (Trier [42]), conceptually related terms that are likely candidates to be defined in terms of one another such as color terms, legal terms, and so on. We will not attempt to define the notion of semantic fields in a rigorous fashion, but use an operational definition based on Roget's Thesaurus. For example, for color terms we take about 30 stanzas from Roget 420 Light to Roget 449 Disappearance (numbering follows the 1911 edition of Roget's as this is available as a Project Gutenberg etext #10681), and for religious terms we take 25 stanzas Roget 976 Deity to Roget 1000 Temple. Since the chunking is purely pragmatic, we need not worry about the issues that plague semantic fields: for our purposes it matters but little where the limits of each field are, whether the resulting collections of words and concepts are properly named, or whether some kind of hierarchy can or should be imposed on them – all that matters is that each form a reasonable unit of workable size, perhaps a few dozen to a few hundred stanzas. We will mostly use the Religion field to illustrate our approach, not because we see it as somehow privileged but rather because it serves as a strong reminder of the inadequacy of the physicalist approach. In discussing color, we may be tempted to dispense with a defining vocabulary D in favor of a more scientifically defined core vocabulary, but in general such core expressions, if truly restricted to measurable qualia, have very limited traction over much of human social activity. The main fields defined through Roget are size R031 – R040a and R192 – R223; econ R775 – R819; emotion/attitude R820 – R936 except R845 – R852 and R922 – R927; esthetics R845 – R852; law/morals R937 – R975 plus R922 – R927. In this process, about a quarter of the LDV remains unaffiliated. For Religion we obtain the list anoint, believe, bless, buddhism, buddhist, call, ceremony, charm, christian, christianity, christmas, church, clerk, collect, consecrated, cross, cure, devil, dip, doubt, duty, elder, elect, entrance, fairy, faith, faithful, familiar, fast, father, feast, fold, form, glory, god, goddess, grace, heaven, hinduism, holy, host, humble, jew, kneel, lay, lord, magic, magician, mass, minister, mosque, move, office, people, praise, pray, prayer, preserve, priest, pure, religion, religious, reverence, revile, rod, save, see, service, shade, shadow, solemn, sound, spell, spirit, sprinkle, temple, translate, unity, word, worship. (Entries are lowercased for ease of automated stemming etc.) Two problems are evident from such a list. First, there are several words that do not fully belong in the semantic field, in that the sense presented in Roget's is different from the sense in the LDV: for example port is not a color term and father is not a religious term in the primary sense used in the LDV. Such words are manually removed, since defining the religious sense of father or the color sense of port would in no way advance the cause of reducing the size of D. Programmatic removal is not feasible at this stage: to see what the senses are, and thus to see that the core sense is not the one used in the field, would require a working theory of lexical semantics of the sort we are developing here. Once such
a theory is at hand, we may use it to verify the manual work performed early on, but this is only a form of error checking, rather than learning something new about the domain. Needless to say, father still needs to be defined or declared a primitive, but the place to do this is among kinship terms, not religious terms. If a word is kept, this does not mean that it is unavailable outside the semantic field; clearly Bob worships the ground Alice walks on does not mean anything religious. However, for words inside the field such as worship even usage external to the field relies on the field-internal metaphor, so the core/defining sense of the word is the one inside. Conversely, if usage does not require the field-internal metaphor, the word/sense need not be treated as part of the size reduction effort: for example, This book fathered a new genre does not mean (or imply) that the object will treat the subject with reverence, so father can be left out of the religion field. Ideally, with a full sense-tagged corpus one could see ways of making such decisions in an automated fashion, but in reality creating the corpus would require far more manual work than making the decisions manually. Since the issue of different word senses will come up many times, some methodological remarks are in order. Kirsner [25] distinguishes two polarly opposed approaches. The polysemic approach aims at maximally distinguishing senses: as many senses are posited as appear distinct, e.g. bachelor₁ 'unmarried adult man', bachelor₂ 'fur seal without a mate', bachelor₃ 'knight serving under the banner of another knight', and bachelor₄ 'holder of a BA degree'. The monosemic approach (also called Saussurean and Columbia School approach by Kirsner, who calls the polysemic approach cognitive) searches for a single, general, abstract meaning, and would subsume at least the first three senses above in a single definition, 'unfulfilled in typical male role'. This is not the place to fully compare and contrast the two approaches (Kirsner's work offers an excellent starting point), but we note here a significant advantage of the monosemic approach, namely that it makes interesting predictions about novel usage, while the predictions of the polysemic approach border on the trivial. To stay with the example, it is possible to envision novel usage of bachelor to denote a contestant in a game who wins by default (because no opponent could be found in the same weight class or the opponent was a no-show). The polysemic theory would predict that not just seals but maybe also penguins without a mate may be termed bachelor – true, but not very revealing. The choice between monosemic and polysemic analysis need not be made on a priori grounds: even the strictest adherent of the polysemic approach would grant that bachelor's degree refers, at least historically, to the same kind of apprenticeship as bachelor knight. Conversely, even the strictest adherent of the monosemic approach must admit that the relationship between 'obtaining a BA degree' and 'being unfulfilled in a male role' is no longer apparent to contemporary language learners. That said, we still give methodological priority to the monosemic approach because of the original Saussurean motivation: if a single form is used, the burden of proof is on those who wish to posit separate meanings (see Ruhl [39]). An important consequence of this methodological stance is
that we will rarely speak of metaphorical usage, assuming instead that the core meaning already extends to such cases. A second problem, which has notable impact on the structure of the list, is the treatment of natural kinds. By natural kinds here we mean not just biologically defined kinds as ox or yak, but also culturally defined artifact types like tuxedo or microscope – as a matter of fact the cultural definition has priority over the scientific definition when the two are in conflict. The biggest reason for the inclusion of natural kinds in the LDV is not conceptual structure but rather the eurocentric viewpoint of LDOCE: for the English speaker it is reasonable to define the yak as ox-like, but for a Tibetan defining the ox as yak-like would make more sense. There is nothing wrong with being eurocentric in a dictionary of an Indoeuropean language, but for our purposes neither of these terms can be truly treated as primitive. So far we discussed the lexicon, the repository of linguistic knowledge about words. Here we must say a few words about the encyclopedia, the repository of world knowledge. While our goal is to create a formal theory of lexical definitions, it must be acknowledged that such definitions can often elude the grasp of the linguist and slide into a description of world knowledge of various sorts. Lexicographic practice acknowledges this fact by providing, somewhat begrudgingly, little pictures of flora, fauna, or plumbers’ tools. A well-known method of avoiding the shame of publishing a picture of the yak is to make reference to Bos grunniens and thereby point the dictionary user explicitly to some encyclopedia where better information can be found. We will collect such pointers in a set E, and use curly braces to set them typographically apart from references to lexical content. When we say that light is defined as {flux of photons in the visible band}, what this really means is that light must be treated as a primitive. There is a physical theory of light which involves photons, a biophysical theory of visual perception that involves sensitivity of the retina to photons of specific wavelengths, but we are not interested in these theories, we are just offering a pointer to the person who is. From the linguistic standpoint light is a primitive, irreducible concept, one that people have used for millennia before the physical theory of electromagnetic radiation, or even the very notion of photons, was available. Ultimately any system of definitions must be rooted in primitives, and we believe the notion light is a good candidate for such a primitive. From the standpoint of lexicography only two things need to be said: first, whether we intend to take the nominal or the verbal meaning as our primitive, and second, whether we believe that the primitive notion light is shared across the oppositions with dark and with heavy or whether we have two different senses of light. In this particular case, we choose the second solution, treating the polysemy as an accident of English rather than a sign of deep semantic relationship, but the issue must be confronted every time we designate an element as primitive. The issue of how to assign grammatical category (also called part of speech or POS) to the primitives will be discussed in Section 3, but we note here in advance that we keep the semantic part of the representation constant across verbs, their substantive forms, and their cognate objects.
The same point needs to be made in regard to ontological primitives like time. While it is true that the time used in the naive physics model is discrete and asynchronous, this is not intended as some hypothesis concerning the ultimate truth about physical time, which appears continuous (except possibly at a Planck scale) and appears distinct from space and matter (but is strongly intertwined with these). We take the appropriate method for deciding such matters to be physical experimentation and theory-making, and we certainly do not propose to find out the truth of the matter by reverse-engineering the lexica of natural languages. Since the model is not intended as a technical tool for the analysis of synchrony or continuous time, we do not wish to burden it with the kind of mechanisms, such as Petri nets or real numbers, that one would need to analyze such matters. Encyclopedic knowledge of time may of course include reference to the real numbers or other notions of continuous time, but our focus is not so much with a deep understanding of time as with tense marking in natural language, and it is the grammatical model, not the ontology, that carries the burden of recapitulating this. For the sake of concreteness we will assume a Reichenbachian view, distinguishing four different notions of time: (i) speech time, when the utterance is spoken, (ii) perspective time, the vantage point of temporal deixis, (iii) reference time, the time that adverbs refer to, and (iv) event time, the time the named event unfolds. Typically, these are intervals, possibly open-ended, more rarely points (degenerate intervals), and the hope is that we can eventually express the temporal semantics of natural language in terms of interval relations such as 'event time precedes reference time' (see Allen [1], [2], Kiparsky [23]); a minimal sketch of this machinery follows at the end of this section. The formal apparatus required for this is considerably weaker than that of FOL. One important use of external pointers worth separate mention is for proper names. By sun we mean primarily the star nearest to us. The common noun usage is secondary, as is clear from the historical fact that people before Giordano Bruno didn't even know that the small points of light visible on the night sky were also suns. That we have a theory of the Sun as {the nearest star} where the, near, -est, and star are all members of the LDV is irrelevant from a lexicographic standpoint – what really matters is that there is a particular object, ultimately identified by deixis, that is a natural kind in its own right. The same goes for natural kinds such as oxygen or bacteria that may not even have a naive lexical theory (it is fair to say that all our knowledge about these belongs in chemistry and the life sciences) and for cultural kinds such as tennis, television, british, or october. In 3.3 we return to the issue of how to formalize those cases when purely lexical knowledge is associated with natural kinds, e.g. that tennis is a game played with a ball and rackets, that November follows October, or that bacteria are small living things that can cause disease, but we wish to emphasize at the outset that there is much in the encyclopedia that our formalism is not intended to cover, e.g. that the standard atomic weight of oxygen is 15.9994(3). Lest the reader feel that any reference to some external encyclopedia is tantamount to shirking of lexicographic duty, it is worth keeping in mind that natural and cultural kinds amount to less than 6% of the LDV.
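As promised above, here is a minimal sketch of the interval machinery; the Python encoding (intervals as pairs of numbers) is illustrative only, since the text commits to nothing beyond interval relations in the sense of Allen [1,2].

```python
from typing import NamedTuple

class Interval(NamedTuple):
    start: float
    end: float

def precedes(i, j):
    """Allen's 'before' relation: i ends strictly before j starts."""
    return i.end < j.start

# 'event time precedes reference time' -- e.g. a perfect-like reading;
# the particular numbers are arbitrary illustrations:
event = Interval(1, 2)        # event time
reference = Interval(3, 3)    # reference time (a degenerate interval)
assert precedes(event, reference)
```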
Returning to the field of religion, when we define Islam as religion centered on the teachings of {Mohamed}, the curly braces acknowledge the fact that Mohamed (and similarly Buddha, Moses, or Jesus Christ) will be indispensable in any effort aimed at defining Islam (Buddhism, Judaism, or Christianity). The same is true for Hinduism, which we may define as being centered on revealed teachings ({śruti}), but of course to obtain Hinduism as the definiendum the definiens must make it clear that it is not any old set of revealed teachings that are central to it but rather the Vedas and the Upanishads. One way or another, when we wish to define such concepts as specific religions, some reference to specific people and texts designated by proper names is unavoidable. Remarkably, once the names of major religious figures and the titles of sacred texts are treated as pointers to the encyclopedia, there remains nothing in the whole semantic field that is not definable in terms of non-religious primitives. In particular, god can be defined as being, supreme where supreme is simply about occupying the highest position in a hierarchy (being a being has various implications, see Section 3.1, but none of these are particularly religious). The same does not hold for the semantic field of color, where we find irreducible entries such as light. Needless to say, our interest is not with exegesis (no doubt theologians could easily find fault with the particular definitions of god and the major religions offered here) but with the more mundane aspects of lexicography. Once we have buddhism, christianity, hinduism, islam, and judaism defined, buddhist, christian, hindu, muslim, and jew fall out as adherent of buddhism, ..., judaism for the noun denoting a person, and similarly for the adjectives buddhist, christian, hindu, islamic, jewish which get defined as of or about buddhism, ..., judaism. We are less concerned with the theological correctness of our definitions than with the proper choice of the base element: should we take the -ism as basic and the -ist as derived, should we proceed the other way round, or should we, perhaps, derive both (or, if the adjectival form is also admitted, all three) from a common root? Our general rule is to try to derive the morphologically complex from the morphologically simplex, but exceptions must be made e.g. when we treat jew as derived (as if the word were *judaist). These are well handled by some principle of blocking (Aronoff [3]), which makes the non-derived jew act as the printname for *judaist.

Another, seemingly mundane, but in fact rather thorny issue is the treatment of bound morphemes. The LDV includes, with good reason, some forty suffixes -able, -al, -an, -ance, -ar, -ate, -ation, -dom, -en, -ence, -er, -ess, -est, -ful, -hood, -ible, -ic, -ical, -ing, -ion, -ish, -ist, -ity, -ive, -ization, -ize, -less, -like, -ly, -ment, -ness, -or, -ous, -ry, -ship, -th, -ure, -ward, -wards, -work, -y and a dozen prefixes counter-, dis-, en-, fore-, im-, in-, ir-, mid-, mis-, non-, re-, un-, vice-, well-. This affords a great reduction in the size of D, in that a stem such as avoid now can appear in the definiens in many convenient forms such as avoidable, avoidance, avoiding as the syntax of the definition dictates. Including affixes is also the right decision from a cross-linguistic perspective, as it is evident that notions that are expressed by free morphemes in one language, such as possession
(English my, your, ...), are expressed in many other languages by affixation. But polysemy can be present in affixes as well: for example, English and Latin have four affixes -an/anus, -ic/ius, -ical/icus, and -ly/tus where Hungarian and Polish have only one -i/anin, and we have to make sure that no ambiguity is created in the definitions by the use of polysemous affixes. Altogether, affixes and affix-like function words make up about 8-9% of the LDV, and the challenge they pose to the theory developed here is far more significant than that posed by natural kinds in that their proper analysis involves very little, if any, reference to encyclopedic knowledge.

Finally, there is the issue of the economy afforded by primitive conceptual elements that have no clear exponent in the LDV. For example, we may decide that we feel sorrow when something bad happens to us, gloating when it happens to others, happiness when something good happens to us, and resentment when it happens to others. (The example is from Hobbs [19], and there is no claim here or in the original that these are the best or most adequate emotional responses. Even if we agree that they are not, this does not affect the following point, which is about the economy of the system rather than about morally correct behavior.) Given that good, bad, and happen are primitives we will need in many corners of the system, we may wish to rely on some sociological notion of in-group and out-group rather than on the pronouns us and them in formalizing the above definitions. This has the clear advantage of remaining applicable independent of the choice of in-group (be it family, tribe, nation, colleagues, etc.) and of indexical perspective (be it ours or theirs). Considerations of economy dictate that we use abstract elements as long as we can reduce the defining vocabulary D by more than one item: whether we prefer to use in-group, out-group or us, them as primitives is more a matter of taste than a substantive issue. If two solutions D and D' have the same size, we have no substantive reason to prefer one to the other. That said, for expository convenience we will still prefer non-technical to technical and Anglo-Saxon to Latinate vocabulary in our choice of primitives.

To summarize what we have so far, for the sake of concreteness we identified a somewhat reduced version of the LDV, less than 2,000 items, including some bound morphemes and natural kinds, as our defining vocabulary D, but we make no claim that this is in any way superior to some other base list D' as long as D' is not bigger than D.
3 The Formal Model
The key issue is not so much the membership of D as the mechanism that regulates how its elements are put together. Here we depart from the practice of the LDOCE, which uses natural language paraphrases, in favor of a fully formal theory. In 3.1 we introduce the elements of this theory, which we will call lexemes. In 3.2 we turn to the issue of how these elements are combined with one another. The semantics of the representations is discussed in 3.3. The formalism is introduced gradually, establishing the intuitive meaning of the various components before the fully formal definitions are given.
3.1 Lexemes
We will call the basic building blocks of our system lexemes because they offer a formal reconstruction of the informal notion of lexicographic lexemes. Lexemes are well modularized knowledge containers, ideally suited for describing our knowledge of words (as opposed to our encyclopedic knowledge of the world, which involves a great deal of non-linguistic knowledge such as motor skills or perceptual inputs for which we lack words entirely). Lexemes come in two main varieties: unary lexemes, which correspond to most nouns, adjectives, verbs, and content words in general (including most transitive and higher arity verbs as well), will be written in typewriter font, and binary lexemes, corresponding to adpositions, case markers, and other linkers, will be written in small caps. Ignoring the printnames, the base of unary lexemes consists of an unordered (conjunctive) list of properties, e.g. the dog is four-legged, animal, hairy, barks, bites, faithful, inferior; the fox is four-legged, animal, hairy, red, clever. Binary lexemes are to be found only among the function words: for example at(x,y) 'x is at location y', has(x,y) 'x possesses y', cause(x,y) etc. In what follows these will be written infix, which lets us do away with variables entirely. (Thus the notation already assumes that there are no true ditransitives, a position justified in more detail in Kornai [27].) Binary lexemes have two defining lists of properties, one list pertaining to their first (superordinate) argument and another to their second (subordinate) argument – these two are called the base of the lexeme. We illustrate this on the predicate has, which could be the model for verbs such as owns, has, possesses, rules, etc. The differences between John has Rover and Rover has John are best seen in the implications (defaults) associated with the superordinate (possessor) and subordinate (possessed) slots: the former is assumed to be independent of the latter, the latter is assumed to be dependent on the former, the former controls the latter (and not the other way around), the former can end the possession relationship unilaterally, the latter can not, etc. The list of definitional properties is thus partitioned in two: those that belong to the superordinate argument are collected in the head partition, those belonging to the subordinate argument are listed on the dependent partition.

The lexical entries in question may also include pointers to sensory data, biological, visual, or other extralinguistic knowledge about dogs and foxes. We assume some set E of external pointers (which may even be two-way in the sense that external sensory data may trigger access to lexical content) to handle these, but here E will not be used for any purpose other than delineating linguistic from non-linguistic concerns. How about the defining elements that we collected in D? These are no different: their definitions can refer to other lexemes that correspond to their essential properties. So definitions can invoke other definitions, but the circularity causes no foundational problems, as argued above.

Following Quillian [37], semantic networks are generally defined in terms of some distinguished links: is a to encode facts such as dogs are animals, and attr to encode facts such as that they are hairy. Here neither the genus nor the attribution relation is encoded explicitly. Rather, everything that appears on the distinguished (head) partition is attributed (or predicated) directly, and is a is
defined simply by containment of the essential properties. Elementary pieces of link-tracing logic, such as a is a b ∧ b is a c ⇒ a is a c or a is a b ∧ b has c ⇒ a has c follow without any stipulation if we adopt this definition, but the system becomes more redundant: instead of listing only essential properties of dogs we need to list all the essential properties of the supercategories such as animals as well. Altogether, the use of is a links leads to better modularized knowledge bases, and for this reason we retain them as a presentation device, but without any special status: for us dog is a animal is just as valid as dog is a hairy and dog is a barks.

From the KR perspective the main point here is that there is no mixing of strict and default inheritance; in fact there is no strict portion of the system (except possibly in the encyclopedic part which need not concern us here). If we know that animals are alive then we know that donkeys are alive. If we know that being alive implies life functions such as growth, metabolism, and replication this implication will again be inherited by animals and thus by mules as well. The encyclopedic knowledge that mules don't replicate has to be learned separately. Once acquired, this knowledge will override the default inheritance, but we are equally interested in the naive world-view where such knowledge has not yet been acquired. Only the naive lexical knowledge will be encoded by primitives directly: everything else must be given indirectly, by means of a pointer or set of pointers to encyclopedic knowledge.

The most essential information that the lexicon has about tennis is that it is a game; all the world knowledge that we have about it, the court, the racket, the ball, the pert little skirts, and so forth, are stored in a non-lexical knowledge base. This is also clear from the evidence from word-formation: clearly table tennis is a kind of tennis, yet it requires no court, has a different racket, ball, and so forth. The clear distinction between essential (lexical) and accidental (encyclopedic) knowledge has broad implications for the contemporary practice of Knowledge Representation, exemplified by systems like CyC (Lenat and Guha [28]) or Mindpixel, in that the current homogeneous knowledge bases need to be refactored, splitting out a small, lexical base that is entirely independent of domain.

The syntax of well-formed lexemes can be summarized in a Context-Free Grammar (V, Σ, R, S) as follows. The nonterminals V are the start symbol S, the binary relation symbols B, and the unary relation symbols collected in U. Variables ranging over V will be taken from the end of the Latin alphabet, v, w, x, y, z. The terminals are the grouping brackets '[' and ']', the derivation history parentheses '(' and ')', and we introduce a special terminating operator ';' to form a terminal v; from any nonterminal v. The rule S → U | B | λ handles the decision to use unary or binary lexemes, or perhaps none at all. The operation of attribution is captured in the rule schema w → w;[S*] which produces the list defining w. This requires the CFG to be extended in the usual sense that regular expressions are permitted on the right hand side, so the rule really means w → w;[] | w;[S] | w;[SS] | ... Finally, the operation of predication is handled by u → u;(S) for unary, and v → Sv;S for binary nonterminals. All lexemes are built up recursively by these rules.
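To make the recursion concrete, here is a minimal sketch of the lexeme syntax as a pair of Python data classes; the class and field names are ours, not the paper's, and the example entries merely echo the dog and has illustrations above.

from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class U:
    """A unary lexeme u; with an attribution list [S*] and an optional
    predication argument u;(S)."""
    name: str
    attrs: List['Term'] = field(default_factory=list)
    arg: Optional['Term'] = None

@dataclass
class B:
    """A binary lexeme written infix, S v; S: two content partitions."""
    name: str
    head: Optional['Term'] = None   # superordinate argument
    dep: Optional['Term'] = None    # subordinate argument

Term = Union[U, B]

# dog is four-legged, animal, hairy, barks, bites, faithful, inferior
dog = U('dog', attrs=[U(p) for p in
      ['four-legged', 'animal', 'hairy', 'barks', 'bites', 'faithful', 'inferior']])

# John has Rover: the head partition holds the possessor, the dependent
# partition the possessed
john_has_rover = B('has', head=U('John'), dep=U('Rover'))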
3.2 Combining the Lexemes
The first level of combining lexemes is morphological. At the very least, we need to account for productive derivational morphology, the prefixes and suffixes that are part of D, but in general we expect a theory that is just as capable of handling cases not easily exemplified in English such as binyanim. Compounding, to the extent predictable, also belongs here, and so does nominalization, especially as definitions make particularly heavy use of this process. The same is true for inflectional morphology, where the challenge is not so much English (though the core set -s, 's, -ing, -ed must be covered) as languages with more complex inflectional systems. Since certain categories (e.g. gender and class system) can be derivational in one language but inflectional in another, what we really require is coverage of all productive morphology. This is obviously a tall order, and within the confines of this paper all we can do is to discuss one example, deriving insecure from in- and secure, as this will bring many of the characteristic features of the system into play. Irrespective of whether secure is primitive (we assume it is not), we need some mechanism that takes the in- lexeme, the secure lexeme, and creates an insecure lexeme whose definition and printname are derived from those of the inputs.

To forestall confusion we note here that not every morphologically complex word will be treated as derived. For example, it is clear, e.g. from the strong verb pattern, that withstand is morphologically complex, derived from with and stand (otherwise we would expect the past tense to be *withstanded rather than withstood), yet we do not attempt to describe the operation that creates it. We are content with listing withstand, understand, and other complex forms in the lexicon, though not necessarily as part of D. Similarly, if we have a model capable of accounting for insecure in terms of more primitive elements, we are not required to overapply the technique to inscrutable or ineffable just because these words are also morphologically complex and could well be, historically, the residue of in- prefixation to stems no longer preserved in the language. Our goal is to define meanings, and the structural decomposition of every lexeme to irreducible units is pursued only to the extent it advances this goal.

Returning to insecure, the following facts should be noted. First, that the operation resides entirely in in- because secure is a free form. Second, that a great deal of the analysis is best formulated with reference to lexical categories (parts of speech): for example, in- clearly selects for an adjectival base and yields an adjectival output (the category of in- is A/A), because those forms such as income or indeed that are formed from a verbal or nominal base lack the negative meaning of in- that we are concerned with (and are clearly related to the preposition in rather than the prefix in/im that is our target here). Third, that the meaning of the operation is exhaustively characterized by the negation: forms like infirm where the base firm no longer carries the requisite meaning still carry a clear negative connotation (in this case, 'lacking in health' rather than 'lacking in firmness'). In fact, whatever meaning representation we assign to the lexically listed element insecure must also be available for the non-lexical (syntactically derived) not secure.
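A toy rendition of this derivation may help fix ideas. The sketch below is our own illustration, not the paper's formalism: lexical entries are (category, printname, definition-set) triples, in- checks for an adjectival base, adds the primitive neg to the definition, and the assimilation table for the printname is a deliberately crude stand-in for the finite state treatment mentioned below.

NEG = 'neg'
# First-segment assimilation: in+secure but im+precise (simplified)
ASSIMILATION = {'p': 'im', 'b': 'im', 'm': 'im', 'r': 'ir', 'l': 'il'}

def in_prefix(entry):
    cat, phon, definition = entry
    assert cat == 'A'                       # in- has category A/A
    prefix = ASSIMILATION.get(phon[0], 'in')
    return ('A', prefix + phon, definition | {NEG})

secure = ('A', 'secure', {'able to withstand attack'})
print(in_prefix(secure))
# ('A', 'insecure', {'able to withstand attack', 'neg'})  (set order may vary)
print(in_prefix(('A', 'precise', {'exact'})))   # printname 'imprecise'

Where the neg ends up attaching inside the definition is exactly the substitution problem taken up below; here it is simply added at the top level of the unordered set.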
In much of model-theoretic semantics (the major exception is the work of Turner [43], [44]) preserving the semantic unity of stems like secure, which can be a verb or an adjective, or stems like divorce, which can be both nouns and verbs, with no perceptible meaning difference between the two, is extremely hard because of the differences in signature. Here it is clear that the verb is derived from the adjective: clearly, the verb to secure x means 'make x (be) secure', so when we say that in- selects for an adjectival base, this just means the part of the POS structure of secure that permits verbal combinatorics is filtered out by application of the prefix. The adjective secure means 'able to withstand attack'. Prefixation of in- is simply the addition of the primitive neg to the semantic representation and concatenation plus assimilation in the first, cf. in+secure and im+precise. (We note here, without going into details, that the phonological changes triggered by the concatenation are also entirely amenable to treatment in finite state terms.) As for the invisible deadjectival verb-forming affix (paraphrased as make) that we posited here to obtain the verbal form, it does two things: first, it brings a subject slot x, and second, it contributes a change of state predicate – before, there wasn't an object y, and now there is. The first effect, which requires making a distinction between external (subject) and internal (direct object, indirect object, etc.) arguments, follows a long tradition of syntactic analysis going back at least to Williams [46], and will just be assumed without argumentation here, but the latter is worth discussing in greater detail, as it involves a key operation among lexemes, substitution, to which we turn now.

Some form of recursive substitution of definitions in one another is necessary both for work aimed at reducing the size of the DV and for attempts to define non-D elements in terms of the primitives listed in D. When we add an element of negation (here given simply as neg, and a reasonable candidate for inclusion in D) to a definition such as 'able to withstand attack', how do we know that the result is 'not able to withstand attack' rather than 'able to not withstand attack' or even 'able to withstand not attack'? The question is particularly acute because the head just contains the defining properties as elements of a set, with no order imposed. (We note that this is a restriction that we could trivially give up in favor of ordered lists, but only at a great price: once ordered lists are admitted the system would become Turing-complete, just like HPSG.) Another way of asking the same question is to ask how the system deals with iterated substitutions, for even if we assume that able and attack are primitives (they are listed in the LDV), surely withstand is not: x withstands y means something like 'x does not change from y' or even 'x actively opposes y'. Given our preference for a monosemic analysis we take the second of these as our definition, but this makes the problem even more acute: how do we know that the negation does not attach to the actively portion of the definition?

What is at stake here is the single most important property of definitions, that the definiens can be substituted for the definiendum in any context. Since many processes, such as making a common noun definite, which are performed by syntactic means in English, will be performed by inflectional means in
other languages such as Rumanian, complete coverage of productive morphology in the world's languages already implies coverage of a great deal of syntax in English. Ideally, we would wish to take this further, requiring coverage of syntax as a whole, but we could be satisfied with slightly less, covering the meaning of syntactic constructions only to the extent they appear in dictionary definitions. Remarkably, almost all problem cases in syntax are already evident in this restricted domain, especially as we need to make sure that constructions and idioms are also covered. There are forms of grammar which assume all syntax to be a combination of constructions (Fillmore and Kay [13]), and the need to cover the semantics of these is already clear from the lexical domain: for example, a mule is animal, cross between horses and donkeys, stubborn, ... Clearly, a notion such as 'cross between horses and donkeys' is not a reasonable candidate for a primitive, so we need a mechanism for feeding back the semantics of nonce constructions into the lexicon. This leaves only the totally non-lexicalized, purely grammatical part of syntax out of scope, cases such as topicalization and other manipulation of given/new structure, as dictionary definitions tend to avoid communicative dynamics. But with this important caveat we can state the requirement that lexical semantics cover not just the lexical, but also the syntactic combination of morphemes, words, and larger units.

3.3 The Semantics of Lexemes
Now that we have seen the basic elements (lexemes) and the basic mode of combination (attribution, modeled as listing in the base of a lexeme), the question will no doubt be asked: how is this different from Markerese (Lewis [30])? The answer is that we will interpret our lexemes in model structures, and make the combination of lexemes correspond to operations on these structures, very much in the spirit of Montague [34]. Formally, we have a source algebra A that is freely generated from some set of primitives D by means of constructions listed in C. An example of such a construction would be x is to y as z is to w, which is used not just in arithmetic (proportions) but also in everyday analogy: Paris is to London as France is to England; but in-prefixation would also be a construction of its own. We will also have an algebra M of machines, which will serve as our model structures, and a mapping σ of semantic interpretation that will assign elements of M both to elements of D and to elements of A formed from these in a compositional manner. This can be restated even more compactly in terms of category theory: members of D, plus all other elements of the lexicon, plus all expressions constructed from these, are the objects of some category L of linguistic expressions, whose arrows are given by the constructions and the definitional equations; members of M, and the mappings between them, make up the category M; and semantic interpretation is simply a functor S from L to M.

The key observation, which bears repeating at this point, is that S underdetermines the semantics of lexicalized expressions: if noun-noun compounding (obviously a productive construction of English) has the semantics 'N2 that is
V-ed by N1' all the theory gives us is that ropeladder is a kind of ladder that has something to do with rope. What we obtain is ladder, rope rather than the desired ladder, material, rope. Regrettably, the theory can take us only so far – the rest has to be done by diving into the trashcan and cataloging historical accidents.

Lexemes will be mapped by S on finite state automata (FSA) that act on partitioned sets of elements of D ∪ D ∪ E (the underlined forms are printnames). Each partition contains one or more elements of D ∪ E or the printname of the lexeme (which is, as a matter of fact, just another pointer, to phonetic/phonological knowledge, a domain that we happen to have a highly developed theory of). By action we mean a relational mapping, which can be one to many or many to one, not just permutation. These FSA, together with the mapping associating actions to elements of the alphabet, are machines in the standard algebraic sense (Eilenberg [11]), with one added twist: the underlying set, called the base of the machine, is pointed (one element of it is distinguished). The FSA is called the control, the distinguished point is called the head of the base. Without control, a system composed of bases would be close to a semantic network, with activations flowing from nodes to nodes (Quillian [38]). Without a base, the control networks would just form one big FSA, a primitive kind of deduction system, so it is the combination of these two facets that give machines their added power and flexibility. Since the definitional burden is carried in the base, and the combinatorial burden in the control, the formal model has the resources to handle the occasional mismatch between syntactic type (part of speech) and semantic type (as defined by function-argument structure).

Let us now survey lexemes in order of increasing base complexity. If the base is empty, it has no relations, so the only FSA that can act on it is the null graph (no states and no transitions). This is called the null lexeme. If the set has one member, the only relations it can have is the identity 1 and the empty relation 0, which combine in the expected manner (0·0 = 0·1 = 1·0 = 0, 1·1 = 1). Note that the identity corresponds to the empty string usually denoted λ or ε. Since 1^n = 1, the behavior of the machine can only take four forms, depending on whether it contains 0, 1, both, or neither, the last case being indistinguishable from the null lexeme over any size base. If the behavior is given by the empty string alone, we will call the lexeme 1 with the usual abuse of notation, independent of the size of the base set. If the behavior is given by the empty relation alone, we will call the lexeme 0, again independent of the size of the base set. Slightly more complex is the lexeme that contains both 0 and 1, which is rightly thought of as the union of 0 and 1, giving us the first example of an operation on lexemes. To fix the notation, in Table 2 we present the multiplication table of the semigroup R2 that contains all relations over two elements (for ease of typesetting the rows and columns corresponding to 0 and 1 are omitted). The remaining elements are denoted a, b, d, u, p, q, n, p', q', a', b', d', u', t – the prime is also used to denote an involution over the 16 elements which is not a semigroup homomorphism (but does satisfy x'' = x). Under this mapping, 0' = t and 1' = n; the rest follows from the naming conventions.
Table 2. Multiplication in R2: the 14 × 14 composition table over the elements a, b, d, u, p, q, n, p', q', a', b', d', u', t (the rows and columns corresponding to 0 and 1 are omitted for ease of typesetting).
To specify an arbitrary lexeme over a two-element base we need to select an alphabet as a subset of these letters, an FSA that generates some language over (the semigroup closure of) this alphabet, and fix one of the two base elements as the head. (To bring this definition in harmony with the one provided by Eilenberg we would also need to specify input and output mappings α and ω but we omit this step here.) Because any string of alphabetic letters reduces to a single element according to the semigroup multiplication, the actual behavior of the FSA is given by selecting one of the 2^16 subsets of the alphabet [0, 1, a, ..., t], so over a two-element base there can be no more than 65,536, and in general over an n-element base no more than 2^(2^(n^2)) non-isomorphic lexemes, since over n elements there will be n^2 ordered pairs and thus 2^(n^2) relations. While in principle the number of non-isomorphic lexemes could grow faster than exponentially in n, in practice the base can be limited to three (one partition for the printname, one for subject, and one for object) so the largest lexeme we need to countenance will have its alphabet size limited to 512. This is still very large, but the upper bound is very crude in that not all conceivable relations over three elements will actually be used: there may be operators that affect subject and object properties at the same time, but there aren't any that directly mix grammatical and phonological properties.

Most nominals, adjectives, adadjectives, and verbs will only need one content partition. Relational primitives such as x at y 'x is at location y'; x has y 'x is in possession of y'; x before y 'x temporally precedes y' will require two content partitions (plus a printname). As noted earlier, transitive and higher arity verbs will also generally require only one content partition: eats(x,y) may look superficially similar to has(x,y) but will receive a very different analysis. At this point, variables serve only as a convenient shorthand: as we shall see
shortly, specifying the actual combinatorics of the elements does not require parentheses, variables, or an operation of variable binding. Formally we could use more complex lexemes for ditransitives like give or show, or verbs with even higher arity such as rent, but in practice we will treat these as combinations of primitives with smaller arity, e.g. x gives y to z as x cause(z has y). (We will continue using both variables and natural language paraphrases as a convenient shorthand when this does not affect the argument we are making.)

Let us now turn to operations on lexemes. Given a set L of lexemes, each n-ary operation is a function from L^n to L. As is usual, distinguished elements of L such as null, 0, and 1 are treated as nullary operations. The key unary operations we will consider are step, denoted '; invstep, denoted `; and clean, denoted -. ' is simply an elementary step of the FSA (performed on edges) which acts as a relation on the partition X. As a result of step R, the active state moves from x0 to the image of x0 under R. The inverse step does the opposite.

The key binary operation is substitution, denoted by parens. The head of the dependent machine is built into the base of the head machine. For a simple illustration, recall the definition of mule as animal, cross between horses and donkeys, stubborn,... So far we said that one partition of the mule lexeme, the head, simply contains the conjunction (unordered list) of these and similar defining (essential) properties. Now assume, for the sake of the argument, that animal is not a primitive, but rather a similar conjunction living, capable of locomotion,... Substitution amounts to treating some part of the definiens as being a definiendum in its own right, and the substitution operation replaces the atomic animal on the list of essential properties defining mule by a conjunction living, capable of locomotion,... The internal bracketing is lost; what we have at the end of this step is simply a longer list living, capable of locomotion, cross between horses and donkeys, stubborn,... By repeated substitution we may remove living, stubborn, etc. – the role of the primitives in D is to guarantee that this process will terminate. But note that the semantic value of the list is not changed if we leave the original animal in place: as long as animals are truly defined as living things capable of locomotion, we have set-theoretical identity between animal, living, capable of locomotion and living, capable of locomotion (cf. our second remark above). Adding or removing redundant combinations of properties makes no difference.

Let us now consider the next term, cross between horses and donkeys. By analyzing what cross means we can obtain statements father(donkey,mule) and mother(horse,mule). We will ignore all the encyclopedic details (such as the fact that if the donkey is female and the horse male the offspring is called a hinny not a mule) and concentrate on the syntax: how can we describe a statement such as ∀x mule(x) → ∃y∃z horse(y) & female(y) & donkey(z) & male(z) & parent(x, y) & parent(x, z) without recourse to variables? First, note that the Boolean connective & is entirely unnecessary, since everything is defined by a conjunction of properties – at
best what is needed is to keep track of which parent has what gender, a matter that is generally handled by packing this information in a single lexical entry. Once we explain ∀x mule(x) → ∃y horse(y) & female(y) & parent(x, y) the rest will be easy. Again, note that it makes no difference whether we consider a female horse or mare which is a parent or a horse which is a female parent or mother; these combinations will map out the exact same set. Whether primitives such as mother, mare or being are available is a matter of how we design D. Either way, further quantification will enter the picture as soon as we start to unravel parent, a notion defined (at least for this case) by 'gives genetic material to offspring' which in turn boils down to 'causes offspring to have genetic material'. Note that both the quantification and the identity of the genetic material are rather weak: we don't know whether the parent gives all its genetic material or just part of it, and we don't know whether the material is the same or just a copy. But for the actual definition none of these niceties matter: what matters is that mules have horse genes and donkey genes. As a matter of fact, this simple definition applies to hinnies as well, which is precisely the reason why people who lack significant encyclopedic knowledge about this matter don't keep the two apart, and even those who do will generally agree that a hinny is a kind of a mule, and not the other way around (just as bitches are a kind of a dog, i.e. the marked member of the opposition). After all these substitution steps what remains on the list of essential mule properties includes complex properties such as has(horse genes) and capable of locomotion, but no variable is required as long as we grant that in any definiens the superordinate (subject) slot of has is automatically filled by the definiendum. Readers familiar with the Accessibility Hierarchy of Keenan and Comrie [20] and subsequent work may jump to the conclusion that one way or another the entire hierarchy (handled in HPSG and related theories by an ordered list) will be necessary, but we attempt to keep the mechanism under much tighter control. In particular, we assume no ternary relations whatsoever, so there are no such things as indirect objects, let alone obliques, in definitions.

To get further with capable of locomotion we need to provide at least a rudimentary theory of being capable of doing something, but here we feel justified in assuming that can, change, and place are primitives, so that can(change(place)) is good enough. Notice that what would have been the subject variables, who has the capability, who performs the change, and who has the place, are all implicitly bound to the same superordinate entity, the mule.

To make further progress on horse genes we also need a theory of compound nouns: what are horse genes if not genes characteristic of horses, and if they are indeed characteristic of horses how come that mules also have them, and in an essential fashion to boot? The key to understanding horse gene and similar compounds such as gold bar is that we need to supply a predicate that binds the two terms together, what classical grammar calls 'the genitive of material' that we will write as made of. A full analysis of this notion is beyond the limits of this paper, but we note that the central idea of made of is production, generation: the bar is produced from/of/by gold, and the genes in question are
produced from/of/by horses. This turns the Kripkean idea of defining biological kinds by their genetic material on its head: what we assume is that horse genes are genes defined by their essential horse-ness rather than horses are animals defined by carrying the essence of horse-ness in their genes. (Mules are atypical in this respect, in that their essence can’t be fully captured without reference to their mixed parentage.)
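The termination-by-primitives argument behind these substitution steps can be illustrated with a toy expansion procedure. The dictionary below is our own miniature stand-in for D and the definitions involved; properties are kept as unordered sets, mirroring the head partition.

DEFS = {
    'mule': {'animal', 'cross between horses and donkeys', 'stubborn'},
    'animal': {'living', 'capable of locomotion'},
}

def expand(props, defs):
    """Recursively replace definable properties by their definitions;
    anything not in defs counts as primitive, so the recursion terminates."""
    out = set()
    for p in props:
        out |= expand(defs[p], defs) if p in defs else {p}
    return out

print(expand(DEFS['mule'], DEFS))
# {'living', 'capable of locomotion', 'cross between horses and donkeys',
#  'stubborn'} -- and leaving the redundant 'animal' in place would denote
# the exact same set, as noted above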
4 Conclusions
In the Introduction we listed some desiderata for a theory of the lexicon. First, adequate support for the traditional lexicographic tasks such as distinguishing word senses, deciding whether two words/senses are synonymous or perhaps antonymous, whether one expression can be said to be a paraphrase of another, etc. We see how the current proposal does this: two lexemes are synonymous iff they are mapped on isomorphic machines. Since finer distinctions largely rest in the eidopoios diaphora that we blatantly ignore, there are many synonyms: for example we define both poodle and greyhound as dog. Second, we wanted the theory of lexical semantics to connect to a theory of the meaning of larger (non-lexicalized) constructions including, but not necessarily limited to, sentential syntax and semantics. The theory proposed here meets this criterion maximally, since it uses the exact same mechanism to describe meaning starting from the smallest morpheme to the largest construction (but not beyond, as communicative dynamics is left untreated). Third, we wanted the theory to provide a means of linking up meanings across languages, serving as a translation pivot. While making good on this promise is obviously beyond the scope of this paper, it is clear that in the theory proposed here such a task must begin with aligning the primitives D developed for one language with those developed for another, a task we find quite doable at least as far as the major branches of IE (Romance, Slavic, and Germanic) are concerned. Finally, we said that the theory should be coupled to some theory of inference that enables, at the very least, common sense reasoning about objects, people, and natural phenomena. We don’t claim to have a full solution, but we conclude this paper with some preliminary remarks on the main issues. The complexities of the logic surrounding lexemes are not exactly at the same points where we find complexities in mathematical logic. In particular truth, which is treated as a primitive notion in mathematical logic, will be treated as a derived concept here, paraphrased as ‘internal model corresponds in essence to external state of affairs’. This is almost the standard correspondence theory of truth, but the qualification ‘in essence’ takes away much of the deductive power of the standard theory. The mode of inferencing supported here is not sound. For example, consider the following rule: if A’ is part of A and B’ is the same part of B and A is bigger than B, then A’ is bigger than B’. Let’s call this the Rule of Proportional Size, RPS. A specific instance would be that children’s feet are smaller than adults’ feet since children are smaller than adults. Note that the rule is only statistically true: we can well imagine e.g. a bigger building with smaller rooms. Note also that both the premises and the conclusion
are defeasible: there may be some children who are bigger than some adults to begin with, and we don't expect the rule to hold for them (this is a (meta)rule of its own, what we will call Specific Application), and even if the premises are met the conclusion need not follow; the rule is not sound. Nevertheless, we feel comfortable with these rules, because they work most of the time, and when they don't, a specific failure mode can always be found: e.g. we will claim that the small building with the larger rooms, or the large building with the smaller rooms, is somehow not fully proportional, or that there are more rooms in the big building, etc. Also, such rules are statistically true, and they often come from inverting or otherwise generalizing rules which are sound, e.g. the rule that if we build A from bigger parts A' than the parts B' that B is built from, A will be bigger than B. (This follows from our general notion of size which includes additivity.)

Once we do away with the soundness requirement for inference rules, we are no longer restricted to the handful of rules which are actually sound. We permit our rule base to evolve: for example the very first version of RPS may just say that big things have big parts (so that children's legs also come out smaller than adults' arms, something that will trigger a lot of counterexamples and thus efforts at rule revision); the restriction on it being the same part may only come later. Importantly, the old rule doesn't go away just because we have a better new rule. What happens is that the new rule gets priority in the domain it was devised for, but the old rule is still considered applicable elsewhere.
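A minimal sketch of this priority mechanism, under our own illustrative encoding (none of the rule or predicate names below come from the paper): rules are tried from newest and most specific to oldest, the first one whose domain test succeeds fires, and the older, more general rule remains available outside the newer rule's domain.

def rps_v2(a, b, query):
    # Refined RPS: only compares the *same* part of A and B
    if query in a['parts'] and query in b['parts'] and a['size'] > b['size']:
        return (a['parts'][query], 'bigger than', b['parts'][query])

def rps_v1(a, b, query):
    # Original, cruder rule: big things have big parts
    if a['size'] > b['size'] and query in b['parts']:
        return ('every part of A', 'bigger than', b['parts'][query])

RULES = [rps_v2, rps_v1]          # newest (highest priority) first

def infer(a, b, query):
    for rule in RULES:
        result = rule(a, b, query)
        if result:                 # first applicable rule wins, but the
            return result          # older rules stay in the rule base
    return None

adult = {'size': 175, 'parts': {'foot': 'adult foot'}}
child = {'size': 120, 'parts': {'foot': 'child foot'}}
print(infer(adult, child, 'foot'))  # ('adult foot', 'bigger than', 'child foot')
print(infer(child, adult, 'foot'))  # None: premises fail, no conclusion drawn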
Acknowledgements. We thank Tibor Beke (UMass Lowell) for trenchant criticism of earlier versions.
References

1. Allen, B., Gardiner, D., Frantz, D.: Noun incorporation in Southern Tiwa. IJAL 50 (1984)
2. Allen, J., Ferguson, G.: Actions and events in interval temporal logic. Journal of Logic and Computation 4(5), 531–579 (1994)
3. Aronoff, M.: Word Formation in Generative Grammar. MIT Press, Cambridge (1976)
4. Aronoff, M.: Orthography and linguistic theory: The syntactic basis of masoretic Hebrew punctuation. Language 61(1), 28–72 (1985)
5. Blackburn, P., Bos, J.: Representation and Inference for Natural Language: A First Course in Computational Semantics. CSLI, Stanford (2005)
6. Boguraev, B.K., Briscoe, E.J.: Computational Lexicography for Natural Language Processing. Longman (1989)
7. Cawdrey, R.: A table alphabetical of hard usual English words (1604)
8. Dowty, D.: Word Meaning and Montague Grammar. Reidel, Dordrecht (1979)
9. Dowty, D.: Thematic proto-roles and argument selection. Language 67, 547–619 (1991)
10. Eco, U.: The Search for the Perfect Language. Blackwell, Oxford (1995)
11. Eilenberg, S.: Automata, Languages, and Machines, vol. A. Academic Press, London (1974)
12. Fillmore, C., Atkins, S.: Framenet and lexicographic relevance. In: Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain (1998)
13. Fillmore, C., Kay, P.: Berkeley Construction Grammar (1997), http://www.icsi.berkeley.edu/~kay/bcg/ConGram.html
14. Fillmore, C., Atkins, B.: Starting where the dictionaries stop: The challenge of corpus lexicography. Computational Approaches to the Lexicon, 349–393 (1994)
15. Flickinger, D.P.: Lexical Rules in the Hierarchical Lexicon. PhD thesis, Stanford University (1987)
16. Graham, A.: Two Chinese Philosophers. London (1958)
17. Gruber, J.: Lexical structures in syntax and semantics. North-Holland, Amsterdam (1976)
18. Hayes, P.: The naive physics manifesto. Expert Systems (1979)
19. Hobbs, J.: Deep lexical semantics. In: Gelbukh, A. (ed.) CICLing 2008. LNCS, vol. 4919, pp. 183–193. Springer, Heidelberg (2008)
20. Keenan, E., Comrie, B.: Noun phrase accessibility and universal grammar. Linguistic Inquiry 8(1), 63–99 (1977)
21. Keller, H.: The Story of My Life. Dover, New York (1903)
22. Kiparsky, P.: From cyclic phonology to lexical phonology. In: van der Hulst, H., Smith, N. (eds.) The Structure of Phonological Representations, vol. I, pp. 131–175. Foris, Dordrecht (1982)
23. Kiparsky, P.: On the architecture of Pāṇini's grammar. ms., Stanford University (2002)
24. Kipper, K., Dang, H.T., Palmer, M.: Class based construction of a verb lexicon. In: AAAI-2000 Seventeenth National Conference on Artificial Intelligence, Austin, TX (2000)
25. Kirsner, R.: From meaning to message in two theories: Cognitive and Saussurean views of the Modern Dutch demonstratives. Conceptualizations and Mental Processing in Language, 80–114 (1993)
26. Kornai, A.: Mathematical Linguistics. Springer, Heidelberg (2008)
27. Kornai, A.: The treatment of ordinary quantification in English proper. Hungarian Review of Philosophy 51 (2009) (to appear)
28. Lenat, D.B., Guha, R.: Building Large Knowledge-Based Systems. Addison-Wesley, Reading (1990)
29. Levin, B.: English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago (1993)
30. Lewis, D.: General semantics. Synthese 22(1), 18–67 (1970)
31. McCarthy, J.: An example for natural language understanding and the AI problems it raises. Formalizing Common Sense: Papers by John McCarthy. Ablex Publishing Corporation 355 (1976)
32. McCarthy, J.: Human-level AI is harder than it seemed in 1955 (2005), http://www-formal.stanford.edu/jmc/slides/wrong/wrong-sli/wrong-sli.html
33. Mitchell, T.M., Shinkareva, S., Carlson, A., Chang, K., Malave, V., Mason, R., Just, M.: Predicting human brain activity associated with the meanings of nouns. Science 320(5880), 1191–1195 (2008)
34. Montague, R.: Universal grammar. Theoria 36, 373–398 (1970)
35. Ogden, C.: Basic English: A General Introduction with Rules and Grammar. K. Paul, Trench, Trubner (1944)
36. Pustejovsky, J.: The Generative Lexicon. MIT Press, Cambridge (1995)
37. Quillian, M.R.: Semantic memory. In: Minsky, M. (ed.) Semantic Information Processing, pp. 227–270. MIT Press, Cambridge (1967)
38. Quillian, M.R.: Word concepts: A theory and simulation of some basic semantic capabilities. Behavioral Science 12, 410–430 (1968)
39. Ruhl, C.: On Monosemy: A Study in Linguistic Semantics. State University of New York Press (1989)
40. Schank, R.C.: Conceptual dependency: A theory of natural language understanding. Cognitive Psychology 3(4), 552–631 (1972)
41. Tarski, A., Givant, S.: A Formalization of Set Theory without Variables. American Mathematical Society (1987)
42. Trier, J.: Der Deutsche Wortschatz im Sinnbezirk des Verstandes. C. Winter (1931)
43. Turner, R.: Montague semantics, nominalisations and Scott's domains. Linguistics and Philosophy 6, 259–288 (1983)
44. Turner, R.: Three theories of nominalized predicates. Studia Logica 44(2), 165–186 (1985)
45. Wierzbicka, A.: Lexicography and Conceptual Analysis. Karoma, Ann Arbor (1985)
46. Williams, E.: On the notions lexically related and head of a word. Linguistic Inquiry 12, 245–274 (1981)
Phonological Interpretation into Preordered Algebras

Yusuke Kubota and Carl Pollard

The Ohio State University, Columbus OH 43210, USA
Abstract. We propose a novel architecture for categorial grammar that clarifies the relationship between semantically relevant combinatoric reasoning and semantically inert reasoning that only affects surface-oriented phonological form. To this end, we employ a level of structured phonology that mediates between syntax (abstract combinatorics) and phonology proper (strings). To notate structured phonologies, we employ a lambda calculus analogous to the φ-terms of [8]. However, unlike Oehrle’s purely equational φ-calculus, our phonological calculus is inequational, in a way that is strongly analogous to the functional programming language LCF [10]. Like LCF, our phonological terms are interpreted into a Henkin frame of posets, with degree of definedness (‘height’ in the preorder that interprets the base type) corresponding to degree of pronounceability; only maximal elements are actual strings and therefore fully pronounceable. We illustrate with an analysis (also new) of some complex constituent-order phenomena in Japanese.
1 Introduction
Standard denotational semantics of functional programming languages (FPLs) follows [10] in interpreting programs and their parts into a Henkin frame of (complete) posets whose orders correspond to degree of definedness. Maximal members of the poset that interpret the base types are the interpretations of values, the irreducible terms returned by terminating programs. One reasons about meanings of program constructs in an inequational logic that extends the familiar equational logic for reasoning about meanings of lambda-terms. In a separate development initiated by [11] and [5], building on [7], multimodal categorial grammars (MMCGs) are phonologically interpreted by labelling syntactic derivations not just with meaning terms, but also with terms which take their denotations in an algebra whose operations model modes of phonological combination. (Following Oehrle, we call these φ-terms, and write a; m; A for syntactic type A labelled with φ-term a and meaning term m.) Combining these two ideas, we propose φ-term labelling for MMCGs with interpretation into preordered algebras whose 'modes of phonological combination' are binary operations monotonic in both arguments. Unlike the denotational semantics of FPLs, though, deducibility of a ≤ a' means not that a' is at least as defined as a, but rather that a' is at least as pronounceable as a, in the sense that any syntactic derivation that can be phonologically interpreted as a can also
be phonologically interpreted as a'. And the maximal elements of the algebra interpreting the base type phon (the interpretations of the phonological values) are the ones that can be 'returned' in the sense of being actual pronunciations. Besides conceptual clarity, the main advantage of this approach is the wholesale elimination from the syntactic type logic of structural axioms or rules that govern semantically inert word order variation; all can be replaced by a single interface schema asserting that if a ≤ a' is provable in the inequational theory associated with the φ-calculus and a; m; A is derivable in the syntactic calculus, then so is a'; m; A.

The paper is organized as follows. Section 2 recalls basic definitions about preordered algebras. Section 3 reviews relevant aspects of the semantics of the programming language PCF. Section 4 sketches our system of φ-labelling in the context of an MMCG for a Japanese fragment. And Section 5 uses this grammar to analyze some complex facts about word order in Japanese.
2 Preordered Algebras
A preorder is a reflexive transitive relation, and an order is an antisymmetric preorder. A preordered set (resp. poset) is a set P together with a preorder (resp. order), but usually we just call a preordered set a preorder. Any preorder ⊑ induces an equivalence relation defined by p ≡ q iff p ⊑ q and q ⊑ p. A function from one preorder to another is monotonic (resp. antitonic) if it preserves (resp. reverses) the preorder, and tonic if it is either monotonic or antitonic. Any collection of preorders generates a Henkin frame by closure under exponentiation, where the exponential P → Q of two preorders is taken to be the set of monotonic functions from P to Q with the pointwise preorder. A preordered (resp. monotonic) algebra is a preorder together with a collection of tonic (resp. monotonic) operations. A simple example of a monotonic algebra is a presemigroup, with one binary operation ◦ which is associative up to equivalence (u.t.e.), i.e. (p ◦ q) ◦ r ≡ p ◦ (q ◦ r). A less familiar example is a monotonic algebra with one operation < which is only left-associative, i.e. (p < q) < r ⊑ p < (q < r).

A cpo is a poset with a bottom (⊥) element in which every chain has a least upper bound (lub). A function from one cpo to another is continuous if it preserves lubs of chains. Continuous functions can be shown to be monotonic. The cpo's form a Henkin frame with the sets of continuous functions as exponentials. Every continuous function f from a cpo to itself has a least fixed point, equal to ⊔_{n≥0} f^n(⊥). A flat cpo is one where p ⊑ q implies p = q or p = ⊥.
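The role of ⊥ and of the Kleene iterates f^n(⊥) can be seen in a few lines of code. The sketch below is our illustration, anticipating Section 3: it models the flat cpo of naturals with None as ⊥ and computes successive approximations to the factorial function.

BOT = None          # bottom of the flat cpo: the undefined value

def fact_step(f):
    """One application of the factorial functional: given an approximation f,
    return a better-defined approximation."""
    def g(n):
        if n == 0:
            return 1
        prev = f(n - 1)
        return BOT if prev is BOT else n * prev
    return g

approx = lambda n: BOT           # the everywhere-undefined function
for _ in range(10):              # the k-th iterate is defined on 0..k-1
    approx = fact_step(approx)

print(approx(6))    # 720: already determined by this finite approximation
print(approx(12))   # None: defined only in the least fixed point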
3 PCF
As an alternative to early untyped FPLs, [10] proposed a typed lambda calculus, later dubbed LCF (logic of computable functions), with base types nat and bool . LCF includes constants for truth values, natural numbers, successor, predecessor, and zero-test, as well as McCarthy conditionals and recursion operators. [9]
designed a simple LCF-based FPL, PCF, in which programs are modelled as closed terms of a base type, and computation as a form of call-by-name (CBN) evaluation. The irreducible terms returned by terminating programs are called values. Evaluation is confluent, but because of the recursion operators, program termination is not guaranteed. LCF/PCF is equipped with an interpretation I (denotational semantics) into a Henkin frame whose domains are cpo's, with functional terms interpreted as continuous functions. Base types are interpreted as flat domains, with the bottoms being the meanings of nonterminating programs. For two functional terms s and t of the same type, I(s) ⊑ I(t) means that, thought of as partial functions, I(t) is at least as defined as I(s). One reasons about the meanings of (parts of) PCF programs using an inequational logic whose formulas are inequalities s ≤ t between terms of the same type. Predictably, the axioms and rules of the inequational logic include β-conversion, Reflexivity and Transitivity (sound because the domains are cpo's, hence preorders), as well as a form of Monotonicity (of function application only with respect to the first (the function) argument, corresponding to CBN evaluation, sound because the order on functions is defined pointwise).
4 MMCG with Preordered Phonology
Following Morrill and Solias 1993, we assign phonological and semantic labels to types in syntactic derivations, e.g. a; m; A; such triples are called signs. Derivations themselves are written in the familiar natural-deduction style with hypotheses introduced at the top. Besides the usual logical rules (Introduction and Elimination rules for the various flavors of / and \), the syntactic calculus includes an Interface rule, which asserts that if a ≤ a' is deducible in the inequational theory associated with the calculus of phonological terms (hereafter, φ-calculus, to be described below), then any sign with φ-component a has a counterpart with the same meaning and the same syntactic category but with φ-component a'. The Interface rule, together with the inequational φ-theory that governs the various modes of phonological combination and their interactions with each other, obviate the need for any structural rules in the syntax. In short: the φ-calculus governs semantically inert surface-oriented word order variation; the logical rules of the syntax govern the semantically relevant syntactic combinatorics; and the Interface rule is the channel of communication between the syntax and the phonology that enables signs to become pronounceable. The rules of the syntactic calculus are as follows:
(1) a. Forward Slash Elimination

        a; m; A/iB    b; n; B
        ───────────────────── /iE
           a ◦i b; m(n); A

    b. Backward Slash Elimination

        b; n; B    a; m; B\iA
        ───────────────────── \iE
           b ◦i a; m(n); A
(2) a. Forward Slash Introduction

            [p; x; A]^n
                ⋮
           b ◦i p; m; B
        ───────────────── /iI^n
          b; λx.m; B/iA

    b. Backward Slash Introduction

            [p; x; A]^n
                ⋮
           p ◦i b; m; B
        ───────────────── \iI^n
          b; λx.m; A\iB
(3) Interface rule

          a; m; A
        ────────── PI
          a'; m; A

    (where a ≤ a' is a theorem in the inequational theory of φ-terms)

Although most of the generalizations regarding word order are taken care of in the φ-calculus, the left- vs. right-slash distinction is retained for syntactic types. The semantic and phonological annotations on the rules are mostly as one would expect. The semantics is just function application and lambda abstraction for Elimination and Introduction rules, respectively. For Forward and Backward Slash Elimination, the phonology of the derived expression is obtained by combining the phonologies of the functor and the argument in the right order and in the right mode. And the Forward (respectively, Backward) Slash Introduction rule says roughly that a linguistic expression whose phonology is b alone is of category B/A (resp. A\B), given the hypothetical proof that b concatenated with p (whose category is A) to its right (resp. left) is of category B.

The phonology for the derived expression in the Introduction rules might look somewhat non-standard, in that, unlike the semantics, it does not seem to involve lambda abstraction. Instead, the phonology of the hypothetically assumed expression (which is a variable of type phon) is simply stripped off from the phonology of the whole expression. To motivate this, we can think of the derived expression in the Introduction rule as actually having the following phonology:

(4) λp.[b ◦i p](ε)

where ε is the identity element (interpreted as a null string). But in the inequational φ-theory (see below), this is provably equal to b (by β-conversion followed by one of the inequalities for the null phonology). That is, unlike in semantics, when the variable is bound, the resulting abstract is immediately applied to the null-phonology term, so that nothing can be 'reconstructed' to a gap in the surface string. The Introduction rules as stated can be thought of as implicitly compiling in these reduction steps together with an application of the Interface rule to produce a conclusion with the simplified phonology b.

This is as it should be, given the typical way in which hypothetical reasoning is used in linguistic analyses in categorial grammar. That is, hypothetical
That is, hypothetical reasoning is used in the analyses of phenomena that involve some kind of displaced constituency (such as long-distance dependencies and verb raising), and the phonologies of the hypothesized expressions should appear in the displaced position in the surface string rather than be reconstructed in the original position. For this reason, the original 'gap' of the displaced element is filled by a null element as soon as the hypothetical reasoning step is taken. (We will illustrate how this works in an actual linguistic analysis with the Japanese fragment in the next section.)

As stated above, the Interface rule is the channel between syntax and phonology, and the actual relation of relative pronounceability among phonologies of linguistic expressions is defined by the preorder imposed on the domain that interprets the phonological base type in the Henkin frame that models the inequational φ-theory. Thus this theory is analogous to the inequational theory used to reason about relative definedness of PCF programs, and closely resembles it in form. Specifically, the φ-calculus is a typed lambda calculus with one base type phon, constants of type phon (lexical phonologies and the null phonology), and constants of type phon → phon → phon (modes of φ-composition). As with PCF, the formulas of the inequational theory are of the form a ≤ b, but now the inequated terms denote not program constructs with varying degrees of definedness, but rather phonological entities with varying degrees of pronounceability. Also as with PCF, the axioms (or reduction rules) include β-conversion, Reflexivity (5a), Transitivity (5b), and a form of Monotonicity (for all the φ-modes, in both arguments) (6). Of course PCF also has parochial axioms (reduction rules for the built-in arithmetic operators, McCarthy conditionals, and fixed-point operators); and the φ-calculus does too, namely word-order variation rules for the specific modes ((8) with i and j being the same mode), rules of interaction between modes ((8) with i and j being different modes), and a rule that allows any of the 'flavored', unpronounceable modes to be replaced by the 'plain vanilla', fully pronounceable mode (concatenation) (9). In the model, phon is interpreted as a monotonic algebra whose maximal members (values) are the pronounceable phonologies (strings). φ-term reduction always terminates, but is non-confluent, corresponding to semantically inert word order variation.
(5) Structured phonologies form a preorder.

    a.   a ≤ a    REFL

    b.   a ≤ b    b ≤ c
         --------------- TRANS
         a ≤ c

(6) The ◦i are monotonic in both arguments.

         a ≤ a'    b ≤ b'
         ------------------- MON      (◦i ∈ {◦, ◦>, ◦<, ◦×, ◦·})
         a ◦i b ≤ a' ◦i b'

(7) ε is a two-sided identity for all the ◦i.

    a.   ε ◦i a ≤ a    IDl      (◦i ∈ {◦, ◦>, ◦<, ◦×, ◦·})

    b.   a ◦i ε ≤ a    IDr      (◦i ∈ {◦, ◦>, ◦<, ◦×, ◦·})
(8) Mode-specific rules:

    a.   (a ◦i b) ◦j c ≤ a ◦i (b ◦j c)    LA (left associativity)     (◦i, ◦j ∈ {◦<, ◦·})²

    b.   a ◦i (b ◦j c) ≤ (a ◦i b) ◦j c    RA (right associativity)    (◦i, ◦j ∈ {◦>, ◦·})

    c.   a ◦i (b ◦j c) ≤ b ◦i (a ◦j c)    PERM (permutation)          (◦i, ◦j ∈ {◦×, ◦·})

(9) Concatenation mode is most pronounceable (phonological values are strings).

         a ◦i b ≤ a ◦ b    CONCAT    (◦i ∈ {◦, ◦>, ◦<, ◦×, ◦·})
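The inequational theory (5)-(9) is just a rewrite system over φ-terms, so it is easy to prototype. Below is a minimal Haskell sketch (our own transcription, with hypothetical names; Mode and Phon repeat the declarations of the earlier sketch so that this block compiles on its own): root implements the axioms (7)-(9) at the root of a term, step closes them under Monotonicity (6), and reachable computes the finite set of terms derivable via Reflexivity and Transitivity (5).

import qualified Data.Set as Set

data Mode = Plain | FwdW | BwdW | PermM | Scr        -- ◦, ◦>, ◦<, ◦×, ◦·
  deriving (Eq, Ord, Show)

data Phon = Eps | Word String | Comp Mode Phon Phon  -- ε, lexical items, a ◦i b
  deriving (Eq, Ord, Show)

-- Axioms applied at the root of a term.
root :: Phon -> [Phon]
root t =
     [a | Comp _ Eps a <- [t]]                                  -- IDl (7a)
  ++ [a | Comp _ a Eps <- [t]]                                  -- IDr (7b)
  ++ [Comp i a (Comp j b c) | Comp j (Comp i a b) c <- [t]
                            , all (`elem` [BwdW, Scr]) [i, j]]  -- LA (8a)
  ++ [Comp j (Comp i a b) c | Comp i a (Comp j b c) <- [t]
                            , all (`elem` [FwdW, Scr]) [i, j]]  -- RA (8b)
  ++ [Comp i b (Comp j a c) | Comp i a (Comp j b c) <- [t]
                            , all (`elem` [PermM, Scr]) [i, j]] -- PERM (8c)
  ++ [Comp Plain a b | Comp i a b <- [t], i /= Plain]           -- CONCAT (9)

-- One-step rewriting anywhere in a term (Monotonicity, (6)).
step :: Phon -> [Phon]
step t = root t ++ case t of
  Comp i a b -> [Comp i a' b | a' <- step a] ++ [Comp i a b' | b' <- step b]
  _          -> []

-- All terms reachable from a given term (Reflexivity and Transitivity, (5)).
reachable :: Phon -> Set.Set Phon
reachable t0 = go Set.empty [t0]
  where
    go seen []     = seen
    go seen (t:ts)
      | t `Set.member` seen = go seen ts
      | otherwise           = go (Set.insert t seen) (step t ++ ts)

-- A φ-term is pronounceable when it uses the concatenation mode only.
pronounceable :: Phon -> Bool
pronounceable (Comp Plain a b) = pronounceable a && pronounceable b
pronounceable (Comp _ _ _)     = False
pronounceable _                = True

For example, Set.filter pronounceable (reachable t) enumerates the pronounceable variants of a φ-term t; that this set can contain more than one string is exactly the non-confluence mentioned above.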
5  A Japanese Fragment
We illustrate the system developed in the previous section with an analysis of the word-order variation found in the -te form complex predicate construction in Japanese. A fuller set of facts, and an analysis embodying the same basic idea but formulated within the framework of Multi-Modal Combinatory Categorial Grammar [1], can be found in [3]. We here focus on two facts: the basic scrambling pattern and the patterns of argument cluster coordination in the -te form complex predicate. In both of these cases, the embedded verb (V1) and the embedding verb (V2) cluster together, suggesting that they are combined in a mode that is tighter than the mode in which ordinary nominal arguments are combined with the verbs that subcategorize for them. V1 and V2 are independent words syntactically, however, given that this construction systematically differs from the more typical cases of lexical complex predicate constructions in Japanese in terms of the wordhood of the sequence of V1 and V2.³

Japanese does not freely allow long-distance scrambling. That is, normally, elements of an embedded clause cannot be scrambled to the domain of the higher clause. However, in the -te form complex predicate construction, such scrambling patterns are perfectly acceptable, suggesting that V1 and V2 form a single clausal domain in this construction. (10b) is a case in which the accusative object piano-o of V1 is scrambled over a matrix dative argument John-ni:
² Note about notation: the modes i and j can (but need not) be identical.
³ For the relevant set of facts, see [3].
(10) a. Mary-ga   John-ni   piano-o    hii-te   morat-ta.
        Mary-NOM  John-DAT  piano-ACC  play-TE  BENEF-PAST
        'Mary had John play the piano for her.'

     b. Mary-ga piano-o John-ni hii-te morat-ta.
Argument cluster coordination patterns also suggest that, in this construction, (semantic) arguments of V1 and V2 are clausemates. As in (11), argument cluster coordination involving arguments of both V1 and V2 is possible, as long as the cluster of V1 and V2 is not split apart.

(11) Mary-ga   [John-ni  piano-o],  [Bill-ni  gitaa-o]    hii-te   morat-ta.
     Mary-NOM   John-DAT piano-ACC   Bill-DAT guitar-ACC  play-TE  BENEF-PAST
     'Mary had John play the piano and Bill play the guitar for her.'

Under the account we propose, these word-order variation facts receive a straightforward explanation in terms of φ-modalities. The gist of the analysis is that V1 and V2 combine in a mode that is tighter than the mode in which ordinary arguments combine with the verbs that subcategorize for them, but looser than the mode in which components of lexical complex predicates combine. Specifically, we specify in the lexicon that V1 and V2 in this construction combine with one another in the left-associative mode ◦<, which is distinct from the default scrambling mode ◦· used for putting together nominal arguments with their verbal heads; the latter mode is both left- and right-associative and permutative. A sample lexicon is given in (12):

(12) mary-ga;  m; NPn
     john-ni;  j; NPd
     bill-ni;  b; NPd
     piano-o;  p; NPa
     gitaa-o;  g; NPa
     hii-te;   play; NPa \· NPn \· S
     morat-ta; λPλyλx.benef(x, P(y)); (NPn \· S) \< (NPd \· NPn \· S)
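For concreteness, the lexical entries in (12) can be transcribed into the datatypes of the first sketch (assuming those definitions are in scope; the category encoding and all names below are our own):

-- Atomic categories of the fragment (names are ours).
npN, npA, npD, s :: Cat
npN = Atom "NPn"; npA = Atom "NPa"; npD = Atom "NPd"; s = Atom "S"

-- piano-o; p; NPa   and   john-ni; j; NPd
pianoO, johnNi :: Sign
pianoO = Sign (Word "piano-o") (Con "p") npA
johnNi = Sign (Word "john-ni") (Con "j") npD

-- hii-te; play; NPa \· NPn \· S   (the slashes associate to the right)
hiiTe :: Sign
hiiTe = Sign (Word "hii-te") (Con "play")
             (Under Scr npA (Under Scr npN s))

-- morat-ta; λPλyλx.benef(x, P(y)); (NPn \· S) \< (NPd \· NPn \· S)
moratTa :: Sign
moratTa = Sign (Word "morat-ta")
               (Lam "P" (Lam "y" (Lam "x"
                 (App (App (Con "benef") (Var "x")) (App (Var "P") (Var "y"))))))
               (Under BwdW (Under Scr npN s)
                           (Under Scr npD (Under Scr npN s)))

With these entries, bwdElim pianoO hiiTe >>= (\v -> bwdElim v moratTa) >>= bwdElim johnNi reproduces steps 3, 5, and 7 of derivation (13) below, up to the final Interface step.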
The derivation for (10b) is given in (13):

(13) 1. piano-o; p; NPa                                                      (lexicon)
     2. hii-te; play; NPa \· NPn \· S                                        (lexicon)
     3. piano-o ◦· hii-te; play(p); NPn \· S                                 (\· E, 1, 2)
     4. morat-ta; λPλyλx.benef(x, P(y)); (NPn \· S) \< (NPd \· NPn \· S)     (lexicon)
     5. (piano-o ◦· hii-te) ◦< morat-ta; λyλx.benef(x, play(p)(y)); NPd \· NPn \· S
                                                                             (\< E, 3, 4)
     6. john-ni; j; NPd                                                      (lexicon)
     7. john-ni ◦· ((piano-o ◦· hii-te) ◦< morat-ta); λx.benef(x, play(p)(j)); NPn \· S
                                                                             (\· E, 6, 5)
     8. piano-o ◦ (john-ni ◦ (hii-te ◦ morat-ta)); λx.benef(x, play(p)(j)); NPn \· S
                                                                             (PI, 7)
The key step in this derivation is the last one. Two things happen here: (i) the direct object piano-o of the embedded verb scrambles over the dative argument of the higher verb, resulting in a surface order in which the former linearly precedes the latter, and (ii) the modes by which the φ-terms of the lexical words are combined are all converted to the concatenation mode ◦, so that we obtain a pronounceable φ-term for the derived expression. Technically, this last step is an application of the Interface rule, whose validity is supported by the following lemma in the inequational logic for φ-terms:
(14) Lemma: a ◦· ((b ◦· c) ◦< d) ≤ b ◦ (a ◦ (c ◦ d))

     Proof:
     1. a ≤ a                                              (REFL)
     2. (b ◦· c) ◦< d ≤ b ◦· (c ◦< d)                      (LA)
     3. a ◦· ((b ◦· c) ◦< d) ≤ a ◦· (b ◦· (c ◦< d))        (MON, 1, 2)
     4. a ◦· (b ◦· (c ◦< d)) ≤ b ◦· (a ◦· (c ◦< d))        (PERM)
     5. a ◦· ((b ◦· c) ◦< d) ≤ b ◦· (a ◦· (c ◦< d))        (TRANS, 3, 4)
     6. b ◦· (a ◦· (c ◦< d)) ≤ b ◦ (a ◦ (c ◦ d))           (CONCAT, MON, TRANS; routine steps elided)
     7. a ◦· ((b ◦· c) ◦< d) ≤ b ◦ (a ◦ (c ◦ d))           (TRANS, 5, 6)
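The lemma can also be checked mechanically with the rewrite sketch given after (9) (assuming its Mode, Phon, and reachable definitions are in scope; the instantiation of a, b, c, d follows derivation (13)):

-- A hedged sanity check, not part of the original paper:
-- search the rewrite space of the lemma's left-hand side
-- for its right-hand side.
lemmaHolds :: Bool
lemmaHolds = rhs `Set.member` reachable lhs
  where
    [a, b, c, d] = map Word ["john-ni", "piano-o", "hii-te", "morat-ta"]
    lhs = Comp Scr a (Comp BwdW (Comp Scr b c) d)       -- a ◦· ((b ◦· c) ◦< d)
    rhs = Comp Plain b (Comp Plain a (Comp Plain c d))  -- b ◦ (a ◦ (c ◦ d))

Evaluating lemmaHolds exhaustively searches the finite rewrite space and returns True, retracing the LA, PERM, and CONCAT steps of the proof above.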
[Figure omitted.] Fig. 2. Interaction principles: output (7.1) (left) and (7.3) (right)
Grishin considers a second group of interaction principles. In the rule format, these are the converses of (7), where premise and conclusion change place. Characteristic theorems for this second group are (11), i.e. the converses of (8), and (12).

(11) (1) (A ⦸ B) ⊗ C → A ⦸ (B ⊗ C)      (2) C ⊗ (A ⦸ B) → A ⦸ (C ⊗ B)
     (3) C ⊗ (B ⊘ A) → (C ⊗ B) ⊘ A      (4) (B ⊘ A) ⊗ C → (B ⊗ C) ⊘ A

(12) (1) (A ⊕ B) ⊗ C → A ⊕ (B ⊗ C)      (2) C ⊗ (A ⊕ B) → A ⊕ (C ⊗ B)
     (3) C ⊗ (B ⊕ A) → (C ⊗ B) ⊕ A      (4) (B ⊕ A) ⊗ C → (B ⊗ C) ⊕ A
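Like the φ-calculus rewrites earlier, these principles can be prototyped as one-step rewrites over formula trees. The following sketch is our own encoding and rests on two flagged assumptions: we write LDiff a b for the left difference A ⦸ B and RDiff b a for the right difference B ⊘ A (the co-implications of the Lambek-Grishin vocabulary), and we treat the derived theorems (12) alongside (11) as rewrite clauses:

-- Formulas: atoms, A ⊗ B, A ⊕ B, and the difference operations.
data Form = At String
          | Times Form Form   -- A ⊗ B
          | Plus  Form Form   -- A ⊕ B
          | LDiff Form Form   -- A ⦸ B (assumed notation)
          | RDiff Form Form   -- B ⊘ A (assumed notation)
  deriving (Eq, Show)

-- One-step applications of (11) and (12) at the root of a formula.
secondGroup :: Form -> [Form]
secondGroup f =
     [LDiff a (Times b c) | Times (LDiff a b) c <- [f]]  -- (11.1)
  ++ [LDiff a (Times c b) | Times c (LDiff a b) <- [f]]  -- (11.2)
  ++ [RDiff (Times c b) a | Times c (RDiff b a) <- [f]]  -- (11.3)
  ++ [RDiff (Times b c) a | Times (RDiff b a) c <- [f]]  -- (11.4)
  ++ [Plus a (Times b c)  | Times (Plus a b) c  <- [f]]  -- (12.1)
  ++ [Plus a (Times c b)  | Times c (Plus a b)  <- [f]]  -- (12.2)
  ++ [Plus (Times c b) a  | Times c (Plus b a)  <- [f]]  -- (12.3)
  ++ [Plus (Times b c) a  | Times (Plus b a) c  <- [f]]  -- (12.4)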