Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
6339
José M. Sempere Pedro García (Eds.)
Grammatical Inference: Theoretical Results and Applications 10th International Colloquium, ICGI 2010 Valencia, Spain, September 13-16, 2010 Proceedings
Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors
José M. Sempere
Universidad Politécnica de Valencia
Departamento de Sistemas Informáticos y Computación
Camino de Vera s/n, 46022 Valencia, Spain
E-mail: [email protected]

Pedro García
Universidad Politécnica de Valencia
Departamento de Sistemas Informáticos y Computación
Camino de Vera s/n, 46022 Valencia, Spain
E-mail: [email protected]

Library of Congress Control Number: 2010933123
CR Subject Classification (1998): I.2, F.1, I.4, I.5, J.3, H.3
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN 0302-9743
ISBN-10 3-642-15487-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-15487-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180
Preface
The first edition of the International Colloquium on Grammatical Inference (ICGI) was held in Essex (United Kingdom) in 1993. After the success of this meeting there have been eight more editions, hosted by different academic institutions across the world: Alicante (Spain, 1994), Montpellier (France, 1996), Ames, Iowa (USA, 1998), Lisbon (Portugal, 2000), Amsterdam (The Netherlands, 2002), Athens (Greece, 2004), Tokyo (Japan, 2006) and Saint-Malo (France, 2008). ICGI 2010 was held in Valencia (Spain) during September 13–16. It was organized by the Research Group on Formal Language Theory, Computability and Complexity from the Technical University of Valencia.

This was the tenth edition of ICGI, which is a nice number for celebrations. Ten editions is a sign of good health for any conference. In the case of Grammatical Inference, it means that the topics, problems and applications of this research area are alive and serve as a good framework to study related aspects of artificial intelligence, natural language processing, formal language theory, computability and complexity, bioinformatics, pattern recognition, etc.

There were two reviews and local discussions among the members of the Program Committee (PC) in order to evaluate every work proposed to the conference. This volume contains the texts of 32 papers presented at ICGI 2010. They are divided into two groups of works. There are 18 regular papers (out of 25) and 14 short papers (11 out of 15, and three regular papers proposed as short ones). The topics of the papers range from theoretical results about the learning of different formal language classes (regular, context-free, context-sensitive, etc.) to application papers on bioinformatics, language modelling, software engineering, etc.

In addition, there are two invited lectures delivered by distinguished scientists on the following topics:
– Simon Lucas (University of Essex, UK): Grammatical Inference and Games
– David B. Searls (University of Pennsylvania, USA): Molecules, Languages, and Automata

In this edition, for the first time, there was a Best Student Paper Award to motivate young researchers in this area to continue their research work. The award was given to Franco Luque for his paper “Bounding the Maximal Parsing Performance of Non-Terminally Separated Grammars.”

The first day of the conference hosted four tutorial talks given by prominent scientists of the area on different aspects of grammatical inference. We are grateful to the tutorial lecturers for the brilliant talks: Tim Oates, with Sourav Mukherjee, Colin de la Higuera, Francois Coste and Damián López, with Pedro García.

We would like to thank the many people who contributed to the success of ICGI 2010. First of all, we are grateful to the members of the Steering Committee
that supported our proposal to organize the conference. It was very exciting to organize ICGI 2010 given that some members of the Local Organizing Committee were involved in the organization of ICGI 1994. We are very grateful to the members of the PC for their time and effort in carrying out the reviewing process. The help and the experience that they provided were invaluable, and the suggestions that they proposed to improve different aspects of the conference were brilliant. Thanks are given to the external reviewers who helped the PC members during the review process: Kengo Sato, Manuel Vázquez de Parga and Damián López. The joint effort of these people ensured the quality of the works presented in this volume.

The success of the conference was possible due to the work of the Local Organizing Committee. We especially appreciate the effort and work of Damián López, who was involved in many aspects of the conference. In addition, we received the support of the Centre for Innovation, Research and Technology Transfer (CTT) and the Continuous Training Centre (CFP) of the Technical University of Valencia. We are grateful to the people of these institutions for helping us to carry out different aspects of the organization of the conference.

Last, but not least, we are grateful to the sponsors of the conference: The PASCAL2 Network of Excellence, the Spanish Ministry of Science and Innovation, BANCAJA, and the Technical University of Valencia together with the Department of Information Systems and Computation and the School of Engineering in Computer Science.

We hope to celebrate the next ten editions of ICGI. We are sure that it will have a brilliant and exciting future in this research area that tries to identify and solve many interesting problems before the limit.

June 2010
José M. Sempere
Pedro García
Conference Organization
Program Chair
José M. Sempere, Universidad Politécnica de Valencia, Spain

Program Committee
Pieter Adriaans, Universiteit van Amsterdam, The Netherlands
Dana Angluin, Yale University, USA
Jean-Marc Champarnaud, Université de Rouen, France
Alexander Clark, Royal Holloway University of London, UK
Francois Coste, INRIA, France
Colin de la Higuera, Université de Nantes - LINA, France
Francois Denis, Université de Provence, France
Henning Fernau, Universität Trier, Germany
Pedro García, Universidad Politécnica de Valencia, Spain
Makoto Kanazawa, National Institute of Informatics, Japan
Satoshi Kobayashi, University of Electro-Communications, Japan
Laurent Miclet, ENSSAT-Lannion, France
Tim Oates, University of Maryland Baltimore County, USA
Arlindo Oliveira, Lisbon Technical University, Portugal
Jose Oncina, Universidad de Alicante, Spain
Georgios Paliouras, Institute of Informatics Telecommunications, Greece
Yasubumi Sakakibara, Keio University, Japan
Etsuji Tomita, University of Electro-Communications, Japan
Menno van Zaanen, Tilburg University, The Netherlands
Ryo Yoshinaka, Japan Science and Technology Agency, Japan
Sheng Yu, The University of Western Ontario, Canada
Thomas Zeugmann, Hokkaido University, Japan
Local Organization
All members are from the Universidad Politécnica de Valencia, Spain.
Marcelino Campos
Antonio Cano
Damián López
Alfonso Muñoz-Pomer
Piedachu Peris
Manuel Vázquez de Parga
Sponsoring Institutions
The PASCAL2 Network of Excellence
Ministerio de Ciencia e Innovación (Spain)
Universidad Politécnica de Valencia (UPV)
Department of Information Systems and Computation (DSIC, UPV)
School of Engineering in Computer Science (ETSINF, UPV)
BANCAJA
Table of Contents
Invited Talks

Grammatical Inference and Games: Extended Abstract
    Simon M. Lucas
Molecules, Languages and Automata
    David B. Searls
Regular Papers

Inferring Regular Trace Languages from Positive and Negative Samples
    Antonio Cano Gómez
Distributional Learning of Some Context-Free Languages with a Minimally Adequate Teacher
    Alexander Clark
Learning Context Free Grammars with the Syntactic Concept Lattice
    Alexander Clark
Learning Automata Teams
    Pedro García, Manuel Vázquez de Parga, Damián López, and José Ruiz
Exact DFA Identification Using SAT Solvers
    Marijn J.H. Heule and Sicco Verwer
Learning Deterministic Finite Automata from Interleaved Strings
    Joshua Jones and Tim Oates
Learning Regular Expressions from Representative Examples and Membership Queries
    Efim Kinber
Splitting of Learnable Classes
    Hongyang Li and Frank Stephan
PAC-Learning Unambiguous k,l-NTS≤ Languages
    Franco M. Luque and Gabriel Infante-Lopez
Bounding the Maximal Parsing Performance of Non-Terminally Separated Grammars
    Franco M. Luque and Gabriel Infante-Lopez
CGE: A Sequential Learning Algorithm for Mealy Automata
    Karl Meinke
Using Grammar Induction to Model Adaptive Behavior of Networks of Collaborative Agents
    Wico Mulder and Pieter Adriaans
Transducer Inference by Assembling Specific Languages
    Piedachu Peris and Damián López
Sequences Classification by Least General Generalisations
    Frédéric Tantini, Alain Terlutte, and Fabien Torre
A Likelihood-Ratio Test for Identifying Probabilistic Deterministic Real-Time Automata from Positive Data
    Sicco Verwer, Mathijs de Weerdt, and Cees Witteveen
A Local Search Algorithm for Grammatical Inference
    Wojciech Wieczorek
Polynomial-Time Identification of Multiple Context-Free Languages from Positive Data and Membership Queries
    Ryo Yoshinaka
Grammatical Inference as Class Discrimination
    Menno van Zaanen and Tanja Gaustad
Short Papers

MDL in the Limit
    Pieter Adriaans and Wico Mulder
Grammatical Inference Algorithms in MATLAB
    Hasan Ibne Akram, Colin de la Higuera, Huang Xiao, and Claudia Eckert
A Non-deterministic Grammar Inference Algorithm Applied to the Cleavage Site Prediction Problem in Bioinformatics
    Gloria Inés Alvarez, Jorge Hernán Victoria, Enrique Bravo, and Pedro García
Learning PDFA with Asynchronous Transitions
    Borja Balle, Jorge Castro, and Ricard Gavaldà
Grammar Inference Technology Applications in Software Engineering
    Barrett R. Bryant, Marjan Mernik, Dejan Hrnčič, Faizan Javed, Qichao Liu, and Alan Sprague
Hölder Norms and a Hierarchy Theorem for Parameterized Classes of CCG
    Christophe Costa Florêncio and Henning Fernau
Learning of Church-Rosser Tree Rewriting Systems
    M. Jayasrirani, D.G. Thomas, Atulya K. Nagar, and T. Robinson
Generalizing over Several Learning Settings
    Anna Kasprzik
Rademacher Complexity and Grammar Induction Algorithms: What It May (Not) Tell Us
    Sophia Katrenko and Menno van Zaanen
Extracting Shallow Paraphrasing Schemata from Modern Greek Text Using Statistical Significance Testing and Supervised Learning
    Katia Lida Kermanidis
Learning Subclasses of Parallel Communicating Grammar Systems
    Sindhu J. Kumaar, P.J. Abisha, and D.G. Thomas
Enhanced Suffix Arrays as Language Models: Virtual k-Testable Languages
    Herman Stehouwer and Menno van Zaanen
Learning Fuzzy Context-Free Grammar—A Preliminary Report
    Olgierd Unold
Polynomial Time Identification of Strict Prefix Deterministic Finite State Transducers
    Mitsuo Wakatsuki and Etsuji Tomita

Author Index
Grammatical Inference and Games: Extended Abstract

Simon M. Lucas
School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, UK
[email protected]

1 Introduction
This paper discusses the potential synergy between research in grammatical inference and research in artificial intelligence applied to games. There are two aspects to this: the potential as a rich source of challenging and engaging test problems, and the potential for real applications. Grammatical Inference (GI) addresses the problem of learning a model for recognising, interpreting, generating or transducing data structures. Learning may proceed based on samples of the structures or via access to a simulator or oracle with which the learner can interact by asking questions or running experiments. In the majority of GI research the data structures are labelled strings, and the most successful GI algorithms infer finite state automata, or their stochastic counterparts such as N-Gram models, or hidden Markov models. We now consider some different types of grammatical inference, and the application of those types to particular problems in AI and Games.
2 Sequence Recognition
A common application of GI is to sequence recognition. The aim of the learning phase is to infer a sequence recognition model which is then used for classification. Real-world problems tend to be noisy, and recognition of real-world sequences is usually best performed by stochastic models. The type of GI that works best for these applications is often based on relatively simple statistical models, such as n-gram models or hidden Markov models.

A significant application in computer games is the so-called “bot-detection” problem. Massively Multiplayer Online Games often involve the players acquiring valuable assets, and this acquisition process may involve a significant amount of tedious labour on behalf of the player. An alternative is for the player to spend real-world money to acquire such assets. Typically the in-game assets can be bought either with real money or with virtual game money (hence there is an exchange rate between the two). Unscrupulous players may use bots to do the tedious work needed to acquire the assets, which can then be sold to generate real-world revenue. The use of bots has a detrimental effect on the game play. People play on-line games to play against other people, and bots are typically less fun to play against, partly because bots lack the flexible intelligence that people take for granted.

There are two main possible approaches to bot detection: active detection and passive detection. Active detection involves modifying the game to introduce tests which are specifically hard for bots to pass, such as CAPTCHA-style tests. These are highly effective, but rather disruptive to the game play. The passive approach is to try to identify behaviour that would be unnatural for human players, based on some statistical measures of the observed actions of the player. An example of this uses the trajectories of the players’ avatars (the in-game characters controlled by the players) to compare against typical bot trajectories [1]. Given the vast amount of player data available this would make an interesting challenge for statistical GI methods, such as those that have been reported in previous ICGI conferences.

J.M. Sempere and P. García (Eds.): ICGI 2010, LNAI 6339, pp. 1–4, 2010. © Springer-Verlag Berlin Heidelberg 2010
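As a rough illustration of the passive approach, the sketch below trains one smoothed bigram model per class and labels a new sequence by comparing log-likelihoods. It is only a minimal sketch under stated assumptions: the encoding of avatar trajectories as strings over the move symbols N, E, S, W, the toy logs, and all function names are hypothetical, and this is not the method of [1].

```python
import math
from collections import Counter

def train_bigram(sequences, alphabet, alpha=1.0):
    """Return a function giving add-alpha smoothed bigram log-probabilities."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = ["^"] + list(seq)  # "^" marks the start of a sequence
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab = len(alphabet)

    def log_prob(prev, cur):
        return math.log((pair_counts[(prev, cur)] + alpha)
                        / (context_counts[prev] + alpha * vocab))

    return log_prob

def score(log_prob, seq):
    """Log-likelihood of a whole sequence under a bigram model."""
    padded = ["^"] + list(seq)
    return sum(log_prob(p, c) for p, c in zip(padded, padded[1:]))

def classify(seq, human_model, bot_model):
    """Label a sequence with the class whose model explains it better."""
    return "bot" if score(bot_model, seq) > score(human_model, seq) else "human"

# Hypothetical toy data: each string is a logged sequence of avatar moves.
ALPHABET = "NESW"
HUMAN_LOGS = ["NNEESWWN", "ENWSSEN", "WNNESEWS"]
BOT_LOGS = ["NENENENE", "NENENEN", "ENENENEN"]

human_model = train_bigram(HUMAN_LOGS, ALPHABET)
bot_model = train_bigram(BOT_LOGS, ALPHABET)
```

With real data the alphabet would have to be designed with care (for example, discretised headings or movement primitives), and a held-out set would be needed to choose the n-gram order and the smoothing constant.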
3 Learning Finite State Machines
Finite state automata have been among the most widely studied models within the GI community and have been the subject of some interesting competitions such as the Abbadingo One DFA induction competition [2] and the Gecco 2005 DFA from noisy samples competition (http://cswww.essex.ac.uk/staff/sml/gecco/NoisyDFA.html). State machines are also the most widely used architecture for controlling the non-player characters (NPCs) in video games. The state machines used in video games are typically more complex than the ones used in GI research. In particular, the states represent actions that the character may execute continuously until an event or a satisfied condition triggers a transition to a new state. Hence the complete representation of a state machine controller goes beyond a transition matrix and set of labels, and includes some decision logic to trigger the transitions, and perhaps also procedural code to map high-level actions into low-level actions.

Finite state machines have proven to be useful for encoding relatively simple behaviours but the main limitation is that they do not scale well to more complex problems. For this reason more sophisticated NPC control architectures such as hierarchical state machines, planning systems and behaviour trees are being developed and applied, and grammatical inference research could benefit from trying to learn richer models of this type. This would have the potential to reduce game development costs by realising a programming-by-example model. The idea would be for the game designers to provide sample behaviours for the non-player characters using standard game controllers, and have the system learn an underlying finite state machine able to reproduce the desired behaviour.

The learning of finite state machines has also been studied from the perspective of learning to play games, such as the Resource Protection Game [3]. The challenge here was to learn a strategy encoded as a finite state machine, where the objective for the player is to capture grid cells by visiting them before the opponent does, given only local information about the neighbouring grid cells. By placing the finite state machine induction problem within the context of game playing, it becomes a more challenging problem than the conventional GI problem of trying to learn a model from a fixed sample of data, or with reference to an oracle, since the learner must now also attempt to solve a harder credit assignment problem. Actions taken early in the game may lead to success or to failure, but this also depends on the actions taken by the opponent.

Over the years the grammatical inference community has run several competitions that go beyond learning DFA from samples of data, such as context-free grammar learning (Omphalos, ICGI 2004 [4]), learning models of machine translation (Tenjinno, ICGI 2006), and the active learning of DFA in the minimum number of queries to an oracle (Zulu, ICGI 2010). An interesting future competition would be the learning of finite-state (or other) game controllers either from game logs or by embedding the learning agent directly in the game, giving it control over its learning experience.
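To make the difference from a plain transition matrix concrete, the following is a minimal sketch of an NPC controller in which each state is an ongoing action and each transition carries its own decision logic as a guard over the current percept. The guard NPC, its states, the `distance` percept and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Percept = Dict[str, float]
Guard = Callable[[Percept], bool]

@dataclass
class StateMachineController:
    """Each state is an action executed until one of its guards fires."""
    state: str
    # For each state: ordered (guard, next_state) pairs; the first true guard wins.
    transitions: Dict[str, List[Tuple[Guard, str]]] = field(default_factory=dict)

    def step(self, percept: Percept) -> str:
        """Advance one tick and return the action (state) to execute."""
        for guard, target in self.transitions.get(self.state, []):
            if guard(percept):
                self.state = target
                break
        return self.state

def make_guard_npc() -> StateMachineController:
    """Build a hypothetical patrol/chase/attack controller."""
    return StateMachineController(
        state="patrol",
        transitions={
            "patrol": [(lambda p: p["distance"] < 5.0, "chase")],
            "chase": [
                (lambda p: p["distance"] < 1.0, "attack"),
                (lambda p: p["distance"] > 8.0, "patrol"),
            ],
            "attack": [(lambda p: p["distance"] >= 1.0, "chase")],
        },
    )

npc = make_guard_npc()
trace = [npc.step({"distance": d}) for d in (10.0, 4.0, 0.5, 3.0, 9.0)]
```

A learner in the programming-by-example setting would have to recover both the state graph and the guards from demonstrations, which is one reason the problem is harder than inducing a DFA over a fixed alphabet.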
4 Semantic Language Learning
Most of the work on grammatical inference involves learning only the syntax of language, but it is well understood that children learn language within a rich semantic and pragmatic context. Feldman [5] describes how computational modelling of language acquisition can be extended and applied to grammatical inference within a semantic context. Orkin and Roy [6] devised a relatively simple on-line game called the Restaurant Game with part of the motivation being to test how well a system would be able to learn to behave in realistic ways using a plan network from the observed interactions of human users playing the game. To play the game users play either as a customer or a waitress, and click actions while typing free text to fill in the details with the aim of completing a successful dining transaction. This is of interest to grammatical inference in several ways. The system learned a plan network from the game logs of over 5,000 games. The plan network consists of a set of action nodes together with arcs showing which nodes follow other nodes. Each action node is defined by its name (e.g. Pickup), its requirements (e.g. actor=customer and object=menu), the local-world pre-conditions (e.g. actor sitting on chair, menu on table), and the effects of taking the action (e.g. customer has menu). The learning algorithm was able to infer plan networks from the game logs using clustering and statistical n-gram methods, and the inferred networks were able to rate the degree to which a particular game log was representative of typical restaurant behaviour.
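The structure described above can be sketched as a small data type plus arc statistics. The code below is a simplified illustration, not Orkin and Roy's algorithm: it omits clustering entirely, treats arcs as bigram counts over action names, and rates a log by the mean log-probability of its transitions. The class name, the `floor` parameter and the toy logs are assumptions.

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionNode:
    """One node of a plan network, with the four fields named in the text."""
    name: str
    requirements: tuple
    preconditions: tuple
    effects: tuple

def learn_arcs(logs):
    """Count which action follows which across all game logs."""
    arcs = Counter()
    outgoing = Counter()
    for log in logs:
        for a, b in zip(log, log[1:]):
            arcs[(a, b)] += 1
            outgoing[a] += 1
    return arcs, outgoing

def typicality(log, arcs, outgoing, floor=1e-3):
    """Mean log-probability of the log's transitions; higher is more typical."""
    if len(log) < 2:
        return 0.0
    total = 0.0
    for a, b in zip(log, log[1:]):
        p = arcs[(a, b)] / outgoing[a] if outgoing[a] else 0.0
        total += math.log(max(p, floor))  # floor avoids log(0) for unseen arcs
    return total / (len(log) - 1)

pickup = ActionNode(
    name="Pickup",
    requirements=("actor=customer", "object=menu"),
    preconditions=("actor sitting on chair", "menu on table"),
    effects=("customer has menu",),
)

# Hypothetical logs, reduced to sequences of action names.
LOGS = [
    ["SitDown", "Pickup", "Order", "Eat", "Pay"],
    ["SitDown", "Pickup", "Order", "Eat", "Pay"],
    ["SitDown", "Order", "Eat", "Pay"],
]
arcs, outgoing = learn_arcs(LOGS)
```

A fuller model would also check each node's requirements and pre-conditions against the simulated world state before scoring a transition.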
5 Grammatical Inference and Opponent Modelling
In order to provide some simple yet interesting examples of game-log analysis, results will be reported on some problems related to playing Ms. Pac-Man. This is a classic arcade game requiring great skill in order to achieve high scores. The best human players can score over 900,000 after many hours of play. The ghosts in Ms. Pac-Man were programmed to provide the player with a fun experience, and they do not play optimally. Part of the control logic of the ghosts is a finite state machine. Expert players are able to make good predictions about the next moves of the ghosts, and by making such predictions are able to escape from apparently impossible situations. The challenge here for GI methods is to infer finite state machines and hence perform ghost behaviour prediction. This can be done either by passively studying the game logs of any players or, for potentially higher-performance learning, by an active learner embedded in the game that deliberately attempts to reach states of the game in which it is likely to learn most about the ghost behaviours.
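For the passive setting, a common first step in state-merging DFA induction is to build a prefix tree acceptor (PTA) from the observed sequences; states are then merged to generalise. The sketch below builds only the PTA and uses it to list the moves seen after a given prefix. The encoding of ghost moves as the symbols L, R, U, D and the sample logs are hypothetical, and no merging is attempted.

```python
def build_pta(samples):
    """Build a prefix tree acceptor with one state per distinct prefix."""
    transitions = {0: {}}  # state -> {symbol: next_state}; state 0 is the root
    accepting = set()
    for word in samples:
        state = 0
        for symbol in word:
            if symbol not in transitions[state]:
                new_state = len(transitions)
                transitions[new_state] = {}
                transitions[state][symbol] = new_state
            state = transitions[state][symbol]
        accepting.add(state)
    return transitions, accepting

def possible_next(transitions, prefix):
    """Symbols observed after the prefix, or an empty set if it was never seen."""
    state = 0
    for symbol in prefix:
        state = transitions[state].get(symbol)
        if state is None:
            return set()
    return set(transitions[state])

# Hypothetical logged ghost move sequences.
GHOST_LOGS = ["LLU", "LLD", "LR"]
pta, accepting = build_pta(GHOST_LOGS)
```

An active learner would instead choose which prefixes to explore, steering the game towards states where the set of possible next moves is still uncertain.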
6 Conclusion
The overall conclusion of this paper is that there is a significant overlap in some of the fundamental problems and architectures used in grammatical inference and in games. Now that games have superb graphics and increasingly realistic physics, the next frontier is improving the game AI. Grammatical inference has the potential to contribute to this, but to make a convincing impact, it will need to deal with the richer control models used in game AI. The talk will discuss these ideas in more detail, and describe some on-going experiments by the author.
References
1. Chen, K.-T., Liao, A., Pao, H.-K.K., Chu, H.-H.: Game bot detection based on avatar trajectory. In: Stevens, S.M., Saldamarco, S.J. (eds.) ICEC 2008. LNCS, vol. 5309, pp. 94–105. Springer, Heidelberg (2008)
2. Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, pp. 1–12. Springer, Heidelberg (1998)
3. Spears, W.M., Gordon, D.F.: Evolution of strategies for resource protection problems. In: Advances in evolutionary computing: theory and applications, pp. 367–392. Springer, Heidelberg (2000)
4. Starkie, B., Coste, F., van Zaanen, M.: The Omphalos context-free grammar learning competition. In: Paliouras, G., Sakakibara, Y. (eds.) ICGI 2004. LNCS (LNAI), vol. 3264, pp. 16–27. Springer, Heidelberg (2004)
5. Feldman, J.A.: Real language learning. In: Honavar, V.G., Slutzki, G. (eds.) ICGI 1998. LNCS (LNAI), vol. 1433, pp. 114–125. Springer, Heidelberg (1998)
6. Orkin, J., Roy, D.: The restaurant game: Learning social behavior and language from thousands of players online. Journal of Game Development 3(1), 39–60 (2007)
Molecules, Languages and Automata

David B. Searls
Lower Gwynedd, PA 19454, USA
Abstract. Molecular biology is full of linguistic metaphors, from the language of DNA to the genome as “book of life.” Certainly the organization of genes and other functional modules along the DNA sequence invites a syntactic view, which can be seen in certain tools used in bioinformatics such as hidden Markov models. It has also been shown that folding of RNA structures is neatly expressed by grammars that require expressive power beyond context-free, an approach that has even been extended to the much more complex structures of proteins. Processive enzymes and other “molecular machines” can also be cast in terms of automata. This paper briefly reviews linguistic approaches to molecular biology, and provides perspectives on potential future applications of grammars and automata in this field.
1 Introduction
The terminology of molecular biology from a very early point adopted linguistic and cryptologic tropes, but it was not until some two decades ago that serious attempts were made to apply formal language theory in this field. These included efforts to model both the syntactic structure of genes, reflecting their hierarchical organization, and the physical structure of nucleic acids such as DNA and RNA, where grammars proved suitable for representing folding patterns in an abstract manner. In the meantime, it was also recognized that automata theory could be a basis for representing some of the key string algorithms used in the analysis of macromolecular sequences. These varied approaches to molecular biology are all bound together by formal language theory, and its close relationship to automata theory. In reviewing these approaches, and discussing how they may be extended in new directions within biology, we hope to demonstrate the power of grammars as a uniform, computer-readable, executable specification language for biological knowledge.
2 Structural Grammars
Nucleic acids are polymers of four bases, and are thus naturally modeled as languages over the corresponding alphabets, which for DNA comprise the well-known set $\Sigma = \{a, c, g, t\}$. (RNA bases are slightly different, but for all practical purposes can be treated the same.) DNA, which carries the genetic information in our chromosomes, tends to form itself into a double helix with two strands that are held together by a complementary pairing of the opposing bases, ‘a’ with ‘t’ and ‘g’ with ‘c’. RNA molecules are more often single-stranded, though they can fold back on themselves to form regions of double-stranded structure, called secondary structure. Given these bare facts, the language of all possible RNA molecules is specified by the following trivial grammar (with $S$ the start symbol and $\epsilon$ the empty string, as usual):

$$S \to xS \mid \epsilon \quad \text{for each } x \in \{a, c, g, u\} \tag{1}$$

J.M. Sempere and P. García (Eds.): ICGI 2010, LNAI 6339, pp. 5–10, 2010. © Springer-Verlag Berlin Heidelberg 2010
It would seem to be natural to specify DNA helices, which comprise two strands, as a pair of strings, and a DNA language as a set of such pairs. However, DNA has an additional important constraint, in fact two: first, the bases opposing one another are complementary, and second, the strings have directionality (which is chemically recognizable) with the base-paired strands running in opposite directions. We have previously shown how this sort of model can be extended to describe a number of specific phenomena in RNA secondary structure, and in fact a simple addition to the stem-and-loop grammar allows for arbitrarily branching secondary structures:

$$S \to xS\bar{x} \mid SS \mid \epsilon \quad \text{where } \bar{g} = c,\ \bar{c} = g,\ \bar{a} = t,\ \bar{t} = a \tag{2}$$
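Membership in the language of grammar (2) can be checked with a small interval dynamic program, in the spirit of CYK. The following is a minimal sketch, assuming the three productions are read as "pair the two ends", "split into two derivable parts" and "empty string", and using the complement table given with the grammar; the function name is hypothetical.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}

def is_orthodox(w):
    """Return True if w is derivable from S in grammar (2)."""
    n = len(w)
    # derivable[i][j] is True when the substring w[i:j] can be derived from S.
    derivable = [[False] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        derivable[i][i] = True  # S -> epsilon
    # Every production adds an even number of terminals, so odd lengths fail.
    for length in range(2, n + 1, 2):
        for i in range(n - length + 1):
            j = i + length
            # S -> x S x-bar: the two ends pair and the interior is derivable.
            paired = (COMPLEMENT.get(w[i]) == w[j - 1]
                      and derivable[i + 1][j - 1])
            # S -> S S: some split into two non-empty derivable halves.
            split = any(derivable[i][k] and derivable[k][j]
                        for k in range(i + 2, j, 2))
            derivable[i][j] = paired or split
    return derivable[0][n]
```

For example, `gatc` is a single stem (g pairs with c around the pair a, t), while `gcat` is two adjacent stems joined by the branching production. The check runs in cubic time in the length of the string.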
Examples of such branching secondary structure include a cloverleaf form such as is found in transfer RNA or tRNA, an important adaptor molecule in the translation of genetic information from messenger RNA to protein. The language of (2) describes what is called orthodox secondary structure, which for our purposes can be considered to be all fully base-paired structures describable by context-free grammars. There are, however, secondary structures that are beyond context-free, the archetype of which is the so-called pseudoknot. Pseudoknots can be conceived as a pair of stem-loop structures, one of whose loops constitutes one side of the other’s stem. The corresponding (idealized) language is of the form $uv\bar{u}^R\bar{v}^R$, which cannot be expressed by any context-free grammar. It is sometimes described as the intersection of two context-free palindromes of the form $uv\bar{u}^R$ and $v\bar{u}^R\bar{v}^R$, but of course context-free languages are not closed under intersection. Pseudoknots and other non-context-free elements of the language of secondary structure can be easily captured with context-sensitive grammars, but the resulting complex movements of nonterminals in sentential forms tend not to enlighten. Rather, grammars with more structured rules, such as Tree-Adjoining Grammars (TAG), have been more profitably used for this purpose [6]. A grammar variation that the author has proposed describes even more complex, multi-molecular base-paired complexes of nucleic acids [4]. This is enabled by the addition to any grammar of a new symbol $\delta$ which is understood to cut the string at the point it appears. This means that derivations ultimately give rise not to strings but to sets of strings arising from such cuts, which may be base-paired in arbitrarily ramified networks.
Proteins are more complex macromolecular structures with several kinds of intermolecular interactions. Some of the basic types of such recurrent structural themes have been described with a variety of grammars [6].
3 Gene Grammars
Genes, which are encoded in the DNA of organisms, have a hierarchical organization to them that is determined by the process by which they are converted into proteins (for the most part). Genes are first transcribed into messenger RNA, or mRNA, which constitutes a complementary copy of the gene, and then this is translated into protein. The latter step requires the DNA/RNA code to be adapted to that of proteins, whose alphabet comprises the twenty amino acids. This encoding is called the genetic code, which appears as a table of triplets of bases mapped to amino acids. Transcription itself involves a number of complications regarding the structure of genes, such as the fact that the actual coding sequence is interrupted by segments that are spliced out at an intermediate step, establishing what is called the intron/exon structure of the gene. In addition there are many signal sequences embedded in the gene, including in flanking non-coding regions, that determine such things as the starting point of transcription, the conditions under which transcription will occur, and the points at which splicing will occur. The author has demonstrated how grammars can effectively capture all these features of genes, including ambiguities such as alternative splicing whereby different versions of genes may arise from the same genome sequence [1]. Such grammars have been used to recognize the presence of genes in raw sequence data by means of parsing, in what amounts to an application of syntactic pattern recognition [2]. (Modern ‘gene-finders’, however, use highly customized algorithms for efficiency, though the most effective of these do capture the syntactic structure of the standard gene model.) As the variety of genes and related features (such as immunoglobulin superfamily genes and microRNA) and their higher-level organization in genomes continues to grow more complex, grammars may yet prove to be a superior means to formally specify knowledge about such structure.
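The pipeline described above (transcription into a complementary mRNA, removal of introns, and translation through the table of base triplets) can be sketched in a few lines. This is an illustrative sketch only: the codon table is a small excerpt of the standard genetic code, the template strand is assumed to be written 3'→5' so that the mRNA is its base-by-base complement, and the intron coordinates and function names are hypothetical.

```python
# Complement used when copying a DNA template strand into mRNA.
DNA_TO_RNA = {"a": "u", "t": "a", "g": "c", "c": "g"}

# Small excerpt of the standard genetic code (mRNA codon -> amino acid).
CODON_TABLE = {
    "aug": "Met", "uuu": "Phe", "ggc": "Gly", "gcu": "Ala",
    "uaa": "Stop", "uag": "Stop", "uga": "Stop",
}

def transcribe(template_3_to_5):
    """Copy a template strand (written 3'->5') into mRNA (written 5'->3')."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)

def splice(pre_mrna, introns):
    """Remove introns given as (start, end) half-open index pairs."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)

def translate(mrna):
    """Read codons from the first base until a stop codon is reached."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]  # KeyError outside the excerpt
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide
```

A grammar-based treatment would express the same stages as nested rules (gene, exon, intron, codon) rather than as procedures, which is what makes ambiguities such as alternative splicing expressible as alternative derivations.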
4
Genetic Grammars
Gregor Mendel laid the foundation for modern genetics by asserting a model for the inheritance of traits based on a parsimonious set of postulates. While many modifications have been required to account for a wider and wider set of observations, the basic framework has proven robust. Many mathematical and computational formalizations of these postulates and their sequelae have been developed, which support such activities as pedigree analysis and genetic mapping. The author has been developing a grammar-based specification of Mendelian genetics which is able to depict the basic processes of gamete formation, segregation of alleles, zygote formation, and phenotypic expression within a uniform
framework representing genetic crosses [unpublished]. With this basic ‘Mendelian grammar,’ extensions are possible that account in a natural way for various known mechanisms for modification of segregation ratios, linkage, crossing-over, interference, and so forth. One possible use of this formalism is as a means to frame certain types of analysis as a form of grammar inference. For example, mapping of genes to linkage groups and ordering linkage groups can be seen as finding an optimal structure of an underlying grammar so as to fit experimental data. Especially intriguing is the possibility of including together in one framework the genetic analysis with phenotypic grammars, for example in the genetic dissection of pathways.
5
Molecular Machines
Enzymes and other biochemical structures such as ribosomes are sometimes called ‘molecular machines’ because they perform repetitive chemical and/or mechanical operations on other molecules. In particular, a large class of such objects process nucleic acids in various ways, many of them by attaching to and moving along the DNA or RNA in what is termed processive fashion. This immediately brings to mind computational automata which perform operations on tapes. Since automata have their analogues in grammars, it is natural to ask whether grammars can model enzymes that act on DNA or RNA. In fact the trivial right-recursive grammar that we showed at the outset (1) can be considered a model for terminal transferase, an enzyme that synthesizes DNA by attaching bases to a growing chain, as in this derivation:
\[ S \Rightarrow cS \Rightarrow ctS \Rightarrow ctcS \Rightarrow ctcaS \Rightarrow ctcaaS \Rightarrow ctcaagS \Rightarrow ctcaag \]
We can view the nonterminal S as the molecular machine, the terminal transferase itself, laying down bases sequentially and then departing into solution. Similarly, we can envision a context-sensitive grammar that models an exonuclease, an enzyme that degrades nucleic acids a base at a time from one or the other end. The orientation is important, because exonucleases are specific for which end they chew on, and therefore whether they run in the forward or reverse direction on the strand:
\[
\begin{array}{ll}
Fx \to F \mid \varepsilon & \text{forward exonuclease} \\
xR \to R \mid \varepsilon & \text{reverse exonuclease}
\end{array}
\qquad (3)
\]
These can produce derivations such as the following, with the nonterminals again physically mimicking the action of the corresponding enzymes:
\[ Fgcaa \Rightarrow Fcaa \Rightarrow Faa \Rightarrow Fa \Rightarrow F \Rightarrow \varepsilon \]
\[ atggacR \Rightarrow atggaR \Rightarrow atggR \Rightarrow atgR \Rightarrow atR \Rightarrow at \]
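The two derivations above can be replayed mechanically. The following is a minimal sketch, not the author's implementation: the function names and the `bases_to_remove` parameter are assumptions introduced only for illustration, and the empty string stands for $\varepsilon$.

```python
def forward_exonuclease(strand: str) -> list[str]:
    """Sentential forms obtained by applying Fx -> F until the strand is
    exhausted, followed by F -> epsilon (the enzyme leaves)."""
    steps = ["F" + strand]
    while strand:
        strand = strand[1:]            # Fx -> F: digest the base right of F
        steps.append("F" + strand)
    steps.append("")                   # F -> epsilon
    return steps


def reverse_exonuclease(strand: str, bases_to_remove: int) -> list[str]:
    """Sentential forms obtained by applying xR -> R `bases_to_remove` times
    (hypothetical parameter), followed by R -> epsilon."""
    steps = [strand + "R"]
    for _ in range(min(bases_to_remove, len(strand))):
        strand = strand[:-1]           # xR -> R: digest the base left of R
        steps.append(strand + "R")
    steps.append(strand)               # R -> epsilon: whatever is left survives
    return steps


if __name__ == "__main__":
    print(" => ".join(s or "(empty)" for s in forward_exonuclease("gcaa")))
    print(" => ".join(reverse_exonuclease("atggac", 4)))
```

Stopping the reverse enzyme after four bases is an arbitrary choice made to match the second derivation; in the grammar itself the choice between continuing and departing is nondeterministic.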
In the first derivation, the F completely digests the nucleic acid strand and then itself disappears via the $\varepsilon$ disjunct (into solution, as it were). On the other hand, in the second example we show the R exonuclease departing without completing the job, which mirrors the biological fact that enzymes can show greater or lesser propensity to hop on or off the nucleic acid spontaneously. We could model the tendency to continue the recursion (known as an enzyme’s processivity) with a stochastic grammar, where probabilities attached to rules would establish the half-lives of actions of the biological processes. The author’s most recent efforts [unpublished] have been to catalogue a wide range of grammars describing the activities of enzymes acting on nucleic acids in various circumstances. This requires the extension of the model to double-stranded DNA, as well as the ability to act on more than one double-stranded molecule at once. With the employment of stochastic grammars, it appears possible to specify a wide variety of biochemical details of molecular machines.
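As a rough illustration of how rule probabilities could translate into processivity, assume (purely for the sake of the example; the text does not fix a particular model) that at every step the forward exonuclease continues with $Fx \to F$ with probability $p$ and departs with $F \to \varepsilon$ with probability $1-p$, independently of earlier steps, and that the strand is long enough not to run out. The number $N$ of bases removed before departure is then geometrically distributed:
\[
P(N = k) = p^{k}(1-p), \qquad P(N \ge k) = p^{k}, \qquad E[N] = \frac{p}{1-p}.
\]
The length at which half of the enzymes have departed, a discrete analogue of a half-life, follows from $p^{k} = 1/2$:
\[
k_{1/2} = \frac{\ln 2}{\ln(1/p)}.
\]
For instance, $p = 0.9$ gives $E[N] = 9$ bases and $k_{1/2} \approx 6.6$ bases.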
6
Edit Grammars
Another view of the movement of nonterminals is as a means to perform editing operations on strings. As in the case of processive enzymes, we view a nonterminal as a ‘machine’ that begins at the left end of an input string and processes to the right end, leaving an altered string as output.
\[
\begin{array}{ll}
Sx \xrightarrow{0} xS & \text{identity } (x \in \Sigma) \\
Sy \xrightarrow{1} xS & \text{substitution } (x \neq y) \\
Sx \xrightarrow{1} S & \text{deletion} \\
S \xrightarrow{1} xS & \text{insertion}
\end{array}
\]
To frame this input/output process in a more standard fashion, one can simply assert a new starting nonterminal $S'$, a rule $S' \to S w \tau$ where $w \in \Sigma^*$ is the input string and $\tau$ is a new terminal marker not in the language, and an absorbing rule $S\tau \to \varepsilon$ that is guaranteed to complete any derivation and leave only the output string. Note that the deletion rule is not strictly context-sensitive (its left side being longer than its right), and that the insertion rule can generate any string whatever as output. The numbers above the arrows here represent the cost of applying the corresponding edit rule. An overall derivation would again move the S nonterminal from the beginning of an input string to the end, leaving the output to its left, as follows:
\[ Sgact \overset{0}{\Longrightarrow} gSact \overset{1}{\Longrightarrow} gtSct \overset{1}{\Longrightarrow} gtcSt \overset{2}{\Longrightarrow} gtcgSt \overset{2}{\Longrightarrow} gtcgtS \]
Here the numbers above the double arrows represent the cumulative cost of the derivation. The rules applied are an identity (for no cost), a substitution of a ‘t’ for an ‘a’ (adding a cost of 1), another identity, an insertion of a ‘g’ (adding a cost of 1), and an identity.
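The derivation above can be replayed step by step. The sketch below assumes the left-to-right reading used in that derivation (the S nonterminal consumes input on its right and leaves output on its left); the operation names, the `COSTS` table and the `replay` function are illustrative assumptions, not part of the original formalism.

```python
# Unit costs as described in the text: identities are free, every other edit costs 1.
COSTS = {"identity": 0, "substitution": 1, "deletion": 1, "insertion": 1}


def replay(input_string, operations):
    """Apply a list of (operation, symbol) pairs and return the sentential
    forms together with the cumulative cost after each step.

    `symbol` is the letter written to the output for substitutions and
    insertions, and is ignored (may be None) for identities and deletions.
    """
    output, rest, cost = "", input_string, 0
    steps = [("S" + rest, cost)]
    for name, symbol in operations:
        if name not in COSTS:
            raise ValueError(f"unknown operation: {name}")
        if name != "insertion" and not rest:
            raise ValueError(f"{name} needs an input letter to the right of S")
        if name == "identity":
            output, rest = output + rest[0], rest[1:]
        elif name == "substitution":
            if symbol == rest[0]:
                raise ValueError("a substitution must change the letter")
            output, rest = output + symbol, rest[1:]
        elif name == "deletion":
            rest = rest[1:]
        else:  # insertion: write a letter without consuming input
            output += symbol
        cost += COSTS[name]
        steps.append((output + "S" + rest, cost))
    return steps


if __name__ == "__main__":
    ops = [("identity", None), ("substitution", "t"), ("identity", None),
           ("insertion", "g"), ("identity", None)]
    for form, cumulative in replay("gact", ops):
        print(cumulative, form)
```

With the five operations listed in the text, the final form is gtcgtS at a cumulative cost of 2.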
Minimal edit distances are typically calculated with dynamic programming algorithms that are O(nm) in the lengths of the strings being compared. The same order of results can be obtained with the appropriate table-based parsers for grammars such as that above, though perhaps with the sacrifice of some efficiency for the sake of generality. The great advantage of the parsing approach is that grammars and their cognate automata make it possible to describe more complex models of string edits, and therefore of processes related to molecular evolution. The author has recast a number of the algorithms developed for such purposes in the form of automata, which can be shown to be equivalent to the recurrence relations typically used to specify such algorithms [4].
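For comparison, the standard O(nm) dynamic programming recurrence mentioned above can be sketched as follows. This is the textbook unit-cost edit distance, given here only as a reference point; it is not the table-based parser or the automata of [4], and the function name is an assumption.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimal number of substitutions, deletions and insertions (unit cost
    each, identities free) needed to turn `source` into `target`."""
    n, m = len(source), len(target)
    # previous[j] holds the distance between source[:i-1] and target[:j].
    previous = list(range(m + 1))
    for i in range(1, n + 1):
        current = [i] + [0] * m
        for j in range(1, m + 1):
            mismatch = 0 if source[i - 1] == target[j - 1] else 1
            current[j] = min(
                previous[j] + 1,             # delete source[i-1]
                current[j - 1] + 1,          # insert target[j-1]
                previous[j - 1] + mismatch,  # identity or substitution
            )
        previous = current
    return previous[m]


if __name__ == "__main__":
    print(edit_distance("gact", "gtcgt"))
```

For the strings of the derivation in this section, `edit_distance("gact", "gtcgt")` returns 2, so the derivation shown there is a minimal-cost one.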
References
1. Searls, D.B.: The linguistics of DNA. Am. Sci. 80, 579–591 (1992)
2. Dong, S., Searls, D.B.: Gene structure prediction by linguistic methods. Genomics 23, 540–551 (1994)
3. Searls, D.B.: String Variable Grammar: a logic grammar formalism for DNA sequences. J. Logic Prog. 24, 73–102 (1995)
4. Searls, D.B.: Formal language theory and biological macromolecules. In: Farach-Colton, M., Roberts, F.S., Vingron, M., Waterman, M. (eds.) Mathematical Support for Molecular Biology, pp. 117–140. American Mathematical Society, Providence (1999)
5. Searls, D.B.: The language of genes. Nature 420, 211–217 (2002)
6. Chiang, D., Joshi, A.K., Searls, D.B.: Grammatical representations of macromolecular structure. J. Comp. Biol. 13, 1077–1100 (2006)
Inferring Regular Trace Languages from Positive and Negative Samples
Antonio Cano Gómez
Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia, Spain
[email protected]
Abstract. In this work, we give an algorithm that infers Regular Trace Languages. Trace languages can be seen as regular languages that are closed under a partial commutation relation called the independence relation. This algorithm is similar to the RPNI algorithm, but it is based on Asynchronous Cellular Automata. For this purpose, we define Asynchronous Cellular Moore Machines and implement the merge operation as the calculation of an equivalence relation. After presenting the algorithm we provide a proof of its convergence (which is more complicated than the proof of convergence of RPNI because there are no Minimal Automata for Asynchronous Automata), and we discuss the complexity of the algorithm.
1
Introduction
This work presents an algorithm that infers Regular Trace Languages. Traces were first introduced by Mazurkiewicz [7] to describe the behavior of concurrent systems. The main idea of traces is to consider that each letter of a given alphabet represents a process. When two processes in a concurrent system can be executed simultaneously, they are considered to be independent, so it does not matter which letter is written first in the word that represents the concurrent system. Mazurkiewicz’s theory of traces has developed very rapidly since its introduction [13,9]. In Grammatical Inference, the inference of finite automata has been a central subject [3,6,8,10]. One of the most popular algorithms is RPNI [8], which has led to many other algorithms that have attempted to improve it. Another option for improving the efficiency of the RPNI algorithm is to work not with all Regular Languages but with some subclass of them. The FCRPNI algorithm [1] was created for this purpose. Although it is based on RPNI, it does not allow a grouping of states if the resulting automaton does not belong to the corresponding subclass (in other words, if it has a forbidden configuration for that class).
Work supported by the project Técnicas de Inferencia Gramatical y aplicación al procesamiento de biosecuencias (TIN2007-60769) supported by the Spanish Ministry of Education and Sciences.
J.M. Sempere and P. García (Eds.): ICGI 2010, LNAI 6339, pp. 11–23, 2010. © Springer-Verlag Berlin Heidelberg 2010
In [4], another idea was introduced: to define a new kind of automata for a given subclass and apply the ideas of RPNI to that new kind of automata. There, the idea was applied to Commutative Regular Languages, and the results for the efficiency and complexity of the algorithm were very good. The problem with this algorithm is that Commutative Regular Languages are a very small subclass of Regular Languages. This is where the inference of Regular Trace Languages might be useful. Regular Trace Languages can be viewed as Regular Languages that are closed under an independence relation under which letters of the alphabet can commute. For instance, if we take equality as the independence relation, we obtain Regular Languages. However, if we take the relation that relates every pair of letters of the alphabet as the independence relation, we obtain Commutative Regular Languages (an overview of subclasses of regular languages that are closed under independence relations can be found in [2]). The aim of our work is to present an algorithm for the inference of Regular Trace Languages, prove its convergence, and analyze its complexity. In Section 2, we present the definitions of the main concepts from Trace Theory and Grammatical Inference that are used in this paper. In Section 3, we introduce the concept of Asynchronous Automata, which are used to recognize Regular Trace Languages. Specifically, we focus on a special kind of Asynchronous Automaton called an Asynchronous Cellular Automaton. We present its formal definition and provide some definitions that are useful for the following sections. In Section 4, we define the adaptation of an Asynchronous Cellular Automaton for a Moore Machine and present the definition and results. In Section 5, we define a version of RPNI that is based on equivalence relations on an Asynchronous Cellular Moore Machine and that could be adapted to an Asynchronous Cellular Automaton. In Section 6, we study our main algorithm. 
In Section 7, we study the convergence of this algorithm. The proof of convergence is not a simple adaptation of the convergence for the RPNI algorithm since there are several Minimal Cellular Asynchronous Automata for a given trace language. Therefore, we need to use the lexicographical order in order to determine which of the irreducible automata the algorithm converges to. In Section 8 we discuss the general complexity of the algorithm, and in Section 9, we present the conclusions of our work and give an overview of possible further work.
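To make the notion of closure under an independence relation concrete, the following sketch enumerates all words obtained from a given word by repeatedly swapping adjacent independent letters. It is only an illustration of the commutation idea described in this introduction, not the inference algorithm of the paper; the function name and the representation of the relation as a set of letter pairs are assumptions.

```python
from collections import deque


def trace_class(word, independence):
    """Return the set of words reachable from `word` by swapping adjacent
    letters a, b whenever (a, b) or (b, a) is in `independence`."""
    symmetric = set(independence) | {(b, a) for a, b in independence}
    seen = {word}
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for i in range(len(current) - 1):
            a, b = current[i], current[i + 1]
            if a != b and (a, b) in symmetric:
                swapped = current[:i] + b + a + current[i + 2:]
                if swapped not in seen:
                    seen.add(swapped)
                    queue.append(swapped)
    return seen


if __name__ == "__main__":
    # Only a and b are independent: c blocks any further commutation.
    print(sorted(trace_class("abc", {("a", "b")})))
    # Every pair of distinct letters is independent: the commutative case.
    print(sorted(trace_class("abc", {("a", "b"), ("a", "c"), ("b", "c")})))
```

With an empty relation every class is a single word, which corresponds to ordinary regular languages; when every pair of distinct letters is independent, the class contains all permutations of the word, which corresponds to the commutative case.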
2
Preliminaries
Let $\Sigma$ be a finite alphabet, whose elements are called letters. We denote the set of all words over $\Sigma$ by $\Sigma^*$. Formally, $\Sigma^*$ with the concatenation operation forms the free monoid with the set of generators $\Sigma$. The empty word, denoted by $\lambda$, plays the role of unit element. Given a set $S$, we denote the set of subsets of $S$ by $\mathcal{P}(S)$. Given two sets $S$ and $T$, we denote the complement of $S$ by $\overline{S}$, the union of $S$ and $T$ by $S \cup T$, the intersection of $S$ and $T$ by $S \cap T$, and the difference of $S$ and $T$ by $S \setminus T = S \cap \overline{T}$.
For any word $x$ of $\Sigma^*$, $|x|$ denotes the length of $x$, and $|x|_a$ denotes the number of occurrences of a letter $a$ in $x$. $\mathrm{Alph}(x)$ denotes the set of all letters appearing in $x$. Given words $p, x$ of $\Sigma^*$, we say that $p$ is a prefix of $x$ if and only if there exists a word $y$ of $\Sigma^*$ such that $x = py$. Given a word $x$ of $\Sigma^*$, we define $\mathrm{Pref}(x) = \{p \in \Sigma^* \mid p \text{ is a prefix of } x\}$. Given a word $x$ of $\Sigma^*$ and a letter $a \in \Sigma$, we define $\mathrm{Pref}_a(x) = \{\lambda\} \cup \{p \in \Sigma^* \mid p \text{ is a prefix of } x \text{ and the last letter of } p \text{ is } a\} = \{\lambda\} \cup (\mathrm{Pref}(x) \cap \Sigma^* a)$. We can extend these last two concepts to languages as usual: given a language $L \subseteq \Sigma^*$ and a letter $a \in \Sigma$, we define $\mathrm{Pref}(L) = \bigcup_{x \in L} \mathrm{Pref}(x)$ and $\mathrm{Pref}_a(L) = \bigcup_{x \in L} \mathrm{Pref}_a(x)$. Given a total order $<$ on $\Sigma$, we can define a lexicographical order