Machine Learning
Edited by
Yagang Zhang
In-Tech
intechweb.org
Published by In-Teh, Olajnica 19/2, 32000 Vukovar, Croatia. Abstracting and non-profit use of the material is permitted with credit to the source. Statements and opinions expressed in the chapters are those of the individual contributors and not necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of information contained in the published articles. The publisher assumes no responsibility or liability for any damage or injury to persons or property arising out of the use of any materials, instructions, methods or ideas contained inside. After this work has been published by In-Teh, authors have the right to republish it, in whole or in part, in any publication of which they are an author or editor, and to make other personal use of the work. © 2010 In-Teh www.intechweb.org Additional copies can be obtained from:
[email protected] First published February 2010 Printed in India Technical Editor: Sonja Mujacic Cover designed by Dino Smrekar Machine Learning, Edited by Yagang Zhang p. cm. ISBN 978-953-307-033-9
Preface

The goal of this book is to present the key algorithms, theory and applications that form the core of machine learning. Learning is a fundamental activity. It is the process of constructing a model of a complex world, and it is also the prerequisite for the performance of any new activity and, later, for the improvement of this performance. Machine learning is concerned with constructing computer programs that automatically improve with experience. It draws on concepts and results from many fields, including artificial intelligence, statistics, control theory, cognitive science, information theory, etc. The field of machine learning has been developing rapidly in recent years, in both theory and applications, and machine learning has been successfully applied to many real-world problems. Machine learning theory attempts to answer questions such as “How does learning performance vary with the number of training examples presented?” and “Which learning algorithms are most appropriate for various types of learning tasks?” Machine learning methods are extremely useful for recognizing patterns in large datasets and for making predictions based on these patterns when presented with new data. A variety of machine learning methods have been developed since the emergence of artificial intelligence research in the mid-20th century. These methods involve the application of one or more automated algorithms to a body of data. Various methods have been developed to evaluate the effectiveness of machine learning methods, and those methods can easily be extended to evaluate the utility of different machine learning attributes as well. Machine learning techniques have the potential to alleviate the complexity of knowledge acquisition. This book presents today’s state and development tendencies of machine learning. It is a multi-author book.
Taking into account the large amount of knowledge about machine learning theory and practice presented in the book, it is divided into three major parts: Introduction, Machine Learning Theory and Applications. Part I focuses on the introduction to machine learning. The author also attempts to promote a new design and development philosophy for thinking machines. Considering the growing complexity and serious difficulties of information processing in machine learning, Part II of the book considers the theoretical foundations of machine learning, mainly including self-organizing maps (SOMs), clustering, artificial neural networks, nonlinear control, fuzzy systems and knowledge-based systems (KBS). Part III contains selected applications of various machine learning approaches, ranging from flight delays, network intrusion, immune systems and ship design to CT, RNA target prediction, and so on.
The book will be of interest to industrial engineers and scientists as well as academics who wish to pursue machine learning. The book is intended for both graduate and postgraduate students in fields such as computer science, cybernetics, system sciences, engineering, statistics, and social sciences, and as a reference for software professionals and practitioners. The wide scope of the book provides them with a good introduction to many basic approaches of machine learning, and it is also a source of useful bibliographical information. Editor:
Yagang Zhang
Contents

Preface
V
PART I INTRODUCTION

1. Machine Learning: When and Where the Horses Went Astray?
001
Emanuel Diamant
PART II LEARNING THEORY

2. SOMs for machine learning
019
Iren Valova, Derek Beaton and Daniel MacLean
3. Relational Analysis for Clustering Consensus
045
Mustapha Lebbah, Younès Bennani, Nistor Grozavu and Hamid Benhadda
4. A Classifier Fusion System with Verification Module for Improving Recognition Reliability
061
Ping Zhang
5. Watermarking Representation for Adaptive Image Classification with Radial Basis Function Network
077
Chi-Man Pun
6. Recent advances in Neural Networks Structural Risk Minimization based on multiobjective complexity control algorithms
091
D.A.G. Vieira, J.A. Vasconcelos and R.R. Saldanha
7. Statistics Character and Complexity in Nonlinear Systems
109
Yagang Zhang and Zengping Wang
8. Adaptive Basis Function Construction: An Approach for Adaptive Building of Sparse Polynomial Regression Models
127
Gints Jekabsons
9. On The Combination of Feature and Instance Selection
157
Jerffeson Teixeira de Souza, Rafael Augusto Ferreira do Carmo and Gustavo Augusto Campos de Lima
10. Fuzzy System with Positive and Negative Rules Thanh Minh Nguyen and Q. M. Jonathan Wu
173
11. Automatic Construction of Knowledge-Based System using Knowware System
189
Sio-Long Lo and Liya Ding
12. Applying Fuzzy Bayesian Maximum Entropy to Extrapolating Deterioration in Repairable Systems
217
Chi-Chang Chang, Ruey-Shin Chen and Pei-Ran Sun
PART III APPLICATIONS

13. Alarming Large Scale of Flight Delays: an Application of Machine Learning
239
Zonglei Lu
14. Machine Learning Tools for Geomorphic Mapping of Planetary Surfaces
251
Tomasz F. Stepinski and Ricardo Vilalta
15. Network Intrusion Detection using Machine Learning and Voting techniques
267
Tich Phuoc Tran, Pohsiang Tsai, Tony Jan and Xiaoying Kong
16. Artificial Immune Network: Classification on Heterogeneous Data
291
Mazidah Puteh, Abdul Razak Hamdan, Khairuddin Omar and Mohd Tajul Hasnan Mohd Tajuddin
17. Modified Cascade Correlation Neural Network and its Applications to Multidisciplinary Analysis Design and Optimization in Ship Design
301
Adeline Schmitz, Frederick Courouble, Hamid Hefazi and Eric Besnard
18. Massive-Training Artificial Neural Networks (MTANN) in Computer-Aided Detection of Colorectal Polyps and Lung Nodules in CT
343
Kenji Suzuki, Ph.D.
19. Automated detection and analysis of particle beams in laser-plasma accelerator simulations
367
Daniela M. Ushizima, Cameron G. Geddes, Estelle Cormier-Michel, E. Wes Bethel, Janet Jacobsen, Prabhat, Oliver Rübel, Gunther Weber, Bernd Hamann, Peter Messmer and Hans Hagen
20. Specificity Enhancement in microRNA Target Prediction through Knowledge Discovery
391
Yanju Zhang, Jeroen S. de Bruin and Fons J. Verbeek
21. Extraction Of Meaningful Rules In A Medical Database
411
Sang C. Suh, Nagendra B. Pabbisetty and Sri G. Anaparthi
22. Establishing and retrieving domain knowledge from semi-structural corpora Hsien-chang WANG, Pei-chin YANG and Chen-chieh LI
427
1

Machine Learning: When and Where the Horses Went Astray?

Emanuel Diamant
VIDIA-mant, Israel
1. Introduction

The year 2006 was exceptionally cruel to me – almost all of the papers I submitted to that year’s conferences were rejected. Not “just rejected” – harshly rejected. Reviewers of the ECCV (European Conference on Computer Vision) were especially severe: "This is a philosophical paper... However, ECCV neither has the tradition nor the forum to present such papers. Sorry..." O my Lord, how can such an injustice be tolerated in this world? On the other hand, it can be easily understood why those people hold their grudges against me: yes, indeed, I always try to take a philosophical stand in all my doings: in thinking, paper writing, problem solving, and so on. In my view, philosophy is not a swear-word. Philosophy is a keen attempt to approach a problem from a more general standpoint, to see the problem from a wider perspective, and to gain, in such a way, a better comprehension of the problem’s specificity and its interaction with other world realities. Otherwise we are doomed to plunge into the chasm of modern alchemy – to sink in partial, task-oriented determinations and restricted solution-space explorations prone to dead-ends and local traps. It is for this reason that when I started to write about “Machine Learning“, I first went to Wikipedia to inquire: What is the best definition of the subject? “Machine Learning is a subfield of Artificial Intelligence“ – was Wikipedia’s prompt answer. Okay. If so, then: “What is Artificial Intelligence?“ – “Artificial Intelligence is the intelligence of machines and the branch of computer science which aims to create it“ – was the response. Very well. Now, the next natural question is: “What is Machine Intelligence?“ At this point, the kindness of Wikipedia was exhausted and I was thrown back, again, to the Artificial Intelligence definition. It was embarrassing how quickly my quest had entered a loop – a somewhat confusing situation for a stubborn philosopher.
Attempts to capitalize on other trustworthy sources were not much more productive (Wang, 2006; Legg & Hutter, 2007). For example, Hutter in his manuscript (Legg & Hutter, 2007) provides a list of 70-odd “Machine Intelligence“ definitions. There is no consensus among the items on the list – everyone (and the citations were chosen from the works of the most prominent scholars currently active in the field) has his own particular view on the subject. Such inconsistency and multiplicity of definitions is an unmistakable sign of
philosophical immaturity and a lack of will to maintain the needed grade of universality and generalization. It goes without saying that the stumbling-block of Hutter’s list of definitions (Legg & Hutter, 2007) is not the adjectives that were used – after all, the terms “Artificial“ and “Machine“ are consensually close in their meaning and therefore are commonly used interchangeably. The real problem is the elusive and indefinable term “Intelligence“. I will not try the readers’ patience and will not tediously explain how and why I arrived at my own definition of the matters that I intend to scrutinize in this paper. I hope that my philosophical leanings will be generously excused and that benevolent readers will kindly accept the unusual (reverse) layout of the paper’s topics. For reasons that will be explained in a little while, the paper’s main and most general idea will be presented first, while its constituent details and components will be exposed (in descending order) afterwards. And this is how the proposed paper’s layout should look: Intelligence is the system’s ability to process information. This statement is true both for natural biological systems and for artificial, human-made systems. By “information processing“ we do not mean its simplest forms, like information storage and retrieval, information exchange and communication. What we have in mind are the high-level information processing abilities, like information analysis and interpretation, structure pattern recognition, and the system’s capacity to make decisions and to plan its own behavior. Information in this case should be defined as a description – a language- and/or alphabet-based description, which results in a reliable reconstruction of an original object (or event) when such a description is carried out, like the execution of a computer program.
Generally, two kinds of information must be distinguished: objective (physical) information and subjective (semantic) information. By physical information we mean the description of data structures that are discernible in a data set. By semantic information we mean the description of the relationships that may exist between the physical structures of a given data set. Machine Learning is defined as the best means for appropriate information retrieval. Its usage is endorsed by the following fundamental assumptions: 1) structures can be revealed by their characteristic features, 2) feature aggregation and generalization can be achieved in a bottom-up manner, where final results are compiled from the component details, 3) rules guiding the process of such compilation can be learned from the data itself. All these assumptions validating Machine Learning applications are false. (Further elaboration of this theme will be given later in the text.) Meanwhile, the following considerations may suffice: Physical information, being a natural property of the data, can be extracted instantly from the data, and no special rules for such task accomplishment are needed. Therefore, Machine Learning techniques are irrelevant for the purposes of physical information retrieval. Unlike physical information, semantics is not a property of the data. Semantics is a property of an external observer that watches and scrutinizes the data. Semantics is assigned to physical data structures, and therefore it cannot be learned straightforwardly from the data. For this reason, Machine Learning techniques are
useless and not applicable for the purposes of semantic information extraction. Semantics is a shared convention, a mutual agreement between the members of a particular group of viewers or users. Its assignment has to be done on the basis of consensus knowledge that is shared among the group members, and which an artificial semantic-processing system has to possess at its disposal. Accommodation and fitting of this knowledge presumes the availability of a different and usually overlooked special learning technique, which would be best defined as Machine Teaching – a technique that would facilitate the transfer of externally prepared knowledge to the system’s disposal. These are the topics that I am interested in discussing in this paper. Obviously, the reverse order proposed above will never be reified – there are paper organization rules and requirements which no one will ever be allowed to override. They must, thus, be reverently obeyed. And I earnestly promise to do this (or at least to try to do this) in this paper.
2. When the State of the Art is Irrelevant

One of the commonly accepted rules prescribes that the Introduction Section has to be succeeded by a clear presentation of the following subject: What is the state of the art in the field and what is the related work done by other researchers? Unfortunately, I am unable to meet this requirement, because (to the best of my knowledge) there is no relevant work in the field that can be used for this purpose. Or, let us put this in a more polite way: the work presented in this paper is so different from other mainstream approaches that it would be unwise to compare it with the rest of the work in the field and to discuss arguments for or against their endless disagreements and discrepancies. However, to avoid any possible allegations of disrespect, I would like to provide here some reflections on the departure points of my work, which (I hope) are common to many friends and foes in the domain. My first steps in the field were inspired by David Marr’s ideas about the “Primal” and the “Two-and-a-half” image representation sketch, which carries the information content of an image (Marr, 1978; Marr, 1982). Image understanding has always been the most relevant and most palpable manifestation of human intelligence, and so those who are busy with replicating intelligence in machines are bound to cope with image understanding and image processing issues. “You see, – I had proudly agitated my employers, trying to convince them to fund my image-processing enterprises, – the meagre lines of a painter’s caricature provide you with all the information needed to comprehend the painter’s intention and to easily recognise the objects drawn in the picture. Edges are the information bearers! Edge exploration and processing will help us to reach advances in pattern recognition and image understanding.” My employers were skeptical and penny-pinching, but nevertheless, I was allowed to do some work.
However, very soon it became clear that my problems were far from being information retrieval issues – my real problem was to run (approximately in real-time fashion) a 3-by-3 (or 5-by-5) operator over a 256-by-256 pixel image. And only then, when the run was somehow successfully completed, was I doomed to find myself inundated with a multitude of edges: cracked, disjoint, and inconsistent. On one hand, an entire spectrum of dissimilar edge pieces; on the other hand, a striking deficit of any hints regarding how to arrange them into something handy and meaningful. Or at least how to choose among them (to
discriminate, to segment, to threshold) those that would be suitable for further processing. Even then, it was not at all certain that anybody knew what such processing should be. It was not only my nightmare. Many people have been swamped in this bog. Many are still trying to tempt fate – even today, the flow of edge extraction and segmentation publications does not dry up, and new machine learning techniques are continually proposed to cure the problem (Ghosh et al., 2007; Awad & Man, 2008; Qiu & Sun, 2009). Human vision physiology studies, which have always been seen as an endless source of computer vision R&D inspiration, have also proved to be of little help here. Treisman’s feature-integration theory (Treisman & Gelade, 1980) and Biederman’s recognition-by-components theory (Biederman, 1987) – the cornerstones of contemporary vision science – fit well the bottom-up image processing philosophy (where initial feature gathering is followed by further feature consolidation), but they have nothing to say about how this feature aggregation and integration (into meaningful perceptible objects) is to be realized. They only say that this process has to be done in a top-down fashion, as opposed to the bottom-up processing of the initial features. To overcome the problem, a great variety of so-called “binding” theories have been proposed (Treisman, 1996; Treisman, 2003). However, all of them have turned out to be inadequate. In a desperate attempt to resolve this irresolvable contradiction, even the theory of a mysterious homunculus has been proposed – a “little man inside the head” that perceives the world through our senses and then unmistakably fulfils all the needed (intelligent) actions (Crick & Koch, 2000). But the theory of the homunculus has not taken root.
Human-level intelligence has been and continues to be a challenge, and nothing in the field has changed since the 1950s, when the first steps of Artificial Intelligence exploration were taken (Turing, 1950; McCarthy et al., 1955).
3. In Search of a Better Fortune

I am not trying to claim that I am cleverer or wiser than others. All the stupid things that others have persistently tried to do, I have repeatedly tried as well. But in one thing, however, I was certainly different from the others – I never neglected my final goal: to grasp the information content of an image. Together with other image-processing “partisans” and “camarados” I fought my pixel-oriented battles, but a dream about object-oriented image processing was always blooming in my heart. As you can understand, nothing worthy came out of that. Nevertheless, some of the things that I was lucky enough to make happen (at that time) are worth mentioning here. For example, I invented a notion of “Single Pixel Information Content” and a way for its quantitative evaluation (Diamant, 2003). I also invented a notion of “Specific Information Density of an Image”, and, relying on the pixel’s information content (measure), I attempted to investigate the effect of “Image Information Content Conservation”. That is, when an image’s scale is successively reduced, its Specific Information Density remains unchanged (or even slightly grows). Then, at some level of reduction, it rapidly declines. This scale, actually the scale one step preceding the drop in Information Density, I thought, should be the most advantageous scale at which to start image information content explorations. A paper containing quantitative results and a proof of this idea was submitted to the British Machine Vision Conference (Diamant, 2002), but, (as usual), was decisively
rejected. Never mind; these investigations led to an important insight: image information content excavation has to commence at an optimal, low-dimensional image representation scale. I am proud to inform interested readers that similar investigations have been performed recently (and similar results attained) by MIT researchers (Torralba, 2009). However, that was done about seven years later, and only in qualitative experiments conducted on human participants (not as quantitative work). In any case, the idea of initial low-dimensional image exploration was in some way consistent with a naïve psychological conjecture about how humans look at a scene. Since biological vision research has always been busy with foveated vision studies only, one principal aspect of human vision has always remained neglected: How does the brain know where to look in a scene? We do not search our field of view in a regular, raster-scan manner. On the contrary, we do this in an unpredictable, but certainly not random, manner (Koch et al., 2007; Shomstein & Behrmann, 2008). If so, how does the brain know where to place the eye’s fovea – the main means for visual information gathering – before it knows in advance where such information is to be found? Certainly, the brain must have prior knowledge about the scene layout, about the general map of a scene. Certainly, the scale of this map must be several orders of magnitude lower than the fovea’s resolution scale, and it is clear that these information-gathering maps are used simultaneously and interchangeably. Such considerations inevitably led us to the conclusion that other theories, currently unknown to us, capable of explaining such multiscale brain performance, have to be urgently searched for. Indeed, very soon I came upon an appropriate theory. And not a single one, but a whole bundle of theories.
In the middle of the 1960s, three almost simultaneous, but absolutely independently developed, theories sprang up: Solomonoff’s theory of Inference (Solomonoff, 1964), Kolmogorov’s Complexity theory (Kolmogorov, 1965), and Chaitin’s Algorithmic Information theory (Chaitin, 1966). Since, among the three, Kolmogorov’s theory is the best known, I will first and mainly refer to it in our further discussion. Just like Shannon’s Information theory (Shannon, 1948), published almost 20 years earlier, Kolmogorov’s theory was aimed at providing means for measuring “information”. However, while Shannon’s theory deals only with the average amount of information conveyed by an outcome of a random source, Kolmogorov’s theory is concerned with the information contained in a particular isolated object. Thus, Kolmogorov’s theory is far more suitable for dealing with the peculiarities of human vision. However, I do not intend to bother the readers with explanations of the merits of Kolmogorov’s theory. Such expanded enlightenment can be found elsewhere, for example (Li & Vitanyi, 2008; Grunvald & Vitanyi, 2008). My humble intention is, relying on the insights of Kolmogorov’s theory, to provide some useful illuminations which can be deduced from the theory and applied to the practice of image information content excavation. An essential part of my work has already been done in past years, and has even been published on several occasions (Diamant, 2004; Diamant, 2005a; Diamant, 2005b). (The publications can easily be found at some freely accessible web repositories, like CiteSeer, Eprintweb, ArXiv, etc., and also on my personal web site: http://www.vidia-mant.info). However, for the consistency of our discussion, I would like to repeat here the main points of these previous publications.
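The distinction drawn above – Shannon's average information of a random source versus Kolmogorov's information content of one particular object – can be made concrete with a small sketch. Kolmogorov complexity itself is uncomputable, so the output length of a general-purpose compressor is used below as the customary upper-bound proxy; the function names and the toy strings are my own illustrative choices, not anything from the cited theories.

```python
import math
import random
import zlib

def shannon_entropy_bits(probs):
    """Shannon: average information per symbol emitted by a random source."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def complexity_proxy_bits(obj: bytes):
    """Kolmogorov complexity is uncomputable; a general-purpose compressor's
    output length is a standard upper-bound proxy for the information
    contained in one particular object."""
    return 8 * len(zlib.compress(obj, 9))

# A fair binary source carries 1 bit per symbol on average; Shannon's
# measure cannot distinguish between its individual outcomes.  But two
# particular 1024-symbol outcomes may carry very different amounts of
# information: a regular string admits a short description, an irregular
# one does not.
regular = b"01" * 512
rng = random.Random(0)
irregular = bytes(rng.choice(b"01") for _ in range(1024))
```

Running this, `shannon_entropy_bits([0.5, 0.5])` is exactly 1.0, while the compressed size of `regular` comes out far smaller than that of `irregular`, even though both are equally probable outcomes of the same source.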
The key point is that information is a description, a certain alphabet-based or language-based description, which Kolmogorov’s theory regards as a program that, being executed, faithfully reproduces the original object (Vitanyi, 2006). In an image, such objects are the visible data structures of which an image is composed. So, a set of reproducible descriptions of image data structures is the information contained in an image. Kolmogorov’s theory prescribes the way in which such descriptions must be created: at first, the most simplified and generalized structure must be described. Recall the Occam’s Razor principle: among all hypotheses consistent with the observation, choose the simplest one that is coherent with the data (Sadrzadeh, 2008). Then, as the level of generalization is gradually decreased, more and more fine-grained image details (structures) become revealed and depicted. This is the second important point, which follows from the theory’s purely mathematical considerations: image information is a hierarchy of decreasing-level descriptions of information details, which unfolds in a coarse-to-fine, top-down manner. (Attention, please! No bottom-up processing is mentioned here! There is no low-level feature gathering and no feature binding!!! The only proper way for image information elicitation is a top-down, coarse-to-fine way of image processing!) The third prominent point, which immediately pops up from the two just mentioned above, is that the top-down manner of image information elicitation does not require the incorporation of any high-level knowledge for its successful accomplishment. It is totally free from any high-level guiding rules and inspirations. (The homunculus has lost his job and is finally fired.) That is why I call the information which can unconditionally be found in an image the Physical Information.
That is, information that reflects objective (physical) structures in an image and is totally independent of any high-level interpretation of the interrelations between them. What immediately follows from this is that high-level image semantics is not an integral part of image information content (as is traditionally assumed). It can no longer be seen as a natural property of an image. Semantic Information, therefore, must be seen as a property of a human observer that watches and scrutinizes an image. That is why we can now say: semantics is assigned to an image by a human observer. This is strongly at variance with contemporary views on the concept of semantic information. As mentioned above, I have no intention of arguing with the mainstream experts, conference chairs, keynote speakers and invited-talk presenters about the validity of my contemplations, about my philosophical inclinations or research duties and preferences. These respected gentlemen will continue to teach you how to extract semantic information from an image or how it should be derived from low-level information features. (I do not provide here examples of such claims. I hope the readers are well enough acquainted with the state of the art in the field and its mainstream developments to be able to recall the appropriate cases by themselves. I also hope that readers of this paper are ready to change their minds – fifty or so years of Machine Learning’s triumphal marching at the head of the Artificial Intelligence parade have not got us closer to the desired goal of Intelligent Machines deployment and use. Partially and loosely defined (or, it would be right to say, undefined) departure points of this enterprise were the main reasons responsible for these years-long wanderings in the desert, far away from the promised land.)
4. “Repetitio est Mater Studiorum”

(For those who are not fluent enough in Latin, the translation of this proverb would be: repetition is the mother of learning.) Okay, I am really sorry that instead of dealing with the declared subject of this paper (that is, Machine Learning and all its corresponding issues), I have to return again and again to topics that have already been discussed in the past and even published on some previous occasions. (But that is the bad luck of an image-processing partisan.) Therefore, with all due apologies, I will continue our discourse with some extended citations seized from my previously published papers.

4.1 Image Physical Information Processing

The first citation is related to physical information processing issues and is taken from a five-year-old paper (Diamant, 2004). The citation’s subject is an algorithmic implementation of image physical information mining principles. The algorithm’s block-scheme looks as follows:

[Fig. 1 schematic: a bottom-up path of successive 4-to-1 image compressions from the original image (Level 0) up to the last (top) level; a top-down path of 1-to-4 expanded object maps with segmentation and classification, object shapes and labeled objects at each descending level; and an object list accumulating top-level object descriptors down to Level 1 objects.]

Fig. 1. The block-diagram of physical information elucidation.

As can be seen in Fig. 1, the proposed schema is comprised of three main processing paths: the bottom-up processing path, the top-down processing path, and a stack where the discovered information content (the generated descriptions of it) is actually accumulated. The algorithm’s structure reflects the principles of information representation that have already been defined previously. As shown in the schema, the input image is initially squeezed to a small size of approximately 100 pixels. The rules of this shrinking operation are very simple and fast: four non-overlapping neighbor pixels in an image at level L are averaged and the result is assigned to a pixel in a higher (L+1)-level image. This is known as the “four children to one parent” relationship. Then, at the top of the shrinking pyramid, the image is segmented, and
each segmented region is labeled. Since the image size at the top is significantly reduced, and since in the course of the bottom-up image squeezing a severe data averaging is attained, the image segmentation/labeling procedure does not demand special computational resources. Any well-known segmentation methodology will suffice. We use our own proprietary technique based on a low-level (single-pixel) information content evaluation (Diamant, 2003), but this is not obligatory. From this point on, the top-down processing path commences. At each level, the two previously defined maps (the average region intensity map and the associated label map) are expanded to the size of the image at the nearest lower level. Since the regions at different hierarchical levels do not exhibit significant changes in their characteristic intensity, the majority of newly assigned pixels are determined in a sufficiently correct manner. Only pixels at region borders and seeds of newly emerging regions may significantly deviate from the assigned values. Taking the corresponding current-level image as a reference (the left-side unsegmented image), these pixels can be easily detected and subjected to a refinement cycle. In such a manner, the process is subsequently repeated at all descending levels until the segmentation/classification of the original input image is successfully accomplished. At every processing level, every image object-region (just recovered or inherited) is registered in the objects’ appearance list, which is the third constituent part of the proposed scheme. The registered object parameters are the available simplified object attributes, such as size, center-of-mass position, average object intensity, and hierarchical and topological relationships within and between the objects (“sub-part of…”, “at the left of…”, etc.). They are sparse, general, and yet specific enough to capture the object’s characteristic features in a variety of descriptive forms.
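The bottom-up squeezing and top-down expansion/refinement steps described above can be sketched in a few lines of NumPy. This is only an illustrative reconstruction under stated assumptions – the chapter's actual top-level segmentation relies on the author's proprietary single-pixel information measure, for which a crude intensity quantization is substituted here; the function names, the bin count and the deviation tolerance are all hypothetical.

```python
import numpy as np

def shrink_4_to_1(img):
    """Bottom-up step: average four non-overlapping neighbor pixels at
    level L into one parent pixel at level L+1 ("four children to one parent")."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2  # crop to even size
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(img, top_pixels=100):
    """Squeeze the input image until the top level holds roughly
    `top_pixels` pixels, keeping every intermediate level."""
    levels = [np.asarray(img, dtype=float)]
    while levels[-1].size > top_pixels:
        levels.append(shrink_4_to_1(levels[-1]))
    return levels  # levels[0] is the original, levels[-1] the top

def segment_top(top, n_bins=4):
    """Stand-in top-level segmentation: quantize intensity into n_bins
    labels (any coarse method suffices at this tiny scale)."""
    lo, hi = top.min(), top.max() + 1e-9
    return ((top - lo) / (hi - lo) * n_bins).astype(int).clip(0, n_bins - 1)

def expand_and_refine(labels, means, lower_img, tol):
    """Top-down step: expand the label map 1-to-4 onto the lower level;
    pixels deviating from their inherited region mean (region borders,
    seeds of new regions) are reassigned to the closest region mean."""
    up = np.repeat(np.repeat(labels, 2, axis=0), 2, axis=1)
    out = np.zeros(lower_img.shape, dtype=int)
    h, w = min(out.shape[0], up.shape[0]), min(out.shape[1], up.shape[1])
    out[:h, :w] = up[:h, :w]
    means = np.asarray(means)
    deviates = np.abs(lower_img - means[out]) > tol
    nearest = np.abs(lower_img[..., None] - means).argmin(axis=-1)
    out[deviates] = nearest[deviates]
    return out

def segment_image(img, tol=10.0):
    """Full pipeline: bottom-up squeeze, top-level segmentation, then
    top-down expansion with refinement at every descending level."""
    levels = build_pyramid(img)
    labels = segment_top(levels[-1])
    for i in range(len(levels) - 2, -1, -1):
        upper = levels[i + 1]
        means = [upper[labels == j].mean() if (labels == j).any() else 0.0
                 for j in range(int(labels.max()) + 1)]
        labels = expand_and_refine(labels, means, levels[i], tol)
    return labels
```

On a toy 16-by-16 image whose left half is dark and right half is bright, the pyramid bottoms out at an 8-by-8 top level, and the recovered labels reproduce the two half-plane regions at the original resolution.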
Examples of the algorithm's performance and some concrete, palpable results are provided in several previously published papers (Diamant, 2005a; Diamant, 2005b). In our current discussion it is worth mentioning that: First, image segmentation (the delineation of physical image structures, the elicitation of physical image information) is performed in a top-down manner, not in the conventional bottom-up mode. Second, the suggested image segmentation principle does not require any knowledge of the high-level rules that are used to support the segmentation process and that are an obligatory part of any bottom-up procedure. Third, canceling the necessity of these high-level rules makes all Machine Learning techniques useless and invalidates all efforts specially carried out to meet this sacred requirement! In this way, Machine Learning loses its role as the main performer in physical information recovery enterprises.

4.2 Image Semantic Information Processing

The content of this sub-section is also an extended quotation from a recently published paper (Diamant, 2008). The key point of this quotation is a semantic information processing architecture based on the same information-defining rules and the same (top-down) information representation principles that were already introduced in Section 3. The block-schema of such a semantic information processing architecture is borrowed from the above-mentioned paper (Diamant, 2008) and is depicted in Fig. 2.
Machine Learning: When and Where the Horses Went Astray
Fig. 2. Physical and Semantic Information processing hierarchies.

Scrutinizing this general image information processing architecture must be preceded by some remarks: Semantic information, which (as we understand now) is a property of an external observer, is separated and dissociated from the processing of physical information, in our case an image. Therefore it must be treated (or modeled) in accordance with observer-specific (his/her) cognitive information processing rules.
It is well known that human cognitive abilities (including the aptness for image interpretation and the capacity to assign semantics to an image) are empowered by the existence of a huge knowledge base about the things in the surrounding world kept in the human brain. This knowledge base is permanently upgraded and updated during the human's life span. So, if we intend to endow our design with some cognitive capabilities, we have to provide it with something equivalent to this (human) knowledge base. It goes without saying that this knowledge base will never be as large and developed as its human prototype. But we are not sure that such a requirement is valid here. After all, humans are also not equal in their cognitive capacities, and the content of their knowledge bases is very diversified as well. (The knowledge base of an aerial photograph interpreter is certainly different from the knowledge base of an X-ray image interpreter, an IVUS image interpreter, or a PET image interpreter.) The knowledge base of our visual thinking machine has to be small enough to be effective and manageable, but sufficiently large to ensure acceptable machine performance. Certainly, for our feasibility study we can be satisfied even with a relatively small, specific-task-oriented knowledge base. The next crucial point is the knowledge representation issue. To deal with it, we must first of all arrive at a common agreement about the meaning of the term "knowledge". (A question that usually does not have a single answer.) We state that in our case a suitable definition would be: "Knowledge is memorized information". Consequently, we can say that knowledge (like information) must be a hierarchy of descriptive items, with the grade of description details growing in a top-down manner at the descending levels of the hierarchy.
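To make the proposed hierarchy a little more concrete, the toy sketch below models such a knowledge base as stories containing words, where each word carries memorized physical-information attributes. Everything here is hypothetical: the stories, the words, and the two-component feature vectors are invented for illustration only, and recognition is reduced to a nearest-neighbor similarity test over those attributes, after which we "climb back up" from the matched word to its story.

```python
import math

# Hypothetical toy knowledge base: stories -> words -> memorized
# physical-information attributes (invented numeric feature vectors).
KNOWLEDGE_BASE = {
    "a walk in the park": {
        "tree": [0.9, 0.1],   # e.g. (greenness, roundness) -- made up
        "ball": [0.2, 0.9],
    },
    "breakfast at home": {
        "plate": [0.1, 0.8],
        "toast": [0.5, 0.3],
    },
}

def recognize(observed, kb=KNOWLEDGE_BASE):
    """Similarity test: match observed physical information against the
    memorized attributes, then climb back up the hierarchy to recover
    the linguistic label (word) and its context (story)."""
    best = None
    for story, words in kb.items():
        for word, attrs in words.items():
            d = math.dist(observed, attrs)  # Euclidean similarity test
            if best is None or d < best[0]:
                best = (d, word, story)
    _, word, story = best
    return word, story
```

A feature vector close to the memorized "tree" attributes would thus be labeled "tree" and placed in the context of the "a walk in the park" story.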
One more point that must be mentioned here is that these descriptions have to be implemented in some alphabet (as in the case of physical information) or in a description language (which better fits the semantic information case). Any further argument being put aside, we will declare that the most suitable language in our case is natural human language. After all, the real knowledge bases we are familiar with are implemented in natural human languages. The next step, then, is predetermined: if natural language is a suitable description implement, the suitable form of this implementation is a narrative, a story (Tuffield et al., 2005). If the description hierarchy can be seen as an inverted tree, then the branches of this tree are the stories that encapsulate the human's experience with the surrounding world. And the leaves of these branches are the single words (single objects) from which the story parts (single scenes) are composed. The descent into description details, however, does not stop here, and each single word (single object) can be further decomposed into its attributes and the rules that describe the relations between those attributes. At this stage the physical information reappears. Because words are usually associated with physical objects in the real world, a word's attributes must be seen as memorized physical information (descriptions). Once derived (by the human visual system) from the observable world and learned to be associated with a particular word, these physical information descriptions are soldered into the knowledge base. Object recognition, thus, turns out to be a comparison and similarity test between the currently acquired physical information and that already retained in memory. If the similarity test is successful, then starting from this point in the hierarchy and climbing back up the knowledge base ladder
we will obtain: first, the linguistic label for a recognized object; second, the position of this label (word) in the context of the whole story; and third, the ability to verify the validity of the initial guess by testing the appropriateness of the neighboring parts composing the object, that is, the context of the story. In this way, a meaningful categorization of the object can be reached, and the first stage of image annotation can be successfully accomplished, providing the basis for further meaningful (semantic) image interpretation. One question has remained untouched in our discourse: How is this artificial knowledge base to be initially created and brought to our thinking machine's disposal? This subject deserves a special discussion.

4.3 Can Semantic Knowledge be Learned?

There is no need to reiterate the dictums of today's Internet revolution, where access to and exchange of semantic information is viewed as the prime and ultimate goal. Machines are supposed to handle documents' semantic content, and the Machine Learning techniques supporting this knowledge mining venture are thus supposed to be the leading force, the centre forward of this exciting enterprise. Semantic knowledge mining is now the hottest topic of every conference discussion, most recent research projects and many other applied science initiatives. However, in the light of our new definition of information, which was recently introduced in my work and re-introduced in Section 3 of this paper, I am really skeptical about Machine Learning's ability to meet this challenge. Again, some philosophy would not be out of place here. At first, it must be reiterated that semantics is not a natural property of an image (or a natural property of the data, if you would like a more general view on the subject). Semantics is a property of a human observer that watches and scrutinizes the data, and this property is shared between the observer and other members of his community.
By the way, this community does not have to embrace the whole of mankind; it can even be a very small community of several people or so, who nevertheless were lucky enough to establish a common view on a particular subject and a common understanding of its meaning. That is the reason why this particular (private) knowledge cannot be attained in any reasonable way, including Machine Learning techniques and tricks. On the other hand, an intelligent information-processing system has to have at its disposal a memorized knowledge base hierarchy (implemented, as we postulate, as a collection of typical stories) against which the physical information from its input sensors is matched and associated. Finding the suitable story whose attributes most closely match the sensors' physical information is equivalent to finding the interpretation of the input sensor data (the input physical information). That is the novelty of our proposed architecture. That is the most important feature of our design approach: the knowledge base hierarchy is used for a linguistic interpretation of the input, but this knowledge is not derived (by the system) from the input data. It is not learned from the data. On the contrary, in accordance with the top-down information unfolding principle, the knowledge base hierarchy (as a whole) has to be transferred to the system's disposal from the outside. The system does not learn the knowledge base; it is taught to use the knowledge base (in our case, a pool of task-related stories and their ramifications put at the system's disposal in advance). Thus, providing the system with the needed new knowledge each time the system is due for a new task accomplishment becomes a natural duty of the Artificial Intelligence (Machine Intelligence) system designer. This shift from a Machine Learning to a Machine Teaching paradigm should become the key point of intelligent system design and
development roadmap. But unfortunately, that has not happened, although it has been about three years since the idea was first articulated and even occasionally published (Diamant, 2006b).

4.4 Some additional remarks

This is a very important and interesting twist in the philosophy of intelligent artificial systems design. It does not result from an understanding of the principles of biological systems' intelligence or other proudly declared biological inspirations. On the contrary, it results from purely mathematical considerations of Kolmogorov's complexity theory. Only now, drawing on the insights of Kolmogorov's theory, can we seize the interpretation of the facts that we usually come across in our natural (biological) surroundings. There is a very subtle issue among the topics of machine intelligence that I would like to address. "Biologically inspired" means that we borrow from nature some fruitful ideas, which we intend to replicate in our artificial designs. That is, we presume that we understand, or at least are very close to understanding, how some biological mechanisms operate in performing their natural duties. But that is not true! We do not have even the slightest hint about how nature works. What we have are gambling guesses, intuitive feelings, speculations, and nothing more than that. Another important remark in this regard is that Nature is not an Engineer. It does not invent new mechanisms and new solutions for its problem-solving. On the contrary, it gradually adjusts and adapts what it already has on hand. Although the final results are really remarkable, it takes a lot of time to reach them in the course of natural evolution: millions and billions of years. Despite all this, nature has never arrived at some very important human-life-shaping revelations: for example, the wheel (as a means of transportation), cooked food, the practice of writing and numbering, etc.
The inventors of "Genetic Programming" provide very interesting quotations from Turing's early works concerning Machine Intelligence (Koza et al., 1999; Koza et al., 2002). In his 1948 essay "Intelligent Machinery", Alan Turing identified three broad approaches by which machine intelligence could be achieved: "One approach… is a search through the space of integers representing candidate computer programs, (a logic-driven search)… Another approach is the "cultural search" which relies on knowledge and expertise acquired over a period of years from others. This approach is akin to present-day knowledge-based systems… The third approach is "genetical or evolutionary search"…" (Koza et al., 1999). Of the three, the inventors of Genetic Programming pick up the idea of biological relevance to the problem of machine intelligence acquisition. However, from our point of view (the point of view inspired by Kolmogorov's theory) this cannot be right. Genetic Programming and Neural Networking are pure bottom-up information-processing approaches. As we know today, the right way of information retrieval is a top-down, coarse-to-fine approach. Therefore, it might be more intelligent to embrace Turing's first alternative, the logic-driven approach. That is, to rely on purely logical and engineering approaches to find out the best ways of reifying intelligence, and only then to verify our hypothetical solutions against known (or unknown) biological evidence and facts. That is exactly what we intend to do now. The first issue is the bottom-up versus top-down information-processing alternative. Despite the traditional dominance of the bottom-up approach, evidence for top-down preliminary processing in biological vision systems has been present in the research literature since the
late 1970s and early 1980s (Navon, 1977; Chen, 1982). Unfortunately, these findings were overlooked by both the biological and the computer vision communities. The next phenomenon that is usually misunderstood (and consequently mistreated) is knowledge transfer (in the human and animal world), which is usually and mistakenly defined as a learning process. We have proposed a more suitable definition: a teaching process. Indeed, it turns out that in nature, teaching is a universal and widespread phenomenon. Only recently has this fact become recognized and earned careful investigation (Csibra, 2007; Hoppitt et al., 2008). Teaching in nature does not mean human-like mentoring; animals do not possess spoken language capabilities. Teaching in nature means specific semantic knowledge transfer, specific information relocation from a teacher to a pupil, from one community member to another. And examples of this knowledge transfer are really abundant in our surroundings, if only we are ready to look at them and see them in a proper way. In this regard, dancing bees that convey to the rest of the hive information about melliferous sites (Zhang et al., 2005), ants that teach in tandem (Franks & Richardson, 2006), and even bacteria developing antibiotic resistance as a result of so-called horizontal gene transfer, in which a single DNA fragment of one bacterium is disseminated among other colony members (Lawrence & Hendrickson, 2003): all these examples convincingly support our claim that meaningful information (the semantic knowledge base) is always transferred to the individual information-processing system from the outside, from the external world. The system does not learn it in the traditionally assumed Machine Learning manner. Another benefit that biological science can gain from our logically driven (engineering) approach concerns the issue of astrocyte-neuron communication.
Only by defining information as a description message can one explain how astrocytes (the dominant glial cells) "listen and talk" with neuronal and synaptic networks. In their paper, Volterra & Meldolesi wrote that: "One reason that the active properties of astrocytes have remained in the dark for so long relates to the differences between the excitation mechanisms of these cells and those of neurons. Until recently, the electrical language of neurons was thought to be the only form of excitation in the brain. Astrocytes do not generate action potentials, they were considered to be non-excitable and, therefore, unable to communicate. The finding that astrocytes can be excited non-electrically has expanded our knowledge of the complexity of brain communication to an integrated network of both synaptic and non-synaptic routes" (Volterra & Meldolesi, 2005). That is, the traditional belief that a spiking neuron burst is the valid form of information exchange and representation no longer holds, and has to be replaced by a chemical, molecular-language-based description-message transfer mechanism. A very important issue in our discussion of semantic information processing is the issue of knowledge representation. As was already mentioned above, and as also stems from the insights of Kolmogorov's theory, the best form of knowledge representation is a language-based description: a narrative, a story. I do not intend to expand here on the implementation details of this issue; I would like to keep our discussion on a philosophical level. What follows from this is that we always have to consider intelligence as being carried out in a language, in a linguistic structure. That is, although the block-schema depicted in Fig. 2 outlines only the incorporation of visual information into the semantic processing hierarchy, one can easily imagine physical information of other modalities (hearing, haptics, olfactory sensory information) being incorporated (usually in parallel with information from other sensors) as attributes of semantic (linguistic) objects into the
knowledge base processing hierarchy. (That again explains why functional Magnetic Resonance Imaging shows that visual stimuli are processed in audio stimuli processing zones, which are naturally associated with speech processing. The simple explanation is that all modalities are finally processed in the language processing zone, as proposed by our approach.) The next important issue, which naturally follows the preceding ones, is the narrative, story form of knowledge representation accepted for the discussed case of semantic information processing. Linguistic tagging, that is, labeling image objects with words, is a well-known and widely used methodology of image semantics retrieval supported by a special class of Machine Learning techniques (Barnard et al., 2003; Duygulu et al., 2008; Blondin Masse et al., 2008). This way of thinking naturally stems from another widespread assumption: that an ontology (the basis of semantic reasoning and elaboration) is a vocabulary, a thesaurus, a dictionary. What follows from our descriptive form of knowledge representation is that an ontology has to be treated as a story, a narrative, which naturally describes the system's behavior in various real-life situations. However, this very important aspect of intelligent systems design philosophy leads us far away from the main theme of our discussion, the philosophy of Machine Learning. For that reason I will stop at this point and not continue further.
5. Conclusions

In this paper I have attempted to promote a new Thinking Machines design and development philosophy. The central point of my approach is a new definition of information, that is, a notion of information as a language-based description. Above it, the notion of intelligence can then be placed, defining intelligence as the system's ability to process information. The principles of information mining should be placed in the lower part of the construction. Thus, it seems to me, a proper frame for a rational Artificial or Machine Intelligence research and development enterprise can be established. Essentially, the declared focus of the paper is the issue of Machine Learning, which is assumed to be a bundle of techniques used to support all information-processing machinery. But, as you know, Machine Learning has by now (and already for a very long time) been treated as an independent and stand-alone discipline, totally detached from its original destination: Thinking Machines research and development (Turing, 1950). The roadmap for this challenge was formulated at the Dartmouth College meeting in the summer of 1956 (McCarthy et al., 1955). The date of this meeting is considered today as the birthday of Artificial Intelligence (AI). (The very name of AI was coined at this time by John McCarthy, one of the authors of the Dartmouth Proposal.) At first, the excitement and hopes were really high, and the goals seemed reachable in a short while. In the Panel Discussion at the Artificial General Intelligence (AGI) Workshop in 2006, Steve Grand recalled that "Rodney Brooks has a copy of a memo from Marvin Minsky (another father of the Dartmouth Proposal), in which he suggested charging an undergraduate for a summer project with the task of solving vision. I don't know where that undergraduate is now, but I guess he hasn't finished yet" (Panel Discussion, 2006).
Indeed, the problems of Vision, as well as all other AI troubles, have turned out to be much more complicated and problematic than they looked at the beginning. Within a decade or
so, it became clear that AI's tribulations are immense, maybe even intractable. As a consequence, the AI community has to a large extent abandoned its original dream and turned to more "practical" and "manageable" problems (Wang & Goertzel, 2006). "AI has evolved to being a label on a family of relatively disconnected efforts" (Brachman, 2005). Exactly the same were the milestones of Machine Learning. Machine Learning, which was always perceived as an indispensable component of intelligence, has undergone the same metamorphoses as its general domain. At first, there was a generous offer to let the system, by itself (in an autonomous manner), find out the best way to mimic Intelligence. Why trouble oneself trying to grasp the principles of intelligence? Let us give the machine the chance to do this by itself. (I cannot withstand the temptation to provide an example of such a fatal misunderstanding: IGI Global Publisher (formerly Idea Group Inc.) published a Call for Chapter Proposals for a future book, "Intelligent Systems for Machine Olfaction: Tools and Methodologies" (available at the publisher's site: http://www.igi-global.com/requests/details.asp?ID=610). In its Introduction you can read: "Intelligent systems are those that, given some data, are able to learn from that data. This ability makes it possible for complex systems to be modeled and/or for performance to be predicted. In turn it is possible to control their functionality through learning/training, without the need for a priori knowledge of the system's structure". Once more, I apologize for such a long quotation.) Then, when the first idealistic objective failed, Machine Learning was broken into pieces, disintegrated and fragmented into many partial tasks and goals. Now the question in the paper's title, "When and Where the Horses Went Astray?", can be answered beyond any doubt: it happened about 50 years ago!
From the standpoint that we possess today, we can even spell out the fundamental flaws responsible for this derailment. First, the bottom-up philosophy of information retrieval. (As we know today, the right way to treat information is the top-down, coarse-to-fine line of information processing.) Second, the lack of a proper definition of information, leading, consequently, to a lack of a clear distinction between physical and semantic information. (This failure had a tremendous impact on the disruption of Machine Learning.) The same can be said about the third misleading factor: a misunderstanding of the very nature of semantic information, which has led to an endless, infamous race to extract knowledge and semantic meaning directly from the raw data. (Which is, obviously, a philosophical lapse.) For the same reasons, the basic notion of intelligence has been overlooked and defined erroneously. I hope that in this paper I have managed to repair some of these misconceptions.
6. References

Awad, A. & Man, H. (2008). Similar Neighbourhood Criterion for Edge Detection in Noisy and Noise-Free Images, Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 483-486, Wisla, Poland, October 2008.
Barnard, K.; Duygulu, P.; Forsyth, D.; de Freitas, N.; Blei, D. & Jordan, M. (2003). Matching Words and Pictures, Journal of Machine Learning Research, Vol. 3, pp. 1107-1135.
Biederman, I. (1987). Recognition-by-Components: A Theory of Human Image Understanding, Psychological Review, Vol. 94, No. 2, 1987, pp. 115-147.
Blondin Masse, A.; Chicoisne, G.; Gargouri, Y.; Harnad, S.; Picard, O. & Marcotte, O. (2008). How Is Meaning Grounded in Dictionary Definitions? Available: http://arxiv.org/abs/0806.3710.
Brachman, R. (2005). Getting Back to "The Very Idea". AI Magazine, Vol. 26, pp. 48-50, Winter 2005.
Chaitin, G. (1966). On the length of programs for computing finite binary sequences. Journal of the ACM, Vol. 13, pp. 547-569, 1966.
Chen, L. (1982). Topological structure in visual perception, Science, 218, pp. 699-700, 1982.
Crick, F. & Koch, C. (2000). The Unconscious Homunculus, In: The Neuronal Correlates of Consciousness, Metzinger, T. (Ed.), pp. 103-110, MIT Press: Cambridge, MA, 2000.
Csibra, G. (2007). Teachers in the wild. Trends in Cognitive Science, Vol. 11, No. 3, pp. 95-96, March 2007.
Diamant, E. (2002). Image Segmentation Scheme Ruled by Information Density Optimization, Submitted to British Machine Vision Conference (BMVC-2002) and decisively rejected there. Available: http://www.vidia-mant.info.
Diamant, E. (2003). Single Pixel Information Content, Proceedings SPIE, Vol. 5014, pp. 460-465, IS&T/SPIE 15th Annual Symposium on Electronic Imaging, Santa Clara, CA, January 2003.
Diamant, E. (2004). Top-Down Unsupervised Image Segmentation (it sounds like an oxymoron, but actually it isn't), Proceedings of the 3rd Pattern Recognition in Remote Sensing Workshop (PRRS'04), Kingston University, UK, August 2004.
Diamant, E. (2005a). Searching for image information content, its discovery, extraction, and representation, Journal of Electronic Imaging, Vol. 14, Issue 1, January-March 2005.
Diamant, E. (2005b). Does a plane imitate a bird? Does computer vision have to follow biological paradigms?, In: De Gregorio, M., et al. (Eds.), Brain, Vision, and Artificial Intelligence, First International Symposium Proceedings, LNCS, Vol. 3704, Springer-Verlag, pp. 108-115, 2005. Available: http://www.vidia-mant.info.
Diamant, E. (2006a). In Quest of Image Semantics: Are We Looking for It Under the Right Lamppost?, Available: http://arxiv.org/abs/cs.CV/0609003; http://www.vidia-mant.info.
Diamant, E. (2006b). Learning to Understand Image Content: Machine Learning Versus Machine Teaching Alternative, Proceedings of the 4th IEEE Conference on Information Technology: Research and Education (ITRE-2006), Tel-Aviv, October 2006.
Diamant, E. (2007). The Right Way of Visual Stuff Comprehension and Handling: An Information Processing Approach, Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC-2007), Hong Kong, August 2007.
Diamant, E. (2008). Unveiling the mystery of visual information processing in human brain, Brain Research, Vol. 1225, 15 August 2008, pp. 171-178.
Duygulu, P.; Bastan, M. & Ozkan, D. (2008). Linking image and text for semantic labeling of images and videos, In: Machine Learning Techniques for Multimedia, M. Cord & P. Cunningham (Eds.), Springer-Verlag, 2008.
Floridi, L. (2003). From Data to Semantic Information, Entropy, Vol. 5, pp. 125-145, 2003.
Franks, N. & Richardson, T. (2006). Teaching in tandem-running ants, Nature, 439, p. 153, January 12, 2006.
Gerchman, Y. & Weiss, R. (2004). Teaching bacteria a new language. Proceedings of The National Academy of Sciences of the USA (PNAS), Vol. 101, No. 8, pp. 2221-2222, February 24, 2004.
Ghosh, K.; Sarkar, S. & Bhaumik, K. (2007). The Theory of Edge Detection and Low-level Vision in Retrospect, In: Vision Systems: Segmentation and Pattern Recognition, G. Obinata and A. Dutta (Eds.), I-Tech Publisher, Vienna, June 2007.
Goertzel, B. (2006). Panel Discussion: What are the bottlenecks, and how soon to AGI?, Proceedings of the Artificial General Intelligence Workshop (AGI 2006), Washington DC, May 2006.
Grünwald, P. & Vitanyi, P. (2008). Algorithmic Information Theory, In: The Handbook of the Philosophy of Information, P. Adriaans, J. van Benthem (Eds.), pp. 281-320, North Holland, 2008. Available: http://arxiv.org/abs/0809.2754.
Hoppitt, W.; Brown, G.; Kendal, R.; Rendell, L.; Thornton, A.; Webster, M. & Laland, K. (2008). Lessons from animal teaching. Trends in Ecology and Evolution, Vol. 23, No. 9, pp. 486-493, September 2008.
Hutter, M. (2007). Algorithmic Information Theory: A brief non-technical guide to the field, Available: http://arxiv.org/abs/cs/0703024.
Koch, C.; Cerf, M.; Harel, J. & Einhauser, W. (2007). Predicting human gaze using low-level saliency combined with face detection, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems (NIPS 2007), Vancouver, Canada, December 2007. Available: http://papers.klab.caltech.edu/view/year/2007.html.
Kolmogorov, A. (1965). Three approaches to the quantitative definition of information, Problems of Information Transmission, Vol. 1, No. 1, pp. 1-7, 1965.
Koza, J.; Bennett, F.; Andre, D. & Keane, M. (1999). Genetic Programming: Turing's Third Way to Achieve Machine Intelligence. EUROGEN Workshop in Jyväskylä, Finland, May 1999. Available: http://www.genetic-programming.com/jkpdf/eurogen1999.
Koza, J.; Bennett, F.; Andre, D. & Keane, M. (2002). Genetic Programming: Biologically Inspired Computation that Exhibits Creativity in Solving Non-Trivial Problems. In: Evolution as Computation: DIMACS Workshop, Princeton, 2002. Available: http://gridley.res.carleton.edu/~kachergg/docs/geneticProgramming.pdf.
Lawrence, J. & Hendrickson, H. (2003). Lateral gene transfer: when will adolescence end?, Molecular Microbiology, Vol. 50, No. 3, pp. 739-749, 2003.
Legg, S. & Hutter, M. (2007). Universal Intelligence: A Definition of Machine Intelligence, Available: http://arxiv.org/abs/0706.3639.
Li, M. & Vitanyi, P. (2008). An Introduction to Kolmogorov Complexity and Its Applications, Third Edition, New York: Springer-Verlag, 2008.
McCarthy, J.; Minsky, M.; Rochester, N. & Shannon, C. (1955). A proposal for the Dartmouth summer research project on Artificial Intelligence, AI Magazine, Vol. 27, No. 4, 2006. Available: http://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/1904/1802.
Marr, D. (1978). Representing visual information: A computational approach, Lectures on Mathematics in the Life Sciences, Vol. 10, pp. 61-80, 1978.
Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, Freeman, San Francisco, 1982.
Navon, D. (1977). Forest Before Trees: The Precedence of Global Features in Visual Perception, Cognitive Psychology, 9, pp. 353-383, 1977.
Panel Discussion (2006). What are the bottlenecks, and how soon to AGI?, Proceedings of the AGI Workshop, Washington DC, USA, May 2006.
Qiu, P. & Sun, J. (2009). Using Conventional Edge Detectors and Post-Smoothing for Segmentation of Spotted Microarray Images, Journal of Computational and Graphical Statistics, Vol. 18, No. 1, pp. 147-164, 2009.
Saba, W. (2008). Commonsense Knowledge, Ontology and Ordinary Language. International Journal of Reasoning-based Intelligent Systems, Vol. n., No. m., pp. 43-60, 2008. Available: http://arxiv.org/abs/0808.1211.
Sadrzadeh, M. (2008). Occam's razor and reasoning about information flow, Available: http://arxiv.org/abs/cs/0808.1354.
Shannon, C. E. (1948). The mathematical theory of communication, Bell System Technical Journal, Vol. 27, pp. 379-423 and 623-656, July and October 1948.
Shomstein, S. & Behrmann, M. (2008). Object-based attention: Strength of object representation and attentional guidance. Perception & Psychophysics, Vol. 70, No. 1, pp. 132-144, January 2008.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Information and Control, Part 1: Vol. 7, No. 1, pp. 1-22, March 1964; Part 2: Vol. 7, No. 2, pp. 224-254, June 1964.
Torralba, A. (2009). How many pixels make an image? Visual Neuroscience, Vol. 26, Issue 1, pp. 123-131, 2009. Available: http://web.mit.edu/torralba/www/.
Treisman, A. (1996). The binding problem. Current Opinion in Neurobiology, Vol. 6, pp. 171-178, 1996.
Treisman, A. (2003). Consciousness and perceptual binding. Available: http://www.csbmb.princeton.edu/conte/pdfs/project2/Proj2Pub5anne.pdf.
Treisman, A. & Gelade, G. (1980). A feature-integration theory of attention, Cognitive Psychology, Vol. 12, pp. 97-136, January 1980.
Tuffield, M.; Shadbolt, N. & Millard, D. (2005). Narratives as a Form of Knowledge Transfer: Narrative Theory and Semantics, Proceedings of the 1st AKT (Advanced Knowledge Technologies) Symposium, Milton Keynes, UK, June 2005.
Turing, A. (1950). Computing machinery and intelligence. Mind, Vol. 59, pp. 433-460. Available: http://scholar.google.co.il/.
Vitanyi, P. (2006). Meaningful Information, IEEE Transactions on Information Theory, Vol. 52, No. 10, pp. 4617-4624, October 2006. Available: http://www.cwi.nl/~paulv/papers.
Volterra, A. & Meldolesi, J. (2005). Astrocytes, from brain glue to communication elements: the revolution continues, Nature Reviews Neuroscience, Vol. 6, No. 8, pp. 626-640.
Wang, P. (2006). The Logic of Intelligence. In: Artificial General Intelligence, Wang, P. & Goertzel, B. (Eds.), pp. 31-62. Springer-Verlag, May 2006. Available: http://nars.wang.googlepages.com/nars%3Application.
Wang, P. & Goertzel, B. (2006). Introduction: Aspects of Artificial General Intelligence. In: Artificial General Intelligence, Wang, P. & Goertzel, B. (Eds.), Springer-Verlag, 2006. Available: http://nars.wang.googlepages.com/nars%3Application.
Zhang, S.; Bock, F.; Si, A.; Tautz, J. & Srinivasan, M. (2005). Visual working memory in decision making by honey bees, Proceedings of The National Academy of Sciences of the USA (PNAS), Vol. 102, No. 14, pp. 5250-5255, April 5, 2005.
SOMs for machine learning

Iren Valova, Derek Beaton and Daniel MacLean
University of Massachusetts Dartmouth USA
1. Introduction

In this chapter we offer a survey of self-organizing feature maps with emphasis on recent advances, and more specifically, on growing architectures. Several of the methods were developed by the authors and offer a unique combination of theoretical fundamentals and neural network architectures. This survey of dynamic architectures also includes examples of application domains, usage, and resources for learners and researchers alike to pursue their interest in SOMs. The primary reason for pursuing this branch of machine learning is that these techniques are unsupervised, requiring no a priori knowledge or trainer. As such, SOMs lend themselves readily to difficult problem domains in machine learning, such as clustering, pattern identification and recognition, and feature extraction. SOMs utilize competitive neural network learning algorithms introduced by Kohonen in the early 1980s. SOMs maintain the features (in terms of vectors) of the input space the network is observing. This chapter, as work emphasizing dynamic architectures, would be incomplete without presenting the significant achievements in SOMs, including the work of Fritzke and his growing architectures. To exemplify more modern approaches we present state-of-the-art developments in SOMs. These approaches include parallelization (ParaSOM, developed by the authors), incremental learning (ESOINN), connection reorganization (TurSOM, developed by the authors), and function space organization (mnSOM). Additionally, we introduce some methods of analyzing SOMs. These include methods for measuring the quality of SOMs with respect to input, neighbors and map size. We also present techniques of posterior recognition, clustering and input feature significance. In summary, this chapter presents a modern gamut of self-organizing neural networks, and measurement and analysis techniques.
2. Overview of competitive learning

2.1 Unsupervised and competitive learning
Very broadly defined, neural networks learn by example and mimic the human brain in its decision-making or object identification capabilities. The concept of artificial neural networks (ANN) is based on two different views of human brain activity, both of which rely on the functionality of a single neuron. The neurons are perceived as adding devices, which react,
or fire, once the sum of incoming signals reaches a threshold level. Fig.1 illustrates the functionality of a single neuron, which receives signals from other neurons it is connected to via weighted synapses. Upon reaching the firing level, the neuron will broadcast a signal to the units connected to its output.

Fig. 1. Artificial neuron functionality

Returning to the two major types of ANN, one view generally relies on the individual neuron and its ability to respond, or fire, given sufficient stimulus. The topology of neurons and the connections among them is not the goal of this type of ANN, but rather the output produced by the respective neuron. The second view relies on the neurons functioning as a team. As such, it takes into account the concept of a map formed by neuron positions, much like the visual cortex map, producing a two-dimensional image of the perceived visual field. This type of ANN produces a topology of neurons, connected by weighted synapses, and features the natural grouping of the input data (Fig.2). This translates into an input density map and necessitates the development of evaluation procedures on the formed clusters for the purpose of identifying or matching patterns in the data.
Fig. 2. The black area denotes input space distribution, where the neurons have organized to cover that input topology as a team

The taxonomy of learning methods and algorithms for ANN is multifaceted and includes many hierarchical classifications (Fig.3). In this chapter we are concerned with unsupervised learning that is also competitive. Learning in ANN is the process of connection weight adjustment, which, in turn, guides the neuron to a better position in terms of input data configuration. In the case of supervised learning, the weight adjustment will be guided by the teaching signal and the penalty/reward of the error in the ANN response. Unsupervised learning methods do not benefit from teacher signal guidance. The neurons compete to match the input as closely as possible, usually based on Euclidean distance. The neuron closest to the considered input exemplar is the winner taking it all, i.e. adjusting its weight to improve its position and thus move closer to the input.
We are describing the extreme case of competition, i.e. winner-take-all. Depending on the learning algorithm and the ANN application, while the winning neuron will be selected, a neighbourhood of influence may be established, whereby the winning neuron's neighbours will also move in the same direction, albeit over a lesser distance.

Fig. 3. Taxonomy of ANN learning methods (ANN training paradigms: supervised: Hopfield, backpropagation; unsupervised: associative memory, ART, SOM)

Unsupervised learning (UL) is generally based on competition. UL seeks to map the grouping or patterns in the input data. This can be accomplished either by neurons resonating with the input exemplar (e.g. adaptive resonance theory) or by neurons winning the distance-from-the-input-exemplars competition. It must be noted that there are ANN models that learn in an unsupervised manner but are not based on competition. Among those, the principal component network should be mentioned here as a prelude to later sections in this chapter.

2.2 Kohonen's SOM
The brains of higher animals are organized by function, e.g. the visual cortex processes the information received through the optical nerve from the eyes, the somatosensory cortex maps the touch information from the surface of the body, etc. Inspired by the mapping abilities of the brain, the self-organizing feature map (SOM) was introduced in the early 1980s by Teuvo Kohonen (Kohonen, 1995). SOMs are used to topologically represent the features of the input based on similarity, usually measured by Euclidean distance. SOMs are useful tools in solving visualization and pattern recognition tasks as they map a higher-dimensional input space into a one- or two-dimensional structure. SOMs are usually initialized randomly (Fig.4b), in a topology with a fixed number of neurons, which can be ordered in a chain (i.e. each neuron has at most two neighbors) or in a two-dimensional grid of rectangular (Fig.4c) or hexagonal nature, where the neurons have at most four or six neighbors, respectively.
Fig. 4. Input space: a) distribution; b) with randomly initialized neurons; c) two-dimensional rectangular grid
Before a brief overview of the SOM algorithm, let us take the reader through the concept of Kohonen's ANN. The input space of Fig.4a is used along with the random initialization in Fig.4b. As every input vector (in this case a two-dimensional x, y representation of the location of each dot comprising the black area in Fig.4a) is presented to the network, the closest neuron responds as a winner of the Euclidean distance-based competition and updates its weight vector to be closer to the input just analyzed. So do its neighbors, depending on the neighborhood radius, which can be reduced as time progresses; the rate of reduction is determined by the learning rate. As inputs are presented in random order, the neurons move closer to the input vectors they can represent and, eventually, the movement of neurons becomes negligible. Usually, that is when the map is considered to have converged. This process is illustrated in Fig.5a for a one-dimensional SOM and Fig.5b for a rectangular SOM.
Fig. 5. Converged SOM: a) one-dimensional topology; b) two-dimensional topology
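The process just described can be condensed into a short sketch. This is an illustrative implementation, not the authors' code; the decay schedules, the Gaussian neighborhood function and all parameter values are our own choices:

```python
import numpy as np

def train_som(inputs, n_neurons=20, n_iter=2000, lr0=0.5, radius0=5.0, seed=0):
    """Train a one-dimensional (chain) SOM on 2-D inputs.

    Neuron weights are initialized randomly, the best-matching unit is
    found by Euclidean distance, and the winner and its chain neighbors
    move toward each input while the learning rate and neighborhood
    radius decay over time.
    """
    rng = np.random.default_rng(seed)
    weights = rng.random((n_neurons, inputs.shape[1]))  # random initialization
    for t in range(n_iter):
        x = inputs[rng.integers(len(inputs))]            # random input exemplar
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # competition
        lr = lr0 * (1.0 - t / n_iter)                    # decreasing learning rate
        radius = max(radius0 * (1.0 - t / n_iter), 0.5)  # shrinking neighborhood
        # chain topology: neighborhood measured along the 1-D index
        dist = np.abs(np.arange(n_neurons) - bmu)
        h = np.exp(-(dist ** 2) / (2 * radius ** 2))     # Gaussian neighborhood
        weights += lr * h[:, None] * (x - weights)       # cooperative update
    return weights
```

On a uniformly sampled square, the chain of neurons spreads out to cover the input, much as in Fig.5a.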
Fig. 6. Hilbert curve initialization approach: a) initialized network; b) resulting map

While the classic SOM features random initialization of the weight vectors, the authors have demonstrated the advantages of one-dimensional SOM initialization based on a Hilbert curve (Buer, 2006; Valova, Beaton, & MacLean, 2008a), with neurons positioned in a chain following the space-filling curve (Fig.6a). Kohonen posited that one-dimensional SOMs converge in a shape resembling Peano curves. The authors followed this observation and utilized the idea in the initialization process to speed up convergence and ensure linearity of the network. Fig.6b demonstrates the resulting map. It is obvious that the map is not tangled, and the neurons that are physical neighbors also represent topologically close input vectors, unlike the map in Fig.5a, which is tangled and where topologically close neighbors do not always share physical proximity.
The algorithm can now be formalized. Each neuron in the network has a weight or reference vector ω = [x_1, x_2, ..., x_m], where x_i is an individual attribute of ω. The neurons are gradually organized over an n-dimensional input space V^n. Each input ξ = [ξ_1, ξ_2, ..., ξ_n] ∈ V^n, where ξ_i is an attribute of ξ, has the same number of attributes as the weight vector of each neuron in the network.
Once the network is initialized, the input is presented sequentially to the network. The best-matching unit in the network is determined by comparing each neuron to the current input based on Euclidean distance, with the winner being the neuron closest to the input:

c = arg min_i ||ξ − ω_i||     (1)

where c is the best-matching unit. The winning neuron and all neurons falling within the neighborhood radius for an iteration update their reference vectors to reflect the attraction to ξ. To change the reference vector of a neuron, the following equation (which is a very common SOM learning equation) is used:

ω(t + 1) = ω(t) + h(t) α(t) (ξ − ω(t))     (2)

where t is the current iteration, α(t) is a monotonically decreasing learning rate, and h(t) is the neighborhood function.

2.3 Visualizing a SOM
In low-dimensional patterns, such as one- or two-dimensional, the input and the SOM can be visualized by using positions of pixels. However, when scaling to three or five dimensions, pixels can be used for the two attributes that represent two-dimensional space, but the remaining one or three attributes would be represented by gray scale or RGB, respectively. This implies that the visualization of the data and the SOM can be expressed in x, y and color values. When dealing with complex data that is not directly visualized, or with very high-dimensional data (e.g. more than 10 dimensions), visualization becomes an issue. One of the methods used for showing the lattice-structure style network for high-dimensional data is to use a dimensionality reduction or dimensionality mapping technique. One of the simplest, and most well known, is Sammon's mapping (Eq.3) (Sammon, 1969).
E = [ 1 / Σ_{(i,j) ∈ A^2, i ≠ j} d_A(i, j) ] · Σ_{(i,j) ∈ A^2, i ≠ j} ( d_A(i, j) − d_V(w_i, w_j) )^2 / d_A(i, j)     (3)
This mapping measure effectively takes the distances between high-dimensional objects, e.g. neurons, and allows them to be plotted in two dimensions, so the researcher may visually inspect the geometric relations between neurons. Often, when the number of inputs is very high, or the patterns are non-distinct and complex, the visualization includes only the neurons rather than the input data. To form the lattice structure correctly, the connections between known neighbors should be illustrated. The neurons are often recorded in one- or two-dimensional arrays, to allow the physical neighbors of each neuron to be recorded.
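As a concrete illustration, the stress of Eq.3 can be computed directly from pairwise distances. The following sketch (names are our own) compares the neurons' reference vectors with their plotted two-dimensional positions:

```python
import numpy as np

def sammon_stress(high_d, low_d):
    """Sammon's stress between the pairwise distances of neurons in the
    original high-dimensional space (high_d: reference vectors) and their
    projected positions (low_d: 2-D coordinates).

    A value near 0 means the projection preserves the inter-neuron
    geometry well.
    """
    n = len(high_d)
    stress, norm = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_high = np.linalg.norm(high_d[i] - high_d[j])
            d_low = np.linalg.norm(low_d[i] - low_d[j])
            if d_high > 0:                       # skip coincident neurons
                stress += (d_high - d_low) ** 2 / d_high
                norm += d_high
    return stress / norm if norm > 0 else 0.0
```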
3. Growing Self-organizing Algorithms

3.1 Fritzke's growing SOM variants
Self-organizing maps, as introduced by Kohonen, are static-sized networks. The obvious disadvantage of a predetermined number of neurons is that the number is either not high enough to adequately map the input space or too high, thus leaving many neurons underutilized. A SOM, being trained without supervision, is not expected to know the input space characteristics a priori. Hence, a topology which allows the addition/removal of neurons is a logical step in the development of self-organization. Fritzke introduced three architectures in the 1990s: growing grid (GG), growing cells (GC), and growing neural gas (GNG) (Fritzke, 1992, 1993a, b, c, 1994, 1995). All three start with a minimal number of neurons and add neurons as needed. The need is based on an age parameter, while an error parameter determines which neuron will be joined by a new neighbor (GC, GNG) or a new set of neighbors (GG). In the case of GC, once a neuron with the highest error value is selected, its longest connection (or edge) is replaced by two edges and a neuron is added to facilitate their connection (Fig.7).
Fig. 7. Growing cells at: a) 3333 iterations; b) 15000 iterations
Fig. 8. Growing grid at: a) 33333 iterations; b) 100000 iterations

GG also utilizes an age parameter for the neurons. Every time a neuron wins, its age is incremented. At intervals, the neuron with the most wins is slated to receive new neighbors, and the furthest direct topological neighbor of the selected neuron is found. GG utilizes a rectangular grid topology, which is maintained during growth. If these two neurons are in the same row of neurons, a column will be added between the two, thus affecting neurons in other rows. If the selected neurons are in the same column, then a new row will be added between them, thus affecting neurons in other columns (Fig. 8). In the third Fritzke contribution, the GNG, a similar concept to GC is observed. However, the connections between the neurons are assigned the age parameter. Therefore, GNG adds and removes connections based on neuron selection. GNG is also capable of attaining configurations with multiple networks (Fig. 9). One of the major drawbacks of GNG, as well as of other growing methods, is that the number of nodes is ever increasing. This is, in part, compensated for by an extension to GNG, growing neural gas with utility (GNG-U), which features removal of neurons based on the criterion of low probability density in the underlying input space "beneath" the neuron. The downfall of these methods is that they do not exhibit incremental learning, a problem that is discussed in a later section on ESOINN.
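The growth step shared by these architectures can be illustrated with a small sketch. This is not Fritzke's exact algorithm; the midpoint placement and error halving follow the growing-cells description above, and all names are our own:

```python
import numpy as np

def grow_step(weights, edges, error):
    """One growing-cells-style insertion step (a sketch): the neuron with
    the highest accumulated error receives a new neighbor placed halfway
    along its longest edge.
    """
    q = int(np.argmax(error))                        # worst-covered neuron
    nbrs = [b for a, b in edges if a == q] + [a for a, b in edges if b == q]
    f = max(nbrs, key=lambda n: np.linalg.norm(weights[q] - weights[n]))
    new = (weights[q] + weights[f]) / 2.0            # midpoint of longest edge
    weights = np.vstack([weights, new])
    r = len(weights) - 1
    # replace the long edge q-f with two shorter edges q-r and r-f
    edges = [e for e in edges if set(e) != {q, f}] + [(q, r), (r, f)]
    error[q] /= 2.0                                  # redistribute error
    error = np.append(error, error[q])               # new neuron inherits it
    return weights, edges, error
```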
Fig. 9. Provided by (DemoGNG); at 7500 iterations (left) the map is starting to take the form of the input; at 12000 iterations (right) the map is a much better topological representation of the given input

Fritzke's growing methods can be explored by the reader interactively at the DemoGNG website (DemoGNG), provided by the Institut für Neuroinformatik.

3.2 ParaSOM
ParaSOM is an architecture developed by the authors (Valova, Szer, Gueorguieva, & Buer, 2005), which shares several attributes with the classic SOM and the growing architectures developed by Fritzke. However, it improves the effectiveness of the network and allows for different approaches to cluster analysis and classification, which will be demonstrated in later sections. The unique characteristics of the architecture include: parallel input processing, the adoption of a cover region to manage the influence area of a neuron, and network growth, which is inspired by the GC architecture. ParaSOM is designed to process the entire input space in parallel. The classic SOM presents the network with one input at a time and determines the winner to move closer to that input. With ParaSOM, however, the entire input space is presented to the network at the same time. Therefore, multiple neurons can adapt to nearby inputs independently of other neurons. This trait helps the network in recognizing patterns that it has already learned. For instance, imagine the network adapted itself to a particular pattern A, and another pattern B, very similar to A, is presented to the network. Because the neurons process the input space independently of each other, the ones that are already covering the pattern (most of them, since A is very similar to B) will not move. As a result, adapting to B is much faster, as the network only needs to learn the small differences between A and B. ParaSOM features the following parameters that are unique to the architecture. The network produces a cover matrix, which represents the approximation of an input pattern made by
the network in its current state. The cover matrix calculation is based on the cover region, which is an attribute of every neuron. This is the area surrounding the point in space where the neuron exists and is considered to be the region of space that the neuron is covering. The move vector is another neuron attribute, which indicates the amount and direction a neuron should move; it is calculated and added to the neuron's reference (weight) vector. The age parameter, which represents for how long a neuron has been well or poorly positioned, is closely related to the inertness of a neuron. The inertness is the measure of how effectively a neuron covers the input space. Both age and inertness determine when and where the network should grow or contract. ParaSOM in action is illustrated in Fig.10. The network is initialized with a minimal number of neurons, which have large cover regions (denoted by the large circles in Fig.10a). As the input space is shown to the network, the move vectors are calculated along with the other parameters gauging the performance of individual neurons (i.e. age, inertness, cover region), and the neurons are moved, new ones are added or some are removed in order to achieve the final comprehensive coverage illustrated in Fig.10c.
Fig. 10. ParaSOM in action: a) randomly initialized with a predetermined number of neurons; b) the network position at iteration 25; c) the network at iteration 500

The formalization of the algorithm begins with the cover matrix, which consists of subtracting a set of cover regions from the input space. Each such region is associated with a single neuron, and represents the contribution of the neuron to the total approximation. The job of the cover region is to weaken the signals of the inputs that are being covered, where the weaker the signal, the better the neuron is covering that region. Each cover region also maintains a radius which decreases in each epoch. The cover matrix is formulated as

C = V_m − Σ_{i=1..s} Θ_i     (4)

with the cover region calculated as

Θ_i(x_j) = f_{m_i}(x_j), if ||x_j − m_i|| < threshold; 0, else,  for 1 ≤ j ≤ k     (5)

where the modified Gaussian cover function is defined as
f_{m_i}(x) = exp( −[ (μ_1 − x_1)^2 + … + (μ_n − x_n)^2 ] / ρ^2 )     (6)

for radius ρ and μ being an attribute of the input vector. The cover function is utilized in the computation of a cover value, which indicates how well the neuron is covering the inputs and is calculated as

c_i = C_i · f_{m_i}     (7)

where the local cover matrix is represented by

C_i = C + Θ_i     (8)
The inertness is an indicator as to whether the neuron should be moved, removed, or is in a place where new neurons ought to be added nearby. The higher the inertness, the better the coverage and hence the position of the neuron. The inertness is given by

ι_i = c_i / c_max     (9)

where the maximum cover value is calculated by

c_max = ∫_{V_m} f_{m_i}(x) dx     (10)
The network utilizes the inertness to determine whether and where to grow/shrink. A high inertness indicates that the neuron is well positioned and should not move (or move very little), while a low inertness indicates poor positioning and greater neuron movement. Inertness is one of two components that dictate network growth, with age being the second. Each neuron has an age attribute. When a neuron is well positioned, as determined by a high-inertness threshold, its age is incremented. The same is true with a poorly positioned neuron having its age decremented based on a low-inertness threshold. When a neuron is well (or poorly) positioned for a sufficient period of time, it becomes a candidate to have a neuron added as a neighbor, or to be removed from the network, respectively. Finally, the move vector, which indicates the amount and direction a neuron should move, is calculated and added to the neuron's reference vector. The attractions also affect immediate neighbors of the neuron, but to a lesser degree, where the amount of movement is proportional to the distance between the neuron and the neighbor. The move vector v_i = (v_i1, v_i2, …, v_in) consists of components

v_ik = ∫_{V_m} C_i f_{m_i}(x_k) dx.
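Under the simplifying assumption that the integrals above are approximated by sums over discrete input points, the inertness computation can be sketched as follows (the function names and the discretization are our own, not the authors' implementation):

```python
import numpy as np

def inertness(neuron, inputs, covered_signal, radius):
    """Sketch of the ParaSOM inertness (Eqs. 6-10) on a discrete input set.

    `covered_signal` stands in for the local cover matrix C_i evaluated at
    the input points: the residual signal strength after the other
    neurons' cover regions are subtracted. Values near 1 indicate a
    well-positioned neuron; values near 0, a poorly positioned one.
    """
    # modified Gaussian cover function (Eq. 6) at each input point
    f = np.exp(-np.sum((inputs - neuron) ** 2, axis=1) / radius ** 2)
    c = np.sum(covered_signal * f)   # cover value (Eq. 7), sum replaces integral
    c_max = np.sum(f)                # best achievable cover value (Eq. 10)
    return c / c_max                 # inertness (Eq. 9)
```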
The authors have explored the effect a Hilbert initialization has on ParaSOM. As with the classic SOM, this network is also positively influenced by this mode of initialization. Fig.11 shows some results. Fig.11a features the network at iteration 0, using the same input topology as Fig.10. Fig.11b illustrates the intermediate result at iteration 25, and Fig.11c illustrates the final converged state of ParaSOM at iteration 500, same as the iteration in
Fig.10c. The last point is made to focus the reader's attention on the tangled state of the randomly initialized network in Fig.10c. The Hilbert initialization, at the same iteration, features an untangled, well-organized network.
Fig. 11. ParaSOM in action: a) Hilbert initialized with a predetermined number of neurons; b) the network position at iteration 25; c) the network at iteration 500

Other investigations with ParaSOM include parallelization (Hammond, MacLean, & Valova, 2006) via message-passing interface (MPI) among 4 worker and 1 director machines (Valova et al., 2009), controlling the parameters of ParaSOM with genetic algorithms (MacLean & Valova, 2007), and, more recently, testing and network adjustment for multidimensional input.

3.3 ESOINN
The Enhanced Self-Organizing Incremental Neural Network (ESOINN) (Furao, Ogura, & Hasegawa, 2007) is a growing architecture partially inspired by GNG and GNG-U. According to the authors of ESOINN, it addresses the stability-plasticity dilemma (Carpenter & Grossberg, 1988) by providing the ability to retain knowledge of patterns it has already learned (stability), while still being able to adapt to, and learn, new patterns it has yet to be exposed to (plasticity). ESOINN identifies clusters during execution by joining together subclusters that form within a larger cluster. Thus, overlap among multiple clusters can be identified and effectively separated. Typically, growing architectures grow by adding neurons in sufficiently dense areas of the input space. In ESOINN, neurons are added when the current input is adequately distant from the closest neuron. A new neuron is added to the network containing the same (not merely similar) reference vector as the input. ESOINN decides on adding a neuron based on a similarity threshold. This is a dynamic distance measure calculated from the distances to the neuron's neighbors or, if no neighbors are available, to all other neurons in the network. When an input is presented to the network, the first and second winners, or best matching unit (BMU) and second matching unit (2BMU), are determined.
The network then decides if a connection between the winner and second winner should be created, if one does not already exist. In ESOINN, knowledge of the neuron density in a given area of the input space is critical to performing tasks such as creating connections between neurons and detecting overlap in clusters. By being able to measure the density, the network can better determine whether a particular section of the input space is part of a single cluster or of an overlapped section. After detection of overlapped areas, connections between neurons of different subclasses are
removed. This separates the subclasses belonging to different composite classes. This process is performed at regular intervals, i.e. whenever the number of inputs presented to the network is evenly divisible by a predetermined integer value.
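The similarity threshold that governs neuron insertion, as described above, can be sketched directly (a sketch with our own naming, not the authors' code):

```python
import numpy as np

def similarity_threshold(idx, weights, neighbors):
    """Dynamic similarity threshold for neuron `idx` as described above:
    the farthest-neighbor distance if the neuron has neighbors, otherwise
    the distance to the nearest other neuron in the network.
    """
    w = weights[idx]
    if neighbors[idx]:
        return max(np.linalg.norm(w - weights[j]) for j in neighbors[idx])
    others = [np.linalg.norm(w - weights[j])
              for j in range(len(weights)) if j != idx]
    return min(others)
```

A new neuron would then be created whenever the current input lies farther from the BMU (or 2BMU) than that unit's threshold.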
Fig. 12. Neurons in {A, B, C, D} are all connected by paths and therefore are in the same cluster. The same is true of {E, F, G} and {H, I}. Conversely, A and E have no path to each other, and therefore are not in the same class

Connections in ESOINN are used to identify subclasses. To aid in this identification, they are created and removed from neurons as new data is presented to the network. When a connection is removed between two neurons, a boundary is identified between the different classes that each neuron is a part of. The paths created by connections are also the way that neurons are classified at the end of ESOINN execution. Any given neuron i, and all other neurons that are connected to i by a path, are considered to be in the same class. Neurons that cannot be connected by a path are said to be in different classes (Fig.12). Connections in ESOINN are created when there is no existing connection between a winner and second winner. In this case, the newly created connection has an age attribute that is set to zero. If a connection already exists between the winner and second winner, the age of that connection is reset to zero. In either scenario, the ages of all existing connections between the winner and its neighbors are increased by one (except the connection between it and the second winner). Deletion of connections occurs when the ESOINN algorithm determines that the current winner and second winner are in different subclasses and those subclasses should not be merged. ESOINN may also add neurons that represent noisy input, which is likely to be distant from relevant patterns; such an input falls outside the similarity threshold of the winner and second winner, and a new neuron is created. These neurons are undesirable because they are generally placed in low-density areas and can skew the cluster identification. ESOINN therefore removes neurons with two or fewer neighbors, utilizing an average density value.
When a neuron is deleted, all connections associated with it are also removed. This process also occurs after a predetermined number of inputs has been presented. The connection removal and addition features of ESOINN make it very powerful at finding distinct patterns in a wide array of problem domains. ESOINN is a major step forward in unsupervised learning. Since ESOINN addresses the stability-plasticity dilemma (continued learning with no forgetting), it is an algorithm that can be used for varying types of data sets, including overlapping Gaussian distributions.

3.4 TurSOM
TurSOM (an amalgamation of Turing and SOM) is a new variant of the Kohonen Self-organizing Map, introduced by the authors (Beaton, 2008; Beaton, Valova, & MacLean, 2009a, b, c). TurSOM's primary contribution is the elimination of post-processing techniques
for clustering neurons. Its features are inspired in part by Turing's work on unorganized machines (Turing, 1948). Turing's unorganized machines (TUM) represent early connectionist networks, meant to model the (re)organization capability of the human cortex. In Kohonen's SOM algorithm, the neurons are the components of self-organization, whereas in Turing's idea, the connections also fulfil that role. In TurSOM, we capitalize on both methods of self-organization. While the neurons of TurSOM adhere to the same learning rules and criteria as the standard SOM, the major differentiating feature of TurSOM is the ability to reorganize connections between neurons. Reorganization includes the removal, addition, or exchanging of connections between neurons. These novelties make TurSOM capable of identifying unique regions of input space (clustering) during execution (on-the-fly), as demonstrated in Fig.13. The clustering behavior is achieved by allowing separate networks to simultaneously execute in a single input space. As TurSOM progresses, connections may be removed or exchanged, causing a network to split into two networks, two into three or four, and so on. Additionally, when separate networks get closer to one another they may join to form a single network.
Fig. 13. TurSOM on connection reorganization: a) TurSOM at initialization; b) TurSOM at 250 iterations, exemplary of TurSOM reorganizing connections; c) TurSOM at 350 iterations, exemplary of TurSOM identifying unique patterns

In order for TurSOM to achieve the behavior it exhibits, several new mechanisms are introduced to the Kohonen SOM. In SOM algorithms, there is a neuron learning rate. The point of the neuron learning rate is to decrease the movement of winning neurons (and subsequently their neighbors) as time progresses. As a SOM adapts to input, it should require less drastic organization, i.e., smaller movements. Similarly, TurSOM introduces a connection learning rate (CLR), which regulates the reorganization of connections as TurSOM progresses. The CLR is a global value controlling the maximum allowable distance between two neurons. If the distance between any two neurons exceeds the CLR, they must disconnect. CLR is computed as follows:

CLR = Q3 + (i · (Q3 − Q1))     (11)
The CLR formula is derived from the upper-outlier formula for box-plots (a statistical technique of measuring distribution by analyzing four quartiles). In CLR, the x in Qx represents which quartile it is, and i is a value that increments as time progresses. The data
being measured (for the quartiles) is the length of all connections present in the current input space. The CLR is instrumental to the reorganization process in TurSOM as it effectively decides which connections are unnecessary.
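Eq. (11) is straightforward to compute from the current connection lengths. The sketch below assumes numpy's default (linearly interpolated) percentiles, which may differ slightly from the quartile convention used by the authors:

```python
import numpy as np

def connection_learning_rate(conn_lengths, i):
    """Connection learning rate (Eq. 11): the box-plot upper-outlier bound
    over all current connection lengths, scaled by the increment i.
    Connections longer than this value are candidates for disconnection.
    """
    q1, q3 = np.percentile(conn_lengths, [25, 75])  # first and third quartiles
    return q3 + i * (q3 - q1)                       # Q3 + i * IQR
```

As i grows over time, the bound loosens, so fewer and fewer connections are removed in later iterations.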
Fig. 14. CLR in TurSOM: a) the square pattern with random initialization; b) the first 50 iterations of TurSOM, where its behavior is the same as a SOM; c) CLR has been active for 100 iterations and a rapid, sudden reorganization of connections is evident

Fig.14a, b and c demonstrate a simple solid pattern for the first 150 iterations of TurSOM. The CLR determines which connections will be eliminated. The connections that are not considered optimal are removed and, as evident from the figure, the removed connections were negatively impacting the network. So far, we have described how TurSOM may separate into different networks, but we have not addressed how two networks can rejoin into one. The neuron responsibility radius (NRR), inspired by ParaSOM's neuron cover region (addressed in section 3.2), becomes active in TurSOM when two neurons disconnect from one another. However, there is one requirement for networks that disconnect: they must be of size three or greater. Empirically, it has been shown (in TurSOM) that networks smaller than three (i.e. two neurons or a single neuron) become "pushed aside" by other neurons that are active in a network. A neuron with an active radius still has one neighbor. The neuron responsibility radius is effectively a "feeler", actively searching for other "free" neurons with similar features. To calculate the NRR, the following formulae are used when the dimensionality ε of the input space is even:
V_e = (π^(n/2) · r_e^n) / (n/2)!   (12)

r_e = ( V · (n/2)! / π^(n/2) )^(1/n)   (13)

If the dimensionality is odd:

V_o = (2^n · π^((n−1)/2) · ((n−1)/2)! · r_o^n) / n!   (14)

r_o = ( V · n! / (2^n · π^((n−1)/2) · ((n−1)/2)!) )^(1/n)   (15)
Machine Learning
where n represents the number of dimensions, and V represents the number of inputs a neuron is responsible for; V is calculated by dividing the number of inputs by the number of neurons. To follow along with the example provided in the previous section on the connection learning rate, the following figures (Fig. 15) demonstrate the remaining iterations of TurSOM, where the effects of the NRR are seen.
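Both parity cases collapse into the gamma-function form of the n-ball volume. The sketch below is an illustration under the assumption that the NRR is the radius of an n-ball of hypervolume V; Γ(n/2 + 1) generalizes the even/odd factorial forms.

```python
import math

def responsibility_radius(volume, n):
    """Radius of an n-ball with the given hypervolume.

    Inverts V = pi^(n/2) * r^n / Gamma(n/2 + 1); the gamma function
    covers both the even- and odd-dimensional factorial forms.
    """
    return (volume * math.gamma(n / 2 + 1) / math.pi ** (n / 2)) ** (1 / n)
```

For example, a disk of area π has radius 1, and a 3-ball of volume 4π/3 has radius 1.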
Fig. 15. The effects of NRR: a) the reconnection process, which is governed by the NRR; b) the single pattern in an optimal mapping; note the Peano-like curve of the network

As demonstrated in Fig. 15, the NRR governs the reconnection process, working in cooperation with the CLR (which governs disconnection). TurSOM also provides for a neuron to sense nearby neurons that are better suited to be connected neighbors than its current neighbors. Simply stated, this is a check that neurons perform using the distances to their neighbors and their knowledge of other nearby neurons in the network. This process is considered part of the reorganization process. Similar to Fritzke's growing grid algorithm, TurSOM has a growth mechanism. TurSOM begins as a one-dimensional chain which, upon convergence, will spontaneously grow (SG) into a two-dimensional grid. The term convergence is used loosely here to mean a network reaching a fairly stable representation of the input, where further computation would not benefit the network significantly. During the spontaneous growth phase, connection reorganization (which implies network splitting and rejoining) is turned off. Presumably, at this point, the one-dimensional networks have settled into satisfactory positions and do not require further adaptation. The growing process is demonstrated in Fig. 16: Fig. 16a illustrates the input pattern; the converged one-dimensional SOM is shown in Fig. 16b; finally, the SG is demonstrated in Fig. 16c, where it is evident that each one-dimensional network grows independently.
Fig. 16. TurSOM in action: a) Input space with 4 distinct patterns, which are five-dimensional data (X,Y, R, G, B); b) TurSOM in one-dimensional form mapping each of the distinct patterns; c) TurSOM with SG for a better representation of the patterns
TurSOM's growing capability is an instrumental part of the network's performance. Often, one-dimensional networks do not represent the underlying data well enough; two-dimensional networks offer a better representation, or a better resolution. Finally, the TurSOM algorithm can be summarized in the following steps:
a) Select input
b) Find best-matching unit (BMU)
c) Update BMU and BMU's neighbors
   1) Record the distances between all connected neighbors
d) Check lengths of all connections (step c.1)
   1) If a connection is too long:
      - Disconnect neurons
      - Update the connection learning rate
      - Activate the neuron responsibility radius
e) Check neuron physical locations
   1) If neuron A is a neighbor of B, but not of C (which is a neighbor of B), and A is closer to C than B is, switch connections, thereby changing neighbors
f) Check each neuron responsibility radius for proximity to other neurons
   1) Reconnect neurons that have overlapping NRRs
g) If TurSOM has reached convergence:
   1) Spontaneous growth

3.5 Modular Network Self-organizing Map
While not a growing architecture, a very recent SOM architecture called the modular network self-organizing map (mnSOM) (Tokunaga & Furukawa, 2009) is mentioned here. This architecture is a hybrid approach to neural network computing. The mnSOM consists of a lattice structure like that of a typical SOM, and its neurons behave in a similar self-organizing fashion. However, each neuron is composed of, or "filled with", a feed-forward network, such as a multi-layer perceptron (MLP). The major difference between SOMs and feed-forward networks is that SOMs learn the topology or structure of data, whereas feed-forward architectures learn functions of the input. The effective outcome of this network is that it self-organizes function space. 
That is to say, when presented with various types of input patterns where functional knowledge might be very important, mnSOM is able to topologically order functions based on similarity.
4. Methods for SOM analysis
Self-organizing maps are powerful analytical tools, and visualization is often employed to analyze the resulting topological map. However, networks sometimes do not represent optimal mappings, which can skew the understanding, or even the representation, of the data being visualized. In this section we provide methods for analyzing the quality of a SOM network. Commonly, these techniques are used post-execution, in order to analyze how well the SOM converged to the given data. A survey of SOM quality measures can be found in (Pölzlbauer, 2004).
4.1 Quantization Error
Quantization error (QE) is a simple measure used in other fields, including clustering and vector quantization, as a technique for verifying that inputs are within their proper (or best-suited) clusters. As SOMs perform clustering, QE can be utilized. However, one major drawback is that QE does not address the quality of organization of the network; rather, it measures neuron placement relative to inputs. Quantization error is computed as the average distance from each input to its associated neuron (its best-matching unit). One point to note about this measure is that when the number of neurons is decreased or increased for the same input space, the quantization error increases or decreases respectively: effectively, more neurons mean a smaller error, and vice versa for fewer neurons. Short pseudocode is given below:

uniquely number all neurons
for each input
    find best-matching unit (BMU)
    dist[BMU] = dist[BMU] + distance from input to BMU
    count[BMU] = count[BMU] + 1
end
for each neuron x
    error[x] = dist[x] / count[x]
end

4.2 Topographic Error
Topographic error (TE) measures the quality of organization of a SOM, providing information on how well organized neurons are with respect to other neurons. The measure checks whether neurons are correctly identified as topological neighbors with respect to the inputs. Conceptually, TE attempts to quantify how twisted, or tangled, a SOM network is. An example of a 2-dimensional pattern with a 1-dimensional map is shown in Fig. 17. Topographic error is a value between 0 and 1, where 0 indicates no topographic error (no tangling) and 1 indicates maximum topographic error (complete tangling).
Fig. 17. This pattern shows (in the bottom right) a 1-dimensional network that intersects, or crosses, connections. Effectively, this network is tangled, as there are more appropriate neighbors for some of the neurons.

Topographic error is computed as follows:

error = 0
for each data sample
    find best-matching unit (BMU)
    find second best-matching unit (2BMU)
    if BMU is not a lattice neighbor of 2BMU
        error = error + 1
    end
end
error = error / number of data samples
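The two pseudocode fragments above can be made concrete in Python. This is an illustrative sketch: the data layout, with inputs and weights as coordinate tuples and a lattice-neighbor map, is an assumption.

```python
import math

def quantization_error(inputs, weights):
    """Average distance from each input to its best-matching unit (BMU)."""
    return sum(min(math.dist(x, w) for w in weights) for x in inputs) / len(inputs)

def topographic_error(inputs, weights, neighbors):
    """Fraction of inputs whose BMU and second-best unit are not lattice neighbors.

    `neighbors[i]` is the set of lattice-neighbor indices of neuron i.
    """
    errors = 0
    for x in inputs:
        # rank neurons by distance to the input; order[0] is the BMU
        order = sorted(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
        if order[1] not in neighbors[order[0]]:
            errors += 1
    return errors / len(inputs)
```

For a straight 1-D chain the TE is 0; deliberately breaking the neighbor map drives it to 1.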
4.3 Topographic Product
Topographic product (TP), introduced by (Bauer & Pawelzik, 1992), indicates whether a SOM is too large or too small for the input it represents; it measures the suitability and size appropriateness of a map for a given set of inputs. Two primary quantities are used in the computation of TP, Qx and Px. Qx is a ratio of distances found in the input and output spaces, and Px is a multiplicative normalization of its respective Qx value. Below are the initial steps for TP:
Step 1: For the weight w_j (in input space) of neuron j, find the kth:
  a. closest data point in input space, at distance d_1^V
  b. closest neuron in output space, at distance d_2^V
Step 2: For the neuron j, find the kth:
  a. closest neuron in output space, at distance d_1^A
  b. closest data point in input space, at distance d_2^A
Step 3: Create two ratios:
  Q1(j,k) = d_1^V / d_2^V
  Q2(j,k) = d_1^A / d_2^A
where k represents an iterative value. When using k in the SOM, the iteration occurs through all other neurons besides j (steps 1a and 2a). Similarly, when calculating Q1, the iteration occurs through all inputs, excluding wj if wj is equal to one of the inputs (steps 1b and 2b). These two values, Q1 and Q2, would optimally be equal to 1 if and only if neighbors are correctly preserved and suitably organized. However, Bauer and Pawelzik point out that this is far too sensitive, so a normalization of Q1 and Q2 is required (pseudocode provided):

Px(j,k) = 1
for each neighbor of neuron j, represented by k
    Px(j,k) = Px(j,k) * Qx(j,k)
end
Px(j,k) = power(Px(j,k), (1/k))

The x of Px and Qx is either 1 or 2, as defined in the previous steps. At this point, P1 ≥ 1 and P2 ≤ 1. P1 is a value created from all the data in input space, based on the weights of all neurons; if there is a 1-to-1 mapping of neurons to inputs, this value should be 1. Additionally, P2 will be less than or equal to one, because it is a representation of neighboring neurons in the output space. This occurs because the denominator of Q2 comes from input-space distances and the numerator comes from neuron distances. However, having two numbers to explain a mapping is not desirable, so Bauer and Pawelzik introduce P3 (provided below in pseudocode):

P3(j,k) = 1
for each neighbor of neuron j, as k
    P3(j,k) = P3(j,k) * (P1(j,k) * P2(j,k))
end
P3(j,k) = power(P3(j,k), (1/2k))

P3 is normalized. The relationship of P1 and P2 is inverse, giving way to these rules:
1: P3 > 1 means the map is too large, P1 > (1/P2)
2: P3 < 1 means the map is too small, P1 < (1/P2)
The final step in the topographic product is computing an average of the values already obtained, over all neurons in the SOM:

P = 0
for each neuron, as j
    for each neighbor, as k
        P = P + log(P3(j,k))
    end
end
P = P / (N*(N-1)) // where N is the number of neurons

Since P averages log(P3), any value of P that deviates from 0 should be of concern: P > 0 indicates a map that is too large and P < 0 one that is too small, mirroring the rules for P3. The formulaic view of P3 and P is provided by Eqs. (16) and (17).
P3(j,k) = [ ∏_{l=1}^{k} ( d^V(w_j, w_{n_l^A(j)}) / d^V(w_j, w_{n_l^V(j)}) · d^A(j, n_l^A(j)) / d^A(j, n_l^V(j)) ) ]^{1/(2k)}   (16)

P = (1 / (N(N−1))) ∑_{j=1}^{N} ∑_{k=1}^{N−1} log(P3(j,k))   (17)

where n_l^A(j) and n_l^V(j) denote the lth nearest neighbor of neuron j in output space A and input space V, respectively.
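A direct transcription of Eqs. (16) and (17) can be sketched as follows. This is illustrative only: it assumes neuron weights are positions in input space, lattice coordinates are positions in output space, and both distances are Euclidean.

```python
import math

def topographic_product(weights, positions):
    """P from Eq. (17): average of log P3(j,k) over all neurons j and orders k.

    `weights` are neuron positions in input space V; `positions` are their
    lattice coordinates in output space A.
    """
    N = len(weights)
    total = 0.0
    for j in range(N):
        others = [i for i in range(N) if i != j]
        # kth nearest neighbors of j in output (A) and input (V) space
        by_A = sorted(others, key=lambda i: math.dist(positions[j], positions[i]))
        by_V = sorted(others, key=lambda i: math.dist(weights[j], weights[i]))
        p3 = 1.0
        for k in range(1, N):
            nA, nV = by_A[k - 1], by_V[k - 1]
            q1 = math.dist(weights[j], weights[nA]) / math.dist(weights[j], weights[nV])
            q2 = math.dist(positions[j], positions[nA]) / math.dist(positions[j], positions[nV])
            p3 *= q1 * q2                       # running product of Q1*Q2 up to k
            total += math.log(p3 ** (1 / (2 * k)))  # log P3(j,k)
    return total / (N * (N - 1))
```

A perfectly ordered 1-D map yields P = 0; scrambling the weights relative to the lattice moves P away from 0.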
4.4 Other Measures
We have presented three SOM measures that evaluate fundamental aspects of SOM quality, namely correct neuron-to-input positioning (QE), network organization quality (TE), and suitability of map size (TP). However, there are several measures beyond these that attempt to combine these fundamental aspects or measure other characteristics of SOMs. Some of these include trustworthiness and neighborhood preservation (Venna & Kaski, 2001), which aim to measure data projection relations by comparing input data relations to output data relations, and the topographic function (Villmann et al., 2007), a measure which accounts for network structure and neuron placement.
5. Pattern identification and clustering
Over the years, many methods of analyzing the patterns of the neurons of SOMs have been introduced. One of the simplest is the gray-scale clustering presented by Kohonen in his book on the poverty-map data set (Kohonen, 1995). Kohonen's example colors the distance between nodes a shade of light gray if the nodes are close, or dark gray if the nodes are far apart. However, visual methods leave interpretation up to the reader. In this section
we present two methods of analyzing and identifying patterns exhibited by the neuron configurations of SOMs. These are methods for post-convergence analysis. When a SOM converges, it is not always necessary to perform post-processing techniques, especially in lower dimensionality. At the time of convergence, what we do know is that each neuron has found a suitable region of space, where it represents a given number of inputs. Exactly which inputs is not always clear unless another technique is used (one technique that maps inputs to neurons is QE). Additionally, there may be relationships between neurons. This section explains methods of measuring similarity and the relationships between neurons.

5.1 PCA
Principal components analysis (PCA) is a well-established statistical technique used on high-dimensional data in a variety of fields. The primary goals of PCA are dimensionality reduction and explanation of the covariance (or correlation) among variables. Effectively, PCA provides linearly separable groups, or clusters, within high-dimensional data along a given dimension (variable). Additionally, the principal components computed by PCA can identify which variables to focus on, i.e., which variables account for the most variance; variables are determined to be unnecessary when they do not explain much variance. For a detailed explanation of PCA and how to implement it, see (Smith, 2002; Martinez & Martinez, 2005). PCA can be used as a pre-processor (Kirt, Vainik & Võhandu, 2007; Sommer & Golz, 2001) or post-processor (Kumar, Rai & Kumar, 2005; Lee & Singh, 2004) for SOM. Additionally, a SOM has been created that combines the capabilities of both PCA and SOM (López-Rubio, Muñoz-Pérez & Gómez-Ruiz, 2004). When analyzing a SOM for potential clusters, understanding the relationship among neurons usually presents a great challenge. 
This analysis can become difficult when analyzing a converged map with very few (small network) or very many (large network) neurons. Additionally, it may be more useful to ignore certain variables prior to executing a SOM on a data set. This is where PCA becomes a very useful tool. It is important to note that PCA is a linearly separable, unsupervised technique: effectively, a vector is drawn from the origin to a point in space, and the groups on either side of it are determined to be significantly distinct (based on a given variable or dimension). SOM, on the other hand, is non-linear, and each neuron can be thought of as a centroid in the k-means clustering algorithm (MacQueen, 1967). Neurons become responsible for the inputs they are closest to, which may form a spheroid or even a non-uniform shape. When PCA is performed prior to executing a SOM on a data set, it determines which variables, or dimensions, are most important for the SOM, and the neurons in the SOM will then have fewer weights than the original data set. When PCA is performed after a SOM has executed, it determines which variables in the weights of the SOM are most important. This helps explain which neurons are more similar than others, in contrast to other methods such as distance measures and coloring schemes. In summary, PCA helps eliminate attributes that are largely unnecessary.
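In two dimensions the PCA computation has a closed form. The sketch below is illustrative only (stdlib Python, not tied to any of the cited implementations); it returns the fraction of variance explained by the first principal component and the component's orientation.

```python
import math

def pca_2d(points):
    """Closed-form eigendecomposition of the 2x2 covariance matrix.

    Returns (explained_ratio, angle): the share of total variance carried
    by the leading principal component, and that component's orientation.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    root = math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)
    l1 = (sxx + syy + root) / 2                   # leading eigenvalue
    l2 = (sxx + syy - root) / 2
    ratio = l1 / (l1 + l2)                        # variance explained by PC1
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)  # orientation of PC1
    return ratio, angle
```

Points lying exactly on the line y = x give a ratio of 1.0 (all variance on one axis) at 45 degrees, i.e., the second dimension is "largely unnecessary" in the sense above.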
5.2 ParaSOM modifications
The ParaSOM architecture takes a unique approach to cluster identification. It relies heavily on the features of the network and the behavior it exhibits because of those features (Valova, MacLean & Beaton, 2008b). When the network is well adapted and near the end of execution, the cover regions of the neurons are generally small, covering their respective sections of the input space precisely. Therefore, in dense regions the neurons should be plentiful and in very close proximity. A key property of the network at convergence is that the distances between intra-cluster neurons will likely be much smaller than the distances between inter-cluster neurons. This is the central concept that ParaSOM's cluster identification algorithm takes advantage of. Once convergence takes place, the network performs clustering in two phases. The first phase uses statistical analysis to identify initial clusters. The second phase employs single and complete linkage to combine any clusters that may have been separated but are in close enough proximity to be considered a single cluster. In order to determine the minimal distance between neurons in different clusters, we make use of the mean of the distances between neighboring neurons, x̄, as well as their standard deviation, σ. The standard deviation determines how far from the mean is considered acceptable for a neighbor to be labeled in the same cluster as the neuron in question. The overwhelming majority of the connections between neighbors will be within clusters; therefore, the number of standard deviations from the mean connection distance within which a certain majority of these connections lies is a good indicator of an adequate distance threshold. To discover the initial clusters, the mean of the distances between neighbors is determined through iteration over all neurons. 
Following that, the standard deviation of the distances between neighboring neurons is computed via Eq. (18).
σ = √( (1/n) ∑_{i=1}^{n} (d_i − x̄)² )   (18)
where di is the distance between a pair of neighboring neurons. Further, we determine how many neuron pairs lie within m standard deviations of x̄, for each m = 1, …, n, based on some threshold α, where 0.0 < α < 1.0.

Compound queries of the form “(px > 1e10) & (x > 0) & (y > 5)” can be used to select particles with interesting characteristics in multi-dimensional phase space. FastBit can also track selected particles across time steps by issuing queries of the form “id in (5, 10, 31)”, which pulls out data for the three specified particles. FastBit indices are stored within H5Part files and accessed using a custom C++ interface called HDF5-FastQuery [HDF5-FastQuery (2009)]. All our analysis software is written in R; therefore, in order to use FastBit's functionality within the R runtime, we extended the RcppTemplate package [Samperi (2006)] to make function calls to the HDF5-FastQuery interface. This lets us load subsets of particle data at least 6.6 times faster than the R package hdf5, a considerable improvement over existing HDF5 packages in R, which often constrain the user to load the entire HDF5 file or complete groups within the file. In addition to efficient data access, our framework implements data reduction by using physical domain knowledge, data analysis algorithms and clustering techniques, as described in the following sections.
2.3 Proposed framework
Unlike image data, which are composed of pixel values at regular positions, laser wakefield simulations contain particles irregularly spaced in all dimensions. Scattered data is a common problem in scientific data mining when trying to extract patterns in large datasets, particularly because the physical phenomenon evolves over time [Kamath (2009)]. Data reduction of large datasets is often mandatory before applying clustering algorithms, due to their inherent combinatorial complexity. Figure 2 shows our framework for the detection of accelerated electron bunches in LWFA simulations; the algorithms for data partitioning and pattern detection are detailed in the next sections.
Fig. 2. Framework for data reduction and beam detection, applied to each time step of data sets generated by laser wakefield acceleration (LWFA) simulations: B1, select time steps for px > 10^10; B2, calculate kernel density f(x, y, px); B3, identify beam candidates; B4, cluster particles using mixture models, EM and sampling; B5, evaluate compactness of electron bunch.

The first step (B1) selects particles and time steps relevant for inspection from an n-time-step simulation dataset, discarding those particles unlikely to belong to the physical phenomenon of interest. The pipeline then obtains the particle distribution (B2), using kernels to calculate an estimate f(x, y, px) of the probability density function. Next, we find the parameters (x, y, px) for which f is maximum, selecting a subset of particles that may correspond to trapped bunches of electrons (B3). The following step (B4) groups the simulation particles according to normal mixture models, applying maximum likelihood estimation and Bayesian criteria; the goal is to identify the most likely model and number of clusters to refine the previous beam candidate particles. The simulation contains several time steps and a varying number of particles per time step; we combine the results of beam detection across time steps and calculate statistics of the time series by applying moving averages (B5) to characterize the electron bunch.

2.3.1 High energy particles and densities (B1-B2)
Block B1 performs particle selection given a threshold on the momentum in the x-direction, based on the fact that the bunch of electrons of interest should be observed near px = 10^11. We can then eliminate the low-energy particles for which px < 10^10. The expected wake oscillation is up to px = 10^9; therefore this threshold excludes particles of the background plasma, while
Automated detection and analysis of particle beams in laser-plasma accelerator simulations
(a) High quality beam (b) Low quality beam
Fig. 3. Kernel density estimation of Dataset A (left) and D (right) at a single time step, showing high-density bunches (red): 3D representation of f(x, y, px) with heatmap colors representing the particle density.

including all particles that could be in an accelerated bunch. The precise choice of the threshold does not affect the accuracy of the result, and a lower threshold could be used at higher computational cost [Ushizima et al. (2008)]. After eliminating low-momentum particles, some time steps (in general the first few time steps of the simulation) may not include a relevant number of particles for inspection. We calculate the simulation's average number of particles above the px threshold (µs) to determine the "representative" time steps ti for which the number of particles is greater than µs, yielding a smaller subset of time steps. We observed that this constraint eliminates initial time steps but maintains consecutive time steps throughout the time series from tµs, the first time step for which the number of particles is greater than µs. Again, this threshold can be adjusted to lower values. It is necessary to compute the density of the particles given the (x, y, px) parameters of the particles in each time step. The most widely used nonparametric density estimator is the histogram, whose main disadvantage is its sensitivity to the placement of the bin edges, a problem not shared by kernel density estimators, as described in Wand & Jones (1995). Kernel density estimators are hence a valuable tool to identify subgroups of samples with inhomogeneous behavior and to recover the underlying structure of the dataset while minimizing binning artifacts. The PDF estimate also depends on the number of particles and a set of smoothing parameters called the bandwidth [Weissbach & Gefeller (2009)]. 
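The idea behind a kernel density estimator can be sketched in one dimension. This is an illustration only: the chapter's estimator is multivariate over (x, y, px) and relies on the bandwidth selection of the cited packages.

```python
import math

def kde_1d(samples, grid, bandwidth):
    """Gaussian kernel density estimate evaluated at each point of `grid`.

    f_hat(x) = (1 / (n*h)) * sum_i K((x - x_i) / h) with a standard normal
    kernel K; the same construction extends to (x, y, px) via product kernels.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    out = []
    for x in grid:
        s = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in samples)
        out.append(norm * s)
    return out
```

Unlike a histogram, the estimate varies smoothly with x and carries no bin-edge placement artifacts; the bandwidth h plays the role the bin width would.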
We estimate the probability density function (PDF) f(x, y, px) for time steps t ∈ [tµs, T] in B2, where T is the original number of time steps in the simulation, before extracting beam
Fig. 4. Projections of beam candidate region detection (block B3) from a time step in dataset D (top) and 2D particle density estimations (bottom) to confirm the compactness of the selected particles.

candidate regions. An estimate of the PDF is calculated using kernel density estimation [Feng & Tierney (2009)], defined on a grid with spacing ∆x = 0.5µm, ∆y = 0.5µm and ∆px ≈ 10^9. These parameters are selected based on the physical expectation that the electron beam size is 2µm and the momentum spread approximately 10^10 [Geddes (2005); Geddes et al. (2004); Pukhov & ter Vehn (2002); Tsung et al. (2004)]. This PDF is used later for retrieval of the maximum value and the first adjacent bins. Figure 3 shows 3D densities resulting from the calculated multivariate kernel density estimators and illustrates the concepts of high- and low-quality beams. Notice that Fig. 3(a) presents a region concentrated in (x, y, px) with higher values of px, in comparison with Fig. 3(b), which has scattered (red) groups with lower values of px, indicating a low-quality beam. Next, we propose a method to detect these groups of particles, independently of the range of energies they present.

2.3.2 Deriving maximum peaks (B3)
The task in B3 is to find particles at maximum values of f and in their immediate vicinity, to obtain electron bunches that are compact in space and have limited dispersion in momentum, as emphasized in red in Figure 4. These criteria determine the ability to characterize the quality of particle beams, which depends on the grouping of electrons in terms of their spatial parameters as well as their momentum in the longitudinal (x) and transverse (y) directions. The binning used to calculate f may interfere with the beam quality descriptors if only the absolute maximum of the PDF is taken into account; e.g., the bins may split a maximum peak into parts if the binning is too small to contain the particles of interest. To prevent this undesirable effect, we adopt a tolerance parameter to select compact bunches and extract more than one
Fig. 5. Comparison of particle selection with/without MVEE: extracting the orientation and the axes of an enclosing ellipse from (a) produces (b), increasing the number of particles from 173 to 263. Colors indicate the density of particles, using only the (x, y)-coordinates, and black dots show particles that potentially belong to the beam, according to the different methods.

maximum (beam candidate region) per time step. In addition, this is a way of accruing more samples and detecting secondary beams when these are almost as prominent as the primary beam, which is associated with the maximum of f. While searching for values approximately equal to max(f), we keep not only the maximum but all bins where f ≥ u · max(f), where u is an uncertainty or tolerance parameter, here empirically set to 0.85. While this value enables the detection of the main and the secondary beams (when present), lower values of u could be used to control the number of particles selected, at a lower accuracy of beam position. From this point, we refer to the subset of particles conditioned on u · max(f) and its adjacency, calculated for each time step, as "beam candidates". Figure 4 (top) presents projections of Figure 3(b) with their calculated beam candidates emphasized in red. These are the result of our first attempt to improve particle selection by using the minimum volume enclosing ellipsoid algorithm, as in Khachiyan & Todd (1993), which encloses previously selected particles and includes others based on a geometrically defined polytope. Figure 5 illustrates the algorithm applied to LWFA data, showing the selected particles as black dots; these particles are not in the most dense region (red), since the colors refer to the (x, y)-density calculation. When compactness in px is included, the most dense region occurs further ahead. 
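The tolerance-based selection of beam candidate bins reduces to a few lines. This is a sketch; representing the density as a map from bin index to estimated f is an assumption about the data layout.

```python
def beam_candidate_bins(density, u=0.85):
    """Indices of bins whose density is within a tolerance u of the maximum.

    `density` maps bin index -> estimated density f; u = 0.85 follows the
    empirically chosen tolerance in the text, so secondary peaks nearly as
    prominent as the primary one are also kept.
    """
    peak = max(density.values())
    return {b for b, f in density.items() if f >= u * peak}
```

With densities {0: 1.0, 1: 9.0, 2: 8.0, 3: 2.0}, both bin 1 (the primary peak) and bin 2 (a secondary peak above 0.85 · 9.0) are retained.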
As distinct from calculating the center of mass and forcing an ad hoc diameter or semi-major/minor axes, the minimum volume enclosing ellipsoid (MVEE) algorithm [Khachiyan & Todd (1993); Kumar & Yildirim (2005); Moshtagh (2009)] takes the subset of points and prescribes a polytope model to extrapolate a preliminary sub-selection to other particles likely to be in the bunch. The MVEE problem is a semidefinite programming problem whose solution gives a better approximation to the convexity of subsets of
(a) High quality beam (b) Low quality beam
Fig. 6. Lifetime diagram of peaks (centers of beam candidates) evolving in time for Dataset A (left) and D (right), for all time steps.

particles that correspond to compact groups of electrons. After querying hypervolumes similar to the one in Figure 3, we applied the geometrical model to adjust the particle selection, as illustrated in Figure 5. By running the MVEE algorithm, we determine an ellipse covering the data points of the beam candidate region as compactly as possible, increasing the number of samples without increasing the binning parameters. Here, we consider the problem of finding an MVEE that minimizes the logarithm of the determinant of H such that
(x − c)ᵀ H (x − c) ≤ n   (1)

for a set of points x in Rⁿ, an ellipsoid with center c and shape H [Khachiyan & Todd (1993)]. Further discussions on the optimization of this algorithm can be found in Ahipasaoglu et al. (2008). In addition to methods that select beam particles and graphics for each time step, it is often useful to track the bins occupied by the beam candidates using lifetime diagrams. Such a diagram shows the whole simulation, i.e., a global representation of the temporal evolution of the beam candidates in bins. The diagram relates the time steps (t) to the relative position in the simulation window (x) that corresponds to the maximum of f for each time step of the simulation, as calculated in block B3. Figure 6(a) shows earlier time steps containing a bunch of particles that travels at constant speed, with dispersion around t = 32 and the formation of a second bunch around t = 35. Figure 6(b) also shows time steps with the formation of a particle bunch that travels at constant speed earlier in the simulation, but the group disperses around t = 28, followed by the formation of a second bunch around t = 34. The algorithms described in this section focused on locating a subset of simulated particles expected to be at the maximum of a multivariate density distribution that depends on the particle properties. The next section complements the search for the beam by partitioning each time step to find subsets of particles according to statistical modeling and clustering techniques.

2.3.3 Clustering particles (B4)
Data partitioning is a conceptually intuitive method of organizing simulation particles into similar groups in the absence of class labels. Since clustering is an unsupervised learning
method, evaluation and cluster quality assessment are valuable in interpreting classification results. We include both clustering methods and cluster validity techniques applied to particle acceleration to illustrate the applicability of dispersion measures to accurately evaluate an intrinsic structure, namely a coherent bunch of electrons. In order to determine the number of clusters in each time step of the simulation while testing statistical models with different numbers of components, we perform cluster analysis using model-based clustering, where the model and the number of clusters are selected at run time by mclust [Fraley & Raftery (2009)]. The model-based clustering algorithm postulates a statistical model for the samples, which are assumed to come from a mixture of normal probability densities. The calculation of normal mixture models considers different covariance structures and different numbers of clusters [Haughton et al. (2009)] given an objective function (score). Assuming a fixed number of clusters k entails a loss of generality, so we consider a range of k in addition to parameters that control the shape of the classes. These parametric models are flexible in accommodating data, as shown in Fraley & Raftery (2002), and consider widely varying characteristics in estimating distributions. By assuming a normal mixture model, we represent the data d, with n samples and k components, considering a probability τk that an observation belongs to the kth component and a multivariate normal distribution ϕk with mean vector µk and covariance matrix Σk. The likelihood of d, with n independent multivariate observations represented by a Gaussian mixture model with G multivariate mixture components [Fraley & Raftery (2007)], is

∏_{i=1}^{n} ∑_{k=1}^{G} τk ϕk(di | µk, Σk)   (2)

with priors conditioned to τk ≥ 0 and

∑_{k=1}^{G} τk = 1.   (3)
The maximum likelihood estimate uses expectation-maximization (EM) methods, which rely on iterative two-fold processing: an E-step for calculating the conditional probability that an observation belongs to a certain group given the parameters θ_k, and an M-step for computing the parameters that maximize the log-likelihood given the previously calculated conditional probability function [Fraley & Raftery (2002)]. In other words, EM determines the most likely parameters θ_1, ..., θ_k to represent a problem consisting of multivariate observations given by a mixture of k underlying probability distributions [Fraley & Raftery (2006)].

The size of the LWFA datasets can compromise the efficiency of mixture model-based algorithms due to the mclust initialization [Fraley & Raftery (2002)], so we propose a random sampling technique before calculating the mixture model. To illustrate this algorithm, Figure 7 uses artificial data generated by two normal distributions g1(x, y) and g2(x, y) with 100 unlabeled samples each (Figure 7.a). In this example, we subsample the data by extracting a quarter of its original samples and calculate mixture models, varying the structure and the number of the clusters (Figure 7.b). The result of the clustering provides labels for a quarter of the samples (black and red dots in Figure 7.c), and these labels support a supervised learning step to classify the remaining samples as in Figure 7.d, a generalization procedure that extrapolates the “learned” models to the full dataset by using expectation-maximization. Instead of imposing k, which is not known a priori, the objects are associated to each other according to a score that comes from the parameters, unknown quantities to be estimated from the probability distributions [Vermunt & Magidson (2002)]. Figure 7.b shows the calculation
of a score for different k; the maximum value of the curves indicates the value of k that best describes the samples. This process establishes the inference on the sample rather than on the full population [Fraley & Raftery (2002)]. This decision circumvents the bottleneck of the mclust initialization by a sampling strategy to partition large datasets: we propose a biased sampling process that ensures that the beam candidate region is in the sampled subset, by guaranteeing that at least 10% (empirically chosen) of the samples belong to the high-density particle volumes. We cluster the particles using the normal mixture models for values of k ∈ [1, 10] following an ellipsoidal model with variable volume (VEV) [Banfield & Raftery (1993); Fraley & Raftery (2009)]. We have tested other models, such as spherical, diagonal and ellipsoidal, which can have equal or varying volumes and shapes; however, VEV was the best algorithm for most of the time steps in all the datasets, according to the Bayesian information criterion [Fraley & Raftery (2009); Greene et al. (2008)]. We re-ran the experiments to obtain clustering results using VEV only, but with a varying number of clusters k. The resulting clusters from VEV are considered as the training set to classify all the remaining samples, using EM to extrapolate parameters from the training samples. The result of the clustering (B4) is combined with B3 by calculating the intersection between the beam candidates and the cluster that contains most of the particles from the beam candidates.

Fig. 7. Example of model-based clustering applied to clouds of points: (a) two Gaussian distributions with no label assignment; (b) Bayesian information criterion calculation for different numbers of clusters (k) and different models (E = equal volume and V = varying volume); (c) result of classification using a subset (25%) of the data; (d) generalization of the model using expectation-maximization.
In other words, we determine which cluster is most likely to contain the beam candidates by majority voting among all possible clusters, finalizing the tasks in block B4. The block B5 only analyzes the most compact group of particles that remains in each time step.

2.3.4 Cluster quality assessment (B4-B5)
One of our goals in investigating particle simulations is to detect the electron beam and to characterize the dispersion of its particles in terms of spatial and momentum variables using clustering algorithms. Since we do not know a priori the number of clusters that best describes the particle grouping, we need some measure of goodness of fit to evaluate different clustering algorithms. A standard approach is to obtain the number of clusters (k) by maximizing a criterion function and to repeat the clustering procedure for different numbers of clusters.
We select k by maximizing the Bayesian information criterion (BIC) for a parameterized clustering algorithm using mixture models, following an ellipsoidal, varying-volume model. The optimal BIC value considers the log-likelihood, the dimension of the data, and the number of mixture components in the model. The criterion function describes how well a given clustering algorithm matches the data, as a function of the variable k. Herein we evaluate the goodness-of-fit of the clustering algorithms for k groups of particles from each time step using BIC to guide model selection for a set of parameterized mixture models with a varying number of classes. BIC adds a penalty to the log-likelihood [Fraley & Raftery (2006)] by considering the number of parameters in a certain model M and the number of observations (n) in the data set, with the form
    BIC ≡ 2 loglik_M(d, θ_k*) − (#params)_M log(n)    (4)

where loglik_M(d, θ_k*) is the maximized log-likelihood of the model with estimated parameters θ_k* from the observations d, and (#params)_M is the number of independent parameters to be estimated in M.

In addition to the evaluation of the clustering method (B4), we also want to verify whether our framework can capture the physical phenomena of trapping and acceleration, when the beam is expected to be more compact. We propose the inspection of the particles in adjacent time steps using moving averages [Shumway & Stoffer (2006)] to identify whether the electrons are grouped into stable bunches (B5). The moving-averages technique provides a simple way of seeing patterns in time series data; it smooths out short-term fluctuations and highlights longer-term trends. This is physically motivated, as the bunches of interest move at a speed approximately equal to the speed of light, and hence are nearly stationary in the moving simulation window. We intersect particle bunches (b) at adjacent time steps, selecting the particles with the same identifier (id), and calculate statistical parameters (ρ) of a three-point moving average (mv_k), using the following algorithm:

    k ← 1
    for t = 2 to n−1 do
        id_k ← id(b_{t−1}) ∩ id(b_t) ∩ id(b_{t+1})
        mv_k ← (b_{t−1}|id_k + b_t|id_k + b_{t+1}|id_k) / 3
        ρ ← statistics(mv_k)
        k ← k + 1
    end for

The particles of the bunch at time step t−1, b_{t−1}, indexed by id_k, are denoted b_{t−1}|id_k, and the function statistics calculates parameters such as the mean, variance and maximum values from the moving averages. We use the plots in Figures 10, 13 and 14 to check the persistence of particle bunches by looking at the evolution of statistical parameters, as discussed in the next section.
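The pseudocode above can be made concrete with a small Python sketch; the bunch representation (one dict of particle id → measured value, e.g. px, per time step), the sample values and the function name are hypothetical, and the statistics function is reduced here to count, mean and maximum.

```python
def moving_average_stats(bunches):
    """Three-point moving average over particle bunches at adjacent time steps.

    `bunches` is a list of dicts mapping particle id -> a measured quantity;
    only particles present in three consecutive steps contribute, mirroring
    the id intersection of the pseudocode.
    """
    stats = []
    for t in range(1, len(bunches) - 1):
        ids = bunches[t - 1].keys() & bunches[t].keys() & bunches[t + 1].keys()
        if not ids:  # no persistent bunch: report zeros, as in the lifetime plots
            stats.append({"n": 0, "mean": 0.0, "max": 0.0})
            continue
        mv = [(bunches[t - 1][i] + bunches[t][i] + bunches[t + 1][i]) / 3.0
              for i in sorted(ids)]
        stats.append({"n": len(mv), "mean": sum(mv) / len(mv), "max": max(mv)})
    return stats

# Hypothetical px values for three time steps; only ids 1 and 2 persist.
bunches = [{1: 1.0, 2: 2.0, 3: 3.0},
           {1: 1.2, 2: 2.2, 4: 4.0},
           {1: 1.4, 2: 2.4, 5: 5.0}]
print(moving_average_stats(bunches))  # one entry, covering t = 1
```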
3. Results

Here, we apply the above-described algorithms to analyze laser-plasma wakefield acceleration simulations, using the clustering techniques as part of a completely automated pipeline to detect dense particle groups (“electron bunches”). The main contributions of this work, in
comparison with the previous approach in Ushizima et al. (2008), are that we can perform beam detection independent of the quality or energy of the beam. Thus compact groups of particles can have either high or low momentum, instead of only being able to correctly detect groups of particles exhibiting high momentum. This improvement stems from determining the particle distribution using kernel density estimators, which minimizes the sensitivity to bin-size assumptions and placement, enabling accurate detection of maximum values of f(x, y, px). This is in contrast with the previous method, which considered f as a function of x only. Also, while Rübel et al. (2008) relies on user interaction, here we automatically detect compact groups of particles under acceleration. We show that, using the particle x-coordinate relative to the window size, we can keep track of the maximum values of the kernel density functions and represent these points using lifetime diagrams. Figure 6 shows the evolution of peaks from f(x, y, px), which will support future work to restrict the search for compact bunches using clustering to specific regions around the maximum values. Figure 8 illustrates the result of identifying beam candidates from block B3 at a single time step from dataset D, showing the (x, y, px)-coordinates of particles and the respective detected compact groups. The beam candidate region is represented by a cloud of white dots, containing all the particles for which f ≥ 0.85 · max(f) holds. The application of geometrical models such as the MVEE to enclose the detected beam candidates shows how structure assumptions may interfere with the number of particles selected, as illustrated in Figure 5.

Fig. 8. Result of locating high-density bunches for one time step of dataset D: 3D scatter plot of particles; color is proportional to the particle energy (px) and white blobs correspond to the preliminary detection of the beam candidate region as described in Sec. 2.3.2.
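As a hedged illustration of this idea (assuming SciPy is available; the synthetic particle cloud and all variable names are invented here, not the chapter's data), a kernel density estimate over (x, px)-like coordinates avoids any bin-size or bin-placement choice and lets the 0.85 · max(f) threshold be applied directly:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Synthetic sample: a dense "bunch" near (5, 5) on a uniform background.
background = rng.uniform(0.0, 10.0, size=(2, 400))
bunch = rng.normal(5.0, 0.2, size=(2, 100))
particles = np.hstack([background, bunch])

# Smooth density estimate: no histogram bins to tune.
kde = gaussian_kde(particles)
density = kde(particles)            # density evaluated at every particle

# Keep particles at >= 85% of the density maximum, mimicking the
# 0.85 * max(f) rule for the beam candidate region.
candidates = particles[:, density >= 0.85 * density.max()]
```

The selected candidates concentrate around the dense region, regardless of where histogram bin edges would have fallen.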
The advantage of this method is that it expands a previously restrictive selection to other neighboring points that should be included in the beam candidate region. As opposed to an approach that sets a fixed diameter, it also avoids an undesirable impact on the particle spread. We report results considering a geometrical model that encompasses the beam candidate region by calculating the MVEE applied to the preliminary selection of particles, which was mostly consistent with the shape of the bunch. The geometry assumption may
result in the inclusion of outliers if the beam presents different shapes; however, we eliminate outliers during the moving-averages procedure, keeping particles more likely to be part of the electron bunch.

Fig. 9. Result of beam detection for ti = 27 from dataset A: beam candidates in gray from processing in block B3 (top), clustering using mixture models, with colors representing the different partitions over a sampled subset (center), and the final result of electron bunch detection, with an increased number of particles after generalization with the EM algorithm, from block B4 (bottom).

We calculate model-based clusters for each time step, after retrieving the results from blocks B1 and B3. We illustrate the partitions of one time step of all datasets in Figures 9, 11 and 12, showing the phase space of a time step where the beam was expected to be compact. In Figure 9, the two top plots show the result of beam candidate selection, in gray, for dataset A as output by block B3. The two center plots present the different compact groups of particles given by the mixture model, and the bottom plots give the final result of electron bunch selection (B4), emphasized in red. The result of B3 indicates potential clusters of particles, important to guide the sampling and identify the cluster position, but the definition of particle partitions that are connected and compact given x, y, px is only accomplished after B4, which finds
the k-component, varying-volume ellipsoidal mixture model clustering that best represents the particles, using BIC as the criterion function.

Fig. 10. Beam quality assessment to evaluate the dispersion of particle parameters using the time series in dataset A: the curves show the history of one bunch that forms around t = 22, reaching maximum energy around t = 27.
(a) Dataset B, t=21
(b) Dataset C, t=22
Fig. 11. Result of beam detection for ti = 21 from dataset B and ti = 22 from dataset C: beam candidates in gray from processing in block B3 (top), clustering using mixture models, with colors representing the different partitions over a sampled subset (center) and final result of electron bunch detection, after generalization with EM-algorithm, from block B4 (bottom).
Next, we evaluate the compactness of the electron bunch (B5) by calculating moving averages (mv_j) over the time series. Figures 10, 13 and 14 show the result of calculating statistics from mv_j, using the particles selected according to block B4. While the beam detection at each time step may contain outliers, the intersection with adjacent time steps returns the core subset of particles (id_j) that persists for at least three time steps. At the top left of Figure 10, we show the maximum and mean values of px (red and blue curves, respectively) for each time step. The distance between the red curve and the blue curve at each time step is an indicator of the dispersion of the particles in the bunch, as is the length of the yellow arrows (standard deviation of mv_j with respect to x, y, px or py). Also, notice that the moving averages capture the local behavior of a particle bunch that persists for at least three time steps, but do not guarantee that the bunch is present throughout the simulation. There are time steps where the algorithm does not capture any beam, which corresponds to a moving average equal to zero, as in t = [28, 34] from dataset D in Figure 14.a. The period of non-bunch detection, mv_j = 0, corresponds to the presence of peaks in f at different, non-adjacent positions, which is correlated to the dispersion of the particles for that period. A similar interpretation of the particle dispersion in terms of spatial parameters (x and y) and energy (px and py) applies to the other datasets. Figures 10, 13 and 14 demonstrate that the algorithm automatically identifies the bunch over a range of simulation conditions and resulting bunch qualities. Our tests were conducted on an SGI Altix with 32 1.4 GHz Itanium-2 processors and 180 GBytes of shared memory.
The primary motivation for using this computing system is the large memory; the current implementation of the mixture model clustering algorithms in the mclust package is fairly memory-intensive and does not work on standard workstations for large datasets. The SGI Altix is a multi-user machine, thus computing times in different stages of the framework are approximate. Our process of computing beam candidate regions (block B3) is reasonably fast and could easily be incorporated into routine inspection as a preprocessing step. The clustering computation is more expensive, and new implementations are necessary to improve performance. The approximate computing times of beam candidate detection (in seconds) and clustering (in minutes) for each dataset are organized as pairs, with times in parentheses: A=(15.6s, 31min), B=(66s, 20min), C=(24.3s, 42min), D=(295.8s, 116min) and E=(417.4s, 975min).
4. Conclusions and Future Work

Previous investigations by Ushizima et al. (2008) and Rübel et al. (2008) to find particle bunches reported results using a fixed spatial tolerance around centers of maximum compactness and assumed ad hoc thresholding values to determine potential particle candidates involved in the physical phenomena of interest. Ushizima et al. (2008) pointed out limitations inherent to techniques that detect maximum values using only a one-dimensional spatial approach (x-axis), which did not capture the most condensed structure when confined to depressions between peaks in px or when dispersed in y. The current approach circumvents most of these problems, since the algorithm searches for compact, high-density groups of particles using both spatial information, x and y, and momentum in the direction of laser propagation, px. We improved the detection of a high-density volume of particles by using the 3D kernel density, followed by the detection of its maximum and the enclosure of the particle subsets using the MVEE, thus generating subsets of particles which are beam candidate regions. These subsets provided the position of the cluster most likely to contain a compact electron bunch in a time step. We proposed the use of moving averages to
identify periods of bunch stability in the time series, and we derived dispersion measures to characterize beam compactness and quality. Our implementation of function calls to the HDF5-FastQuery interface allowed us to load data using FastBit in R, saving time by loading only the subsets of particles that potentially participate in the phenomenon of interest. Our results showed that we can assess the beam evolution using both mathematical models and machine learning techniques to automate the search for the beam in large LWFA simulation datasets. Application of hierarchical approaches, as in the R packages hclust and mclust, is prohibitive if not combined with sampling methods. We present an algorithm to sample the simulation data, but Monte Carlo methods [Banfield & Raftery (1993)] could be used, adding a repetitive randomness process as a way of guaranteeing representation of a beam candidate region and improving accuracy. Future evaluations may consider more sophisticated methods such as Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), as in Zhang et al. (1996), and hierarchical clustering based on granularity, as in Liang & Li (2007), which are designed for very large data sets. Further investigation should also include subspace clustering, as in Kriegel et al. (2009), since the large simulation datasets contain target regions that can be determined using the techniques proposed in our framework.

(a) Dataset D, t=19
(b) Dataset E, t=23
Fig. 12. Result of beam detection for ti = 19 from dataset D and ti = 23 from dataset E: beam candidates in gray from processing in block B3 (top), clustering using mixture models, with colors representing the different partitions over a sampled subset (center), and the final result of electron bunch detection, after generalization with the EM algorithm, from block B4 (bottom).
5. Acknowledgments

This work was supported by the Director, Office of Advanced Scientific Computing Research, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 through the Scientific Discovery through Advanced Computing (SciDAC) program’s Visualization and Analytics Center for Enabling Technologies (VACET) and by the U.S. DOE Office
of Science, Office of High Energy Physics, grant DE-FC02-07ER41499, through the COMPASS SciDAC project, and by the U.S. DOE Office of Energy Research, Applied Mathematical Science subprogram, under Contract Number DE-AC03-76SF00098. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We also thank the VORPAL development team for ongoing efforts in development and maintenance on a variety of supercomputing platforms, including those at NERSC [NERSC (2009)].

(a) Dataset B
(b) Dataset C
Fig. 13. Beam quality assessment to evaluate the dispersion of particle parameters using the time series in: (a) dataset B: the curves show the history of one bunch that forms around t = 23, reaching maximum energy around t = 33; (b) dataset C: the curves show the history of one bunch that forms around t = 21, reaching maximum energy around t = 33.
(a) Dataset D
(b) Dataset E
Fig. 14. Beam quality assessment to evaluate the dispersion of particle parameters using the time series in: (a) dataset D: the curves show the history of two bunches: one that forms at the beginning of the simulation, compact and of lower energy (t = [18, 27]), and a second one with broader dispersion in px and higher energy (t = [36, 44]); the beam is not detected by the algorithm from t = [28, 34], represented by zero values in the four graphs; (b) dataset E: the curves show the history of two bunches that form around t = 23, reaching maximum energy and compactness around t = 34; the beam is not detected by the algorithm from t = [21, 22] and t = [29, 31], represented by zero values.
6. References

Adelmann, A., Gsell, A., Oswald, B., Schietinger, T., Bethel, E. W., Shalf, J., Siegerist, C. & Stockinger, K. (2007). Progress on H5Part: A Portable High Performance Parallel
Data Interface for Electromagnetic Simulations, Particle Accelerator Conference PAC07, 25–29 June. http://vis.lbl.gov/Publications/2007/LBNL-63042.pdf.
Adelmann, A., Ryne, R., Shalf, J. & Siegerist, C. (2005). H5Part: A portable high performance parallel data interface for particle simulations, Particle Accelerator Conference PAC05, May 16–20.
Ahipasaoglu, S. D., Sun, P. & Todd, M. J. (2008). Linear convergence of a modified Frank-Wolfe algorithm for computing minimum-volume enclosing ellipsoids, Optimization Methods Software 23(1): 5–19.
Bagherjeiran, A. & Kamath, C. (2006). Graph-based methods for orbit classification, SDM.
Banfield, J. D. & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering, Biometrics 49: 803–821.
Birdsall, C. K., Langdon, A. B., Vehedi, V. & Verboncoeur, J. P. (1991). Plasma Physics via Computer Simulations, Adam Hilger, Bristol, Eng.
FastBit (2009). FastBit: An efficient compressed bitmap index technology, https://codeforge.lbl.gov/projects/fastbit/.
Feng, D. & Tierney, L. (2009). Miscellaneous 3D plots, http://cran.r-project.org/web/packages/misc3d/misc3d.pdf.
Fraley, C. & Raftery, A. (2006). Mclust version 3 for R: Normal mixture modeling and model-based clustering, Technical Report no. 504.
Fraley, C. & Raftery, A. (2009). Model-based clustering / normal mixture modeling: the mclust package, http://www.stat.washington.edu/fraley/mclust.
Fraley, C. & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association 97: 611–631.
Fraley, C. & Raftery, A. E. (2007). Model-based methods of classification: using the mclust software in chemometrics, Journal of Statistical Software 18(6): 1–13.
Geddes, C. G. R. (2005). Plasma Channel Guided Laser Wakefield Accelerator, PhD thesis, University of California, Berkeley.
Geddes, C. G. R., Bruhwiler, D. L., Cary, J. R., Mori, W. B., J.L. Vay, S. F. M., Katsouleas, T., Cormier-Michel, E., Fawley, W.
M., Huang, C., Wang, X., Cowan, B., Decyk, V. K., Esarey, E., Fonseca, R. A., Lu, W., Messmer, P., Mullowney, P., Nakamura, K., Paul, K., Plateau, G. R., Schroeder, C. B., Silva, L. O., Toth, C., Tsung, F. S., Tzoufras, M., Antonsen, T., Vieira, J. & Leemans, W. P. (2008). Computational studies and optimization of wakefield accelerators, J. Phys.: Conf. Ser. 125: 1–11.
Geddes, C. G. R., Cormier-Michel, E., Esarey, E. H., Schroeder, C. B., Vay, J.-L., Leemans, W. P., Bruhwiler, D. L., Cary, J. R., Cowan, B., Durant, M., Hamill, P., Messmer, P., Mullowney, P., Nieter, C., Paul, K., Shasharina, S., Veitzer, S., Weber, G., Rübel, O., Ushizima, D., Prabhat, Bethel, E. W. & Wu, K. (2009). Large Fields for Smaller Facility Sources, SciDAC Review 13.
Geddes, C. G. R., Toth, C., van Tilborg, J., Esarey, E., Schroeder, C., Bruhwiler, D., Nieter, C., Cary, J. & Leemans, W. (2004). High-Quality Electron Beams from a Laser Wakefield Accelerator Using Plasma-Channel Guiding, Nature 431: 538–541. LBNL-55732.
Gentleman, R. & Ihaka, R. (2009). The R project for statistical computing, http://www.r-project.org.
Gosink, L., Shalf, J., Stockinger, K., Wu, K. & Bethel, E. W. (2006). HDF5-FastQuery: Accelerating Complex Queries on HDF Datasets using Fast Bitmap Indices, Proceedings of the 18th International Conference on Scientific and Statistical Database Management, IEEE Computer Society Press. LBNL-59602.
Greene, D., Cunningham, P. & Mayer, R. (2008). Unsupervised learning and clustering, Machine Learning Techniques for Multimedia, pp. 51–90.
H5Part (2009). H5Part: a portable high performance parallel data interface to HDF5, https://codeforge.lbl.gov/projects/h5part/.
Haughton, D., Legrand, P. & Woolford, S. (2009). Review of three latent class cluster analysis packages: Latent Gold, poLCA and mclust, The American Statistician 63(1): 81–91.
HDF5-FastQuery (2009). HDF5-FastQuery: Accelerating complex queries on HDF datasets using fast bitmap indices, http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/index.html.
Kamath, C. (2009). Scientific Data Mining: A Practical Perspective, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, USA.
Khachiyan, L. & Todd, M. (1993). On the complexity of approximating the maximal inscribed ellipsoid for a polytope, Math. Program. 61(2): 137–159.
Kriegel, H., Kröger, P. & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Trans. Knowl. Discov. Data 3(1): 1–58.
Kumar, P. & Yildirim, E. A. (2005). Minimum-volume enclosing ellipsoids and core sets, Journal of Optimization Theory and Applications 126: 1–21.
Liang, J. & Li, G. (2007). Hierarchical clustering algorithm based on granularity, GrC, IEEE, pp. 429–432.
Love, N. S. & Kamath, C. (2007). Image analysis for the identification of coherent structures in plasma, Applications of Digital Image Processing, edited by Tescher, Andrew G., Proceedings of the SPIE, Vol. 6696.
Messmer, P. & Bruhwiler, D. L. (2006). Simulating laser pulse propagation and low-frequency wave emission in capillary plasma channel systems with a ponderomotive guiding center model, Phys. Rev. ST Accel. Beams 9(3): 031302.
Moshtagh, N. (2009). Minimum volume enclosing ellipsoid, http://www.mathworks.com/matlabcentral/fileexchange/9542.
NERSC (2009).
National Energy Research Scientific Computing Center, http://www.nersc.gov/.
Nieter, C. & Cary, J. R. (2004). VORPAL: a versatile plasma simulation code, J. Comput. Phys. 196(2): 448–473.
Pukhov, A. & ter Vehn, J. M. (2002). Three-dimensional particle-in-cell simulations of laser wakefield experiments, Applied Physics B: Lasers and Optics 74(4-5): 355–361.
Rübel, O., Geddes, C. G., Cormier-Michel, E., Wu, K., Prabhat, Weber, G. H., Ushizima, D. M., Messmer, P., Hagen, H., Hamann, B. & Bethel, W. (2009). Automatic beam path analysis of laser-wakefield particle acceleration data. In submission.
Rübel, O., Prabhat, Wu, K., Childs, H., Meredith, J., Geddes, C. G. R., Cormier-Michel, E., Ahern, S., Weber, G. H., Messmer, P., Hagen, H., Hamann, B. & Bethel, E. W. (2008). High performance multivariate visual data exploration for extremely large data, SuperComputing 2008 (SC08), Austin, Texas, USA.
Samperi, D. (2006). RcppTemplate, http://cran2.arsmachinandi.it/doc/packages/RcppTemplate.pdf.
Shumway, R. H. & Stoffer, D. S. (2006). Time Series Analysis and Its Applications (Springer Texts in Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Tajima, T. & Dawson, J. M. (1979). Laser electron accelerator, Physical Review Letters 43(4): 267–270.
Tsung, F., Antonsen, T., Bruhwiler, D., Cary, J., Decyk, V., Esarey, E., Geddes, C., Huang, C., Hakim, A., Katsouleas, T., Lu, W., Messmer, P., Mori, W., Tzoufras, M. & Vieira, J. (2007). Three-dimensional particle-in-cell simulations of laser wakefield experiments, J. Phys.: Conf. Ser. 78(1): 012077. URL: http://dx.doi.org/10.1088/1742-6596/78/1/012077
Tsung, F. S., Narang, R., Mori, W. B., Joshi, C., Fonseca, R. A. & Silva, L. O. (2004). Near-GeV-energy laser-wakefield acceleration of self-injected electrons in a centimeter-scale plasma channel, Phys. Rev. Lett. 93(18): 185002.
Ushizima, D., Rübel, O., Prabhat, Weber, G., Bethel, E. W., Aragon, C., Geddes, C., Cormier-Michel, E., Hamann, B., Messmer, P. & Hagen, H. (2008). Automated Analysis for Detecting Beams in Laser Wakefield Simulations, 2008 Seventh International Conference on Machine Learning and Applications, Proceedings of IEEE ICMLA'08. LBNL-960E.
Vermunt, J. & Magidson, J. (2002). Latent class cluster analysis, Applied Latent Class Analysis, pp. 89–106.
VisIt (2009). VisIt - free interactive parallel visualization and graphical analysis tool, https://wci.llnl.gov/codes/visit/.
Wand, M. P. & Jones, M. C. (1995). Kernel Smoothing, Chapman and Hall/CRC.
Weissbach, R. & Gefeller, O. (2009). A rule-of-thumb for the variable bandwidth selection in kernel hazard rate estimation.
Wu, K., Otoo, E. & Shoshani, A. (2004). On the performance of bitmap indices for high cardinality attributes, VLDB, pp. 24–35.
Wu, K., Otoo, E. & Shoshani, A. (2006). Optimizing bitmap indices with efficient compression, ACM Transactions on Database Systems 31: 1–38.
Yip, K. M. (1991). KAM: A System for Intelligently Guiding Numerical Experimentation by Computer, MIT Press.
Zhang, T., Ramakrishnan, R. & Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases, SIGMOD '96: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, pp. 103–114.
URL: http://dx.doi.org/10.1145/235968.233324
20
Specificity Enhancement in microRNA Target Prediction through Knowledge Discovery

Yanju Zhang¹, Jeroen S. de Bruin² and Fons J. Verbeek¹

¹Imaging and Bioinformatics group, ²Algorithms group, Leiden University, Leiden, The Netherlands
1. Introduction

In this chapter we explore and investigate a range of methods in pursuit of improving microRNA target prediction. The currently available prediction methods produce a large output set that also includes a rather high number of false positives. Additional strategies for target prediction are therefore necessary, and we elaborate on one particular group of microRNAs, i.e. those that might bind to the same target. We intend to transfer our approach to other groups of microRNAs, as well as to broader application to the important model species.

microRNAs (miRNAs) are a novel class of post-transcriptional gene expression regulators discovered in the genomes of plants, animals and viruses. The mature miRNAs are about 22 nucleotides long. They bind to their target messenger RNA (mRNA) and thereby induce translational repression or degradation of the target mRNAs (Enright et al., 2003; Bartel, 2004). Recent studies have elucidated that these small molecules are highly conserved between species, indicating fundamental roles preserved by evolutionary selection. They are implicated in developmental timing regulation (Reinhart et al., 2000), apoptosis (Brennecke et al., 2003) and cell proliferation (Lecellier et al., 2005). Some of them have been described to act as potential tumor suppressors (Johnson et al., 2005) or potential oncogenes (He et al., 2005), and might be important targets for drugs (Maziere & Enright, 2007).

The identification of the large number of miRNAs existing in different species has increased the interest in unraveling the mechanism of this regulator. It has been proven that more than one miRNA can regulate one target and vice versa (Enright et al., 2003). Therefore, understanding this novel network of regulatory control is highly dependent on the identification of miRNA targets.
Due to the costly, labor-intensive nature of the experimental techniques required, there is currently no large-scale experimental target validation available, leaving the biological function of the majority of miRNAs completely unknown (Enright & Griffiths-Jones, 2007). These limitations of the wet experiments have led to the development of computational prediction methods. It has been established that the physical RNA interaction requires sequence complementarity and thermodynamic stability. Unlike plant miRNAs, which bind to their
targets through near-perfect sequence complementarity, the interaction between animal miRNAs and their targets is more flexible. Partial complementarity is frequently found (Enright et al., 2003), and this flexibility complicates computation. Much effort has been put into characterizing functional miRNA-target pairing. The most frequently used prediction algorithms are miRanda, TargetScan/TargetScanS, RNAhybrid, DIANA-microT, PicTar and miTarget. miRanda (Enright et al., 2003) is one of the earliest large-scale target prediction algorithms; it was first designed for Drosophila and then adapted for human and other vertebrates. It consists of three steps: first, a dynamic programming local alignment is carried out between the miRNAs and the 3'UTRs of potential targets using a scoring matrix. After filtering by a threshold score, the resulting binding sites are evaluated thermodynamically using the Vienna RNA fold package (Wuchty, 1999). Finally, the miRNA-target pairs that are conserved across species are kept. TargetScan/TargetScanS (Lewis et al., 2003; Lewis et al., 2005) place a stronger emphasis on the seed region. In the standard version of TargetScan, the predicted target sites first require a 7-nucleotide (nt) match to the seed region of the miRNA, i.e. nucleotides 2-8; second, conservation in 4 genomes (human, mouse, rat and puffer fish); and third, thermodynamic stability. TargetScanS is the new and simplified version of TargetScan. It extends the cross-species comparison to 5 genomes (human, mouse, rat, dog and chicken) and requires a seed match of only 6 nt (nucleotides 2-7). Through the requirement of more stringent species conservation it achieves more accurate predictions even without conducting free energy calculations. RNAhybrid (Rehmsmeier et al., 2004) was the first method to integrate powerful statistical models for large-scale target prediction.
Basically, this method finds the energetically most favorable hybridization sites of a small RNA in a large RNA sequence. It takes candidate target sequences and a set of miRNAs and looks for energetically favorable binding sites. Statistical significance is evaluated with extreme value statistics of length-normalized minimum free energies for individual hits, a Poisson approximation of multiple hits, and the calculation of effective numbers of orthologous targets in comparative studies of multiple organisms. Results are filtered according to p-value thresholds. DIANA-microT identifies putative miRNA-target interactions using a modified dynamic programming algorithm with a sliding window of 38 nucleotides that calculates the binding energy between two imperfectly paired RNAs. After filtering by an energy threshold, the candidates are examined with rules derived from mutation experiments on a single let-7 binding site. Finally, those which are conserved between human and mouse are further considered for experimental verification (Grun & Rajewsky, 2007; Sethupathy et al., 2007). PicTar takes sets of co-expressed miRNAs and searches for combinations of miRNA binding sites in each 3'UTR (Krek et al., 2005). miTarget is a support vector machine classifier for miRNA target-gene prediction, which utilizes a radial basis function kernel to characterize targets by structural, thermodynamic and position-based features (Kim et al., 2006). Among the algorithms discussed above, miRanda and TargetScan/TargetScanS are sequence-based algorithms which evaluate miRNA-target complementarity first and then calculate the binding-site thermodynamics to prioritize further; in contrast, DIANA-microT and RNAhybrid are rooted in thermodynamics, using thermodynamics as the initial indicator of a potential miRNA binding site.
Specificity Enhancement in microRNA Target Prediction through Knowledge Discovery
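The seed-match criterion that TargetScan-style methods build on can be illustrated with a small sketch. The sequences below are toy examples invented for illustration, and real tools additionally score conservation and thermodynamics:

```python
# Simplified seed-match scan in the spirit of TargetScan: a site is a
# candidate if the 3'UTR contains the reverse complement of the miRNA
# seed (nucleotides 2-8). Conservation and free-energy filters are omitted.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_sites(mirna, utr, seed_start=2, seed_end=8):
    """Return 0-based positions in `utr` that pair with the miRNA seed."""
    seed = mirna[seed_start - 1:seed_end]            # nucleotides 2-8 (7 nt)
    # The target site pairs antiparallel, so reverse-complement the seed.
    site = "".join(COMPLEMENT[nt] for nt in reversed(seed))
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]

# Toy sequences (invented for illustration):
mir = "UGAGGUAGUAGGUUGUAUAGUU"    # a let-7-like miRNA
utr = "AAACUACCUCAAAA"            # contains the complement of its seed
print(seed_sites(mir, utr))       # → [3]
```

Real pipelines additionally distinguish site types and scan whole 3'UTR sets; this sketch only captures the core complementarity test.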
Until now, it remains unclear whether sequence or structure is the better predictor of a miRNA binding site (Maziere & Enright, 2007). All of the above-mentioned methods produce a large set of predictions with a relatively high false positive ratio; all in all, this indicates that these methods are promising but still far from perfect. The estimated false-positive rates (FPR) for PicTar, miRanda and TargetScan are about 30%, 24-39% and 22-31%, respectively (Bentwich, 2005; Sethupathy et al., 2006b; Lewis et al., 2003). It has been reported that miTarget performs similarly to TargetScan (Kim et al., 2006). In addition to the relatively high FPR, Enright et al. observed that many real targets are not predicted by these methods, which seems to be largely due to the requirement of evolutionary conservation of the putative miRNA target site across different species (Enright et al., 2003; Martin, 2007). In general we also notice that in all of these algorithms the target prediction is based on features of the miRNA-target interaction, such as sequence complementarity and the stability of the miRNA-target duplex. Through observations in the population of confirmed miRNA targets we became aware that some miRNAs are validated as binding the same target. For example, in human, miR-17 and miR-20a both regulate the expression of E2F transcription factor 1 (E2F1), while miR-221 and miR-222 both bind to the v-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog (KIT). We therefore considered that this observation would allow target identification from the analysis of functionally similar miRNAs. Based on this idea, we present an approach which analyzes miRNA-miRNA relationships and utilizes them for target prediction. Our aim is to improve target prediction by using different features and discovering significant feature patterns through tuning and combining several machine learning techniques.
To this end, we applied feature selection, principal component analysis, classification, decision trees and propositionalization-based relational subgroup discovery to reveal the feature patterns among known miRNAs. During this procedure, different data setups were evaluated and the parameters were optimized. Furthermore, the derived rules were applied to functionally unknown miRNAs to see whether new targets could be predicted. In the analysis of functionally similar miRNAs, we found that genomic distance and the seed and overall sequence similarities between miRNAs are the dominant features in the description of a group of miRNAs binding the same target. Application of one specific rule resulted in the prediction of targets for five functionally unknown miRNAs, which were also detected by some of the existing methods. Our method is complementary to the existing prediction approaches. It contributes to the improvement of target identification by predicting targets with high specificity and without the conservation limitation. Moreover, we discovered that knowledge discovery, especially propositionalization-based relational subgroup discovery, is suitable for this application domain, since it can interpret patterns of miRNAs with similar function with respect to the limited features available. The remainder of this chapter is organized as follows. In Section 2, miRNA biology and databases as well as the background of the machine learning techniques which are the components of our method are explained: i.e., miRNA biogenesis and function, related databases, feature selection, principal component analysis, classification, decision trees and propositionalization-based relational subgroup discovery. Section 3 specifies the proposed method, including data preparation, algorithm configuration and parameter optimization. The results are summarized in Section 4. Finally, in Section 5, we discuss the strengths and
the weaknesses of the applied machine learning techniques and the feasibility of the derived miRNA target prediction rules.
2. Background
The first two subsections are devoted to the exploration of miRNA biology, whereas the latter two subsections have a computational nature.
2.1 microRNA biogenesis and function
The mature miRNAs are ~22-nucleotide single-stranded noncoding RNA molecules derived from miRNA genes. First, the miRNA gene is transcribed to a primary miRNA transcript (pri-miRNA), which is between a few hundred and a few thousand base pairs long. Subsequently, this pri-miRNA is processed into a hairpin precursor (pre-miRNA) of approximately 70 nucleotides by a protein complex consisting of the nuclease Drosha and the double-stranded RNA binding protein Pasha. The pre-miRNA is then transported to the cytoplasm and cut into small RNA duplexes of approximately 22 nucleotides by the endonuclease Dicer. Finally, either the sense or the antisense strand can function as a template giving rise to the mature miRNA. Upon binding to the active RISC complex, mature miRNAs interact with their target mRNA molecules through base-pair complementarity, thereby inhibiting translation or sometimes inducing mRNA degradation (Chen, 2005). Fig. 1 illustrates the biogenesis and function of miRNAs; for simplicity, the auxiliary protein complexes are not included in the picture.
Fig. 1. Simplified illustration of miRNA biogenesis and function. miRNA genes are first transcribed to pre-miRNA and then processed to mature miRNAs. Upon binding of these miRNAs through sequence complementarity, the messenger RNAs (mRNAs), which are called the targets of the miRNAs, are either degraded or their translation is inhibited.
It is suggested that miRNAs tend to bind the 3' UTR (3' untranslated region) of their target mRNAs (Lee et al., 1993). Further studies have shown that positions 2-8 of the miRNA, the so-called 'seed' region, are a key specificity determinant of binding and require good or perfect complementarity (Lewis et al., 2003; Lewis et al., 2005). In Fig. 1, a detailed miRNA-target interaction is shown with a highlighted seed region.
2.2 miRNA databases
miRBase: miRBase is the primary online repository for published miRNA sequence data, annotation and predicted gene targets (Griffiths-Jones et al., 2006; Griffiths-Jones, 2004). It consists of three parts: 1. The miRBase Registry acts as an independent authority for miRNA gene nomenclature, assigning names prior to publication of novel miRNA sequences. 2. The miRBase Sequences is a searchable database of miRNA sequence data and annotation. The latest version (Release 13.0, March 2009) contains 9539 entries representing hairpin precursor miRNAs, expressing 9169 mature miRNA products, in 103 species including primates, rodents, birds, fish, worms, flies, plants and viruses. 3. The miRBase Targets is a comprehensive database of predicted miRNA target genes. The core prediction algorithm currently is miRanda (version 5.0, Nov 2007). It searches over 2500 animal miRNAs against over 400 000 3'UTRs from 17 species for potential target sites. For human, the current version predicts 34788 targets for 851 human miRNAs.
TarBase: TarBase is a comprehensive repository of a manually curated collection of experimentally supported animal miRNA targets (Sethupathy et al., 2006a; Papadopoulos et al., 2008). It describes each supported target site by the miRNA which binds it, the target gene, the direct and indirect experiments that were conducted to validate it, the binding site complementarity, etc.
The latest version (TarBase 5.0, June 2008) records more than 1300 experimentally supported miRNA-target interactions for human, mouse, rat, zebrafish, fruitfly, worm, plant and virus. As machine learning methods become more popular, this database provides a valuable resource for training and testing machine learning based target prediction algorithms.
2.3 Pattern recognition
Pattern recognition is considered a sub-topic of machine learning. It concerns the classification of data based either on a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements, features or observations defining data points in an appropriate multidimensional space. Our pattern recognition proceeds in three stages: feature reduction, classification and cross-validation.
Feature reduction: Feature reduction includes feature selection and feature extraction. Feature selection is the technique of selecting a subset of relevant features for building learning models; feature extraction, in contrast, seeks a linear or nonlinear transformation of the original variables to a smaller set. Not all features are used, both for performance reasons and to make the results easier to understand and more general. Sequential backward selection is a feature selection algorithm. It starts with the entire set, and
then removes one feature at a time such that the remaining subset performs best. Principal component analysis (PCA) is an unsupervised linear feature extraction algorithm. It derives new variables, in decreasing order of importance, that are linear combinations of the original variables, are uncorrelated and retain as much variation as possible (Webb, 2002).
Classification: Classification is the process of assigning labels to data records based on their features. Typically, the process starts with a training dataset whose examples are already classified. These records are presented to the classifier, which trains itself to predict the right outcome from that set. After that, a testing set of unclassified data is presented to the classifier, which classifies all entries based on its training. Finally, the classification is inspected: the better the classifier, the more correct classifications it has made. The linear discriminant classifier (LDC) and the quadratic discriminant classifier (QDC) are two frequently used classifiers which separate measurements of two or more classes of objects or events by a linear or a quadratic surface, respectively.
Cross-validation: Cross-validation is the process of repeatedly partitioning a dataset into a training set and a testing set. When the dataset is partitioned into n parts we call this n-fold cross-validation. After partitioning the set into n parts, the classifier is trained on n−1 parts and tested on the remaining part. This process is repeated n times, each time with a different part functioning as the testing part. The n results from the folds can then be averaged to produce a single error estimate.
2.4 Knowledge discovery
Knowledge discovery is the process of searching large volumes of data for patterns in order to find understandable knowledge about the data. In our knowledge discovery strategy, decision trees and relational subgroup discovery are applied.
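The n-fold cross-validation procedure described above can be sketched in a few lines. Here a nearest-centroid classifier stands in for the LDC (the experiments in this chapter used PRTools' classifiers in MATLAB), and the data are synthetic:

```python
# Plain n-fold cross-validation; a nearest-centroid classifier stands in
# for the linear discriminant (the chapter used PRTools' ldc/qdc in MATLAB).
import random

def fit_centroids(train):
    """train: list of (feature_vector, label) -> per-class mean vectors."""
    sums, counts = {}, {}
    for x, y in train:
        counts[y] = counts.get(y, 0) + 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    sq_dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def cross_validate(data, n_folds=5, seed=0):
    """Average accuracy over n folds; each fold is held out for testing once."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    accs = []
    for k in range(n_folds):
        train = [d for j, fold in enumerate(folds) if j != k for d in fold]
        model = fit_centroids(train)
        accs.append(sum(predict(model, x) == y for x, y in folds[k]) / len(folds[k]))
    return sum(accs) / n_folds

# Two well-separated synthetic classes -> accuracy 1.0
data = [((i, i), "pos") for i in range(10)] + [((i + 20.0, i), "neg") for i in range(10)]
print(cross_validate(data))
```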
Decision tree: The decision tree (Witten & Frank, 1999) is a common machine learning algorithm used for classification and prediction. It represents rules in the form of a tree structure consisting of leaf nodes, decision nodes and edges. The algorithm starts by finding the attribute with the highest information gain, which best separates the classes, and splits the data into groups accordingly. Ideally, this process is repeated until all leaves are pure.
Relational subgroup discovery: Subgroup discovery belongs to descriptive induction (Zelezny & Lavrac, 2006), which discovers patterns described in the form of individual rules. Relational subgroup discovery (RSD) is an algorithm which takes relational datasets as input and generates subgroups whose class distributions differ substantially from that of the complete dataset with respect to the property of interest (Lavrac et al., 2003). The principle of RSD can be summarized as follows: first, features are constructed through first-order feature construction, and features covering no examples are discarded. Second, rules are induced using the weighted relative accuracy heuristic and the weighted covering algorithm. Finally, the induced rules are evaluated by employing the combined probabilistic classifications of all subgroups and the area under the receiver operating characteristic (ROC) curve (Fawcett, 2006). The key improvement of RSD is the application of the weighted relative accuracy heuristic and the weighted covering algorithm, i.e.

WRAcc(H←B) = p(B) · (p(H|B) − p(H))    (1)
The weighted relative accuracy heuristic is defined in Equation 1. In a rule H←B, H stands for the head, representing a class, while B denotes the body, which consists of one first-order feature or a conjunction of them; p is the probability function. As the equation shows, weighted relative accuracy consists of two components: the weight p(B) and the relative accuracy p(H|B) − p(H). The second term, relative accuracy, is the gain of the conditional probability of class H, given that body B is satisfied, over the prior probability of class H. A rule is only interesting if it improves upon the default rule H←true (Zelezny & Lavrac, 2006). In the weighted covering algorithm, covered positive examples are not deleted from the current training set, as is the case in the classical covering algorithm. Instead, in each run of the covering loop, the examples are given decreasing weights as the number of iterations increases. In doing so, it is possible to discover more substantial significant subgroups and thereby find interesting subgroup properties of the entire population.
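Equation 1 is straightforward to compute. A minimal sketch on a toy set of examples (the rule body B and the class labels below are invented for illustration):

```python
# Weighted relative accuracy (Eq. 1): WRAcc(H <- B) = p(B) * (p(H|B) - p(H)).
# Each example is a pair (covered_by_body_B, belongs_to_class_H).
def wracc(examples):
    n = len(examples)
    p_b = sum(b for b, _ in examples) / n
    p_h = sum(h for _, h in examples) / n
    covered = [h for b, h in examples if b]
    p_h_given_b = sum(covered) / len(covered) if covered else 0.0
    return p_b * (p_h_given_b - p_h)

# Toy rule: B covers 4 of 10 examples, all of them positive, and the data
# contain 5 positives overall -> 0.4 * (1.0 - 0.5) = 0.2
examples = [(True, True)] * 4 + [(False, True)] + [(False, False)] * 5
print(wracc(examples))            # → 0.2
```

A rule covering many examples with a class distribution no better than the prior scores near zero, which is exactly the trade-off the heuristic encodes.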
3. Experimental setups, methods and materials
3.1 Data collection
In the interest of including maximally useful data, human miRNAs were chosen as the research focus. The latest version of TarBase (TarBase V5, released 06/2008) includes 1093 experimentally confirmed human miRNA-target interactions. Among them, 243 are supported by a direct experiment such as an in vitro reporter gene (luciferase) assay, while the rest are validated by indirect experimental support such as microarrays. Considering that indirect experiments could introduce candidates which merely lie downstream in the pathways the miRNA is involved in, it is uncertain whether these actually interact with the miRNA. They are therefore excluded, and only the miRNA-target interactions with direct experimental support are used in this study. We observed that some miRNAs are validated as binding the same target. Following this observation, we pair miRNAs as positive if they bind the same target, and randomly couple the rest as the negative data set. In total, there are 93 positive pairs. After checking the consistency of the miRNA names and removing redundant data (for example, miR-26 and miR-26-1 refer to the same miRNA), 73 pairs are kept and another 73 negative pairs are generated accordingly. For quality control, the data generation step is repeated 10 times and each set is tested individually in the subsequent analysis. Here we clarify two notions: known miRNAs are those whose function is known and which have been validated as having at least one target; unknown miRNAs are those for which the targets are unknown.
3.2 Feature collection
In the study of the miRNA-target interaction, it has been established that the physical binding requires sequence complementarity and thermodynamic stability. Here some of the miRNA-target interaction features are carried over to the study of functionally similar miRNA pairs.
We predefine four features: overall sequence (~22 nt) similarity, seed (positions 2-8) similarity, non-seed (position 9 to the end) similarity and genomic distance. The seed has been proven to be an important region of the miRNA-target interaction which displays an almost perfect match to the target sequence (Karginov et al., 2007); thus we suggest that seed similarity
between miRNAs is a potentially important feature. Additionally, including the non-seed and overall sequence similarity features enables us to investigate the behavior of these two regions. Genomic distance, defined as the base-pair distance between two genes, is not a well investigated feature. The idea of investigating the genomic distance between miRNAs derives from our former study: through statistical methods and heterogeneous data support, we previously demonstrated that the genomic location feature plays a role in the miRNA-target interaction for a selection of miRNA families (Zhang et al., 2007). Here we extend this idea to the study of miRNA relationships based on genomic distance. In the data preparation, sequence similarity is calculated using the EBI pairwise global sequence alignment tool Needle (Sankoff & Kruskal, 1999). Genomic sequence and location are retrieved from the miRBase Sequence database. The distance between two miRNAs is calculated by subtraction of genomic positions when they are located on the same chromosome; otherwise it is set to undefined.
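The two simplest of these features can be sketched as follows. The coordinates and sequences below are invented for illustration, and the naive position-wise identity count merely stands in for the Needle global alignment actually used here:

```python
# The genomic distance feature and a crude similarity stand-in. The
# coordinates and sequences below are invented; in the chapter, sequence
# similarities come from the EBI Needle global aligner, not this naive count.
def genomic_distance(m1, m2):
    """Base-pair distance, or None ('undefined') on different chromosomes."""
    if m1["chrom"] != m2["chrom"]:
        return None
    return abs(m1["start"] - m2["start"])

def percent_identity(a, b):
    """Position-wise identity (%), a toy substitute for a global alignment."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))

mir_a = {"chrom": "13", "start": 91350605, "seq": "UAAGGUGCAUCUAGUGCAGAUAG"}
mir_b = {"chrom": "13", "start": 91350891, "seq": "UAAGGUGCAUCUAGUGCAGUUAG"}
print(genomic_distance(mir_a, mir_b))                          # → 286
print(percent_identity(mir_a["seq"][1:8], mir_b["seq"][1:8]))  # seed similarity → 100.0
```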
Fig. 2. Workflow. miRNA pairs are analyzed by both the pattern recognition and the knowledge discovery strategy.
3.3 Workflow
As shown in Fig. 2, we use two strategies to discover miRNA-miRNA relationships. In the pattern recognition strategy, different classifiers are applied in order to discriminate positive from negative miRNA pairs, and the performance of each classifier is then evaluated by cross-validation. In the knowledge discovery strategy, rules are first discovered by three methods based on decision tree and relational subgroup discovery techniques. By combining
the results, the optimized rules describing functionally alike miRNAs are generated, which are then used for the final target prediction and validation.
Pattern recognition: In this strategy, the first step is feature reduction. Features are selected by the sequential backward elimination algorithm and extracted by principal component analysis. Sequential forward selection, which adds new features to a feature set one at a time until the final feature set is reached (Webb, 2002), is simple and fast; the reason it is not applied in our experiment is the limitation that selected features cannot be removed from the feature set once they have been added, which can lead to a local optimum. After dimension reduction, classification is performed by both linear and quadratic classifiers. Finally, the performance is examined by 5-fold cross-validation with 10 repetitions. This part was implemented with PRTools (van der Heijden et al., 2004), a toolbox for the MATLAB platform.
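Sequential backward elimination as used in this step can be sketched generically. The scoring function below is a toy stand-in for cross-validated classifier accuracy, and the feature weights are invented:

```python
# Sequential backward elimination: start from the full feature set and
# repeatedly drop the feature whose removal hurts the score least.
def backward_select(features, score, keep):
    """`score(subset)` would be cross-validated accuracy in practice."""
    selected = list(features)
    while len(selected) > keep:
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = max(candidates, key=score)   # best-performing reduced set
    return selected

# Toy score: 'sequence' adds almost nothing, mirroring its redundancy with
# the seed and non-seed similarities (the weights are invented).
weights = {"distance": 0.30, "seed": 0.25, "non_seed": 0.20, "sequence": 0.01}
score = lambda subset: sum(weights[f] for f in subset)
print(backward_select(list(weights), score, keep=3))   # → ['distance', 'seed', 'non_seed']
```

Unlike forward selection, a feature dropped here can never re-enter, but every removal is chosen against the performance of the whole remaining subset.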
Fig. 3. Detailed experimental design in the rule generation stage. Three methods are applied: Decision tree, Category RSD and Binary RSD. In Category RSD, the datasets are first categorized into groups; subsequently, data with two feature sets, with and without overall sequence similarity, are used as input to the RSD algorithm. In Binary RSD, feature values are binarized using a decision tree. Because the data are sampled 10 times, the cut-offs are then established using max coverage (Max Cov), median and max density (Max Den). Finally, RSD is applied under all 3 conditions in order to find the feature cut-offs which lead to the most significant rule sets.
Knowledge discovery: In pattern recognition, the miRNA pairs are classified through elaborate statistical models; in knowledge discovery, by contrast, data patterns are described in a way that allows us to increase our knowledge of the data. This can promote our understanding of functionally similar miRNAs, and integrating this knowledge can ultimately improve target prediction. In this strategy there are three phases: rule generation, illustrated in the dashed frame of Fig. 2, target prediction and validation. In the first phase, rules are discovered using decision trees and relational subgroup discovery. With the aim of discovering the most significant rules, different data structures and feature thresholds are evaluated and compared. Details are explained in the following sections, and an overview of this methodology is shown in Fig. 3.
Decision tree learning is utilized as a first step in order to build a classifier discriminating the two classes of miRNA pairs. In our experiments we used the decision tree from the Weka software platform (Witten & Frank, 1999); the features were tested using the J48 classifier and evaluated by 10-fold cross-validation. Because not all determinant features are known at this stage, we are interested in finding rules for subgroups of functionally similar miRNAs with respect to our predefined features. For this we used the propositionalization-based relational subgroup discovery algorithm (Zelezny & Lavrac, 2006). We prefer rules that cover only positive pairs and display high coverage. Consequently, repetitive rules are selected if their E-value is greater than 0.01 and, at the same time, their significance is above 10. Both Category RSD and Binary RSD reveal feature patterns by utilizing the relational subgroup discovery algorithm; the main difference is that the former analyzes the data in a categorized format, whereas in the latter the data are transformed to a binary form.
Fig. 4. Density plot for the four features. The plots of distance and seed similarity match a bimodal distribution, indicating two main groups in each feature; it is, however, not straightforward to judge the sequence and non-seed similarity distributions.
As a pilot experiment for RSD, the data are first categorized as follows. The similarity percentage is evenly divided into 5 groups: very low (0-20%], low (20-40%], medium (40-60%], high (60-80%] and very high (80-100%]. Distance is categorized into 5 regions: 0-1 kb¹, 1-10 kb, 10-100 kb, 100 kb-end and undef (if the paired miRNAs are located on different chromosomes). Two
¹ The unit of distance on a genome is the base pair, abbreviated 'b'; kb = kilo base pairs.
relational input tables, one with and one without the overall sequence similarity feature, are constructed and tested, with the purpose of verifying whether the sequence has a global effect or only contributes as the combination of its seed and non-seed parts. From the density graphs of the features depicted in Fig. 4, we concluded that the distance and seed similarity densities match a bimodal distribution. The same conclusion can, however, not easily be drawn for the overall and non-seed sequence similarities. Therefore, in this method, we apply a decision tree algorithm to binarize the 4 feature values. Each feature is treated individually and only the root classifier value of the tree is used for establishing the cut-off. After that, binary tables are generated according to three criteria:
Maximum coverage, the value that covers the most positive pairs: Max coverage (distance, sequence, seed, non-seed) = 8947013 b, 56.5%, 71.4%, 53.3%.
Median: Median (distance, sequence, seed, non-seed) = 3679 b, 65.2%, 71.4%, 60.65%.
Maximum density, the region with the highest positive pair density: Max density (distance, sequence, seed, non-seed) = 3679 b, 69.6%, 75%, 64.7%.
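The per-feature cut-off taken from the root of a decision tree corresponds to the single split that maximizes information gain. A minimal sketch with invented similarity values:

```python
# Root split of a decision stump: the single cut-off maximizing information
# gain, mimicking how each feature is binarized here.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_cutoff(values, labels):
    """Return (threshold, gain) for the split `value <= threshold`."""
    base, best = entropy(labels), (None, 0.0)
    for t in sorted(set(values))[:-1]:          # candidate thresholds
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

# Invented seed-similarity values (%) for six pairs:
vals = [90, 85, 75, 60, 50, 40]
labs = ["pos", "pos", "pos", "neg", "neg", "neg"]
print(best_cutoff(vals, labs))   # → (60, 1.0), a perfect split
```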
4. Results
4.1 Classification
After application of sequential backward feature selection, genomic distance, seed similarity and non-seed similarity are selected as the top 3 informative features. Sequence similarity is the least informative feature because it is highly correlated with the seed and non-seed similarities. Scatter plots of the two classes of miRNA pairs in the selected feature space are depicted in Fig. 5. As can be seen in the four sub-graphs of Fig. 5, the majority of positive and negative miRNA pairs overlap, which is an indication of the complexity of the classification. The distribution of the negative class is more compact, with the majority of this class located in the area of low non-seed similarity.
4.2 Rules
Table 1 lists the rules generated by Category RSD. Rule set 2.1 (several cells were lost in reproduction and are marked …):

Rule                                Significance
Seed>80%                            26.7
Dis<…                               14.3
…                                   14.1
… Seed>80% & Nonseed=(60%,80%]      12.6
Dis=(1 kb,10 kb]                    11
Rule set 2.2 (+Overall sequence: YES):

Label  Rule                                  Significance
A      Seed>80%                              26.7
A.1    Dis=undef & Seed>80%                  14.3
B      Dis<1 kb                              14.1
B.1    Dis<1 kb & Seed>80% & Seq=(60%,80%]   11.2
C      Dis=(1 kb,10 kb]                      11

Table 1. Category RSD results. Rules generated from two data structures: one considering only seed and non-seed similarities and distance, and one additionally considering the overall sequence similarity.
Table 2 shows the rules generated by Binary RSD, using three cut-off criteria: Max coverage (a), Median (b) and Max density (c). As can be seen, the three rule sets have similar structures but different feature cut-offs, which lead to different significances. The main feature groups derived using the max coverage, median and max density criteria respectively are Seed>71.4% (A) and Dis<8947013 b (B) in rule set 3.1; Seed>71.4% (A), Dis<3679 b (B) and Seq>65.2% (C) in rule set 3.2; and Seed>75% (A), Dis<3679 b (B) and Seq>69.6% (C) in rule set 3.3. The others are subsets of these groups.
(a) Max coverage, rule set 3.1:

Label  Rule                                      Significance
A.1    Seed>71.4% & Seq>56.5%                    30
A      Seed>71.4%                                27.2
A.2    Nonseed>53.3% & Seed>71.4% & Seq>56.5%    21.6
B      Dis<8947013 b                             19.8
A.3    Nonseed>53.3% & Seed>71.4%                18.2
A.4    Dis>8947013 b & Seed>71.4% & Seq>56.5%    13.5
A.5    Dis>8947013 b & Seed>71.4%                12.3

(b) Median, rule set 3.2 (cells lost in reproduction are marked …):

Label  Rule                                      Significance
A      Seed>71.4%                                27.2
A.1    Seed>71.4% & Seq>65.2%                    23.3
B      Dis<3679 b                                23.3
B.1    Dis<3679 b & …                            15.9
A.2    Nonseed>60.65% & Seed>71.4% & Seq>65.2%   14.9
A.3    Nonseed>60.65% & Seed>71.4%               13.7
A.4    …                                         13.7
C.1    Nonseed>60.65% & Seq>65.2%                13.7
C      Seq>65.2%                                 12.2

(c) Max density, rule set 3.3 (cells lost in reproduction are marked …):

Label    Rule                                    Significance
A        Seed>75%                                26.7
B        Dis<3679 b                              23.3
A.1      Seed>75% & Seq>69.6%                    20.8
C        Seq>69.6%                               20.8
B.1      Dis<3679 b & Seed>75% & Seq>69.6%       18
B.2/C.1  …                                       14.1
A.2      …                                       11.5
A.3/C.2  …                                       11
C.3      Nonseed>64.7% & Seq>69.6%               11
Table 2. Binary RSD results. Rules generated from the 3 parameter sets are shown in the order Max coverage (a), Median (b) and Max density (c).
Furthermore, rules with similar features but different feature values are compared, and the decision on the final cut-off is based on the value which results in the highest significance. Therefore the final optimized rules are:
Rule 1: IF the distance between two miRNAs < 3679 b,
Rule 2: IF the seed similarity between two miRNAs > 71.4%,
Rule 3: IF the sequence similarity between two miRNAs > 69.6% THEN they bind the same target.
To evaluate our methods, a permutation test is performed as a reference: the learning procedure is repeated for each training set with the labels randomly shuffled. Using max coverage as the cut-off criterion, all resulting rules have a maximum significance lower than 8. This test demonstrates that the rules derived from the original data are substantially more significant than in the random situation.
4.3 Target prediction
We applied the above rules to search for miRNAs which serve functions similar to those of the known miRNAs. Rules 1, 2 and 3 discovered 75, 655 and 150 miRNA pairs respectively in their subgroups, which greatly extends our previous findings (Zhang et al., 2008) based on a similar methodology. Among them, the 23 predicted miRNA pairs that are covered by all 3 rules were selected for further validation: this group contains relatively few pairs, which are easy to validate, and since these pairs satisfy more constraints they are considered more reliable. On further inspection of these 23 miRNA pairs, we found that they consist of 3 confirmed pairs in which both miRNAs are well studied, 15 pairs with both members from the same family, which are therefore supposed to have the same targets, and 5 new pairs which have one well-studied miRNA and one functionally unknown partner. We therefore infer the targets for these 5 unknown miRNAs, hsa-miR-18a/18b/20b/212/200c, from their known partners. The predicted targets are listed in Table 3. Informatic validation is performed to check the consistency of the predictions with the existing methods. Table 3 shows the validation for the 3 confirmed and 5 predicted miRNA pairs. The miRNAs with confirmed targets are indicated in italics, while the miRNAs in boldface are the unknown ones for which the targets are predicted from their known partners.
All of their targets are validated by examining whether they are predicted by TargetScan, miRanda, Pictar, miTarget and RNAhybrid. For example, the table can be read as follows: whether the target (BCL2) is predicted by an existing method (TargetScan) for m1 (hsa-miR-15a) or m2 (hsa-miR-16). Consequently, we discover that among our predictions, Retinoblastoma 1 (RB1) for hsa-miR-20b is predicted by TargetScan and Pictar; Circadian Locomotor Output Cycles Kaput (Clock) for hsa-miR-200c is captured by miRanda; Rho GTPase activating protein (RICS) for hsa-miR-212 is detected by Pictar; and E2F transcription factor 1 (E2F1) and AIB1 for hsa-miR-18a are identified by miTarget.
| miRNA1 (m1) | miRNA2 (m2) | Target | TargetScan (m1/m2) | miRanda (m1/m2) | Pictar (m1/m2) | miTarget (m1/m2) | RNAhybrid mfe, kcal/mol (m1/m2) |
|---|---|---|---|---|---|---|---|
| hsa-miR-15a | hsa-miR-16 | BCL2 | √/√ | ×/× | √/√ | ×/√ | -24.3/-24.1 |
| hsa-miR-17 | hsa-miR-20a | E2F1 | √/√ | √/× | √/√ | √/√ | -26.8/-24.6 |
| hsa-miR-221 | hsa-miR-222 | KIT | √/√ | ×/× | ×/× | √/√ | -24.9/-26.4 |
| hsa-miR-17 | hsa-miR-18a | E2F1 | √/× | √/× | √/× | √/√ | -26.8/-26.8 |
| hsa-miR-17 | hsa-miR-18a | AIB1 | -/- | -/- | -/- | √/√ | -26.3/-26.6 |
| hsa-miR-106a | hsa-miR-18b | RB1 | √/× | ×/× | √/× | ×/× | -23.2/-28.3 |
| hsa-miR-106a | hsa-miR-20b | RB1 | √/√ | ×/× | √/√ | ×/× | -23.2/-27.2 |
| hsa-miR-132 | hsa-miR-212 | RICS | ×/× | -/- | √/√ | -/- | -/- |
| hsa-miR-141 | hsa-miR-200c | Clock | ×/× | √/√ | ×/× | √/× | -22.1/-20.1 |
Table 3. Informatic validation of confirmed and predicted miRNA pairs. miRNA1 and miRNA2 are the partners in one pair. Target column shows the validated targets for the known miRNAs (in italic) and the predicted targets for the unknown miRNAs (in boldface). m1 and m2 columns denote whether the targets are predicted by the existing methods for miRNA1 (m1) and miRNA2 (m2) respectively.
5. Conclusions and discussion

Machine learning is widely used in commercial businesses where vast amounts of data are produced. The life sciences, molecular-oriented research in particular, is a rapidly growing field which has gained a lot of attention lately, especially now that the genomes of the major research model species have been sequenced and are publicly available. With the development of more and more large-scale and advanced techniques in biology, the need to discover hidden information has triggered the application of machine learning in the life sciences. But these applications bear a risk, since, first of all, most biological mechanisms are not yet fully understood, and second, some techniques produce too little experimental data due to their limitations, making machine learning unreliable. In this chapter, we explained how we integrated different machine learning algorithms and tuned and optimized experimental setups for a growing but not yet mature research field, miRNA target prediction. The innovation of this approach lies not only in the integration and optimization of machine learning algorithms, but also in prediction through new features of miRNA relationships instead of the widely studied features of miRNA-target interaction. Existing methods of analysis have been shown to be insufficient in identifying targets from this perspective. As illustrated in the methods and results sections, pattern recognition generates models enabling class descriptions. In this case, a rather high misclassification error of around 30% surfaces. In contrast, subgroup discovery aims at discovering statistically unusual patterns of interesting classes (Zelezny & Lavrac, 2006). It discovers three main groups describing only the positive miRNA pairs. One of the disadvantages of the pattern recognition method is that the model is not biologically interpretable.
Consisting of linear or quadratic transformations of features, the classifiers tell nothing about the mechanisms of miRNA-target binding. However, decision tree and
relational subgroup discovery are descriptive induction algorithms which discover patterns in the form of rules. With these discovered rules, we gain knowledge about miRNA-target interaction which can, subsequently, be used to predict more targets. We compared two main algorithmic approaches used in knowledge discovery. Given that not all targets and useful features are known in advance, the classification of miRNA data using decision trees is not recommended. However, relational subgroup discovery, an advanced subgroup discovery algorithm, has been shown to be suitable for this application domain, since it can discover the rules for subgroups of functionally similar miRNAs with respect to our predefined features. During the rule mining, we also noticed that feature threshold optimization is a crucial procedure which helps reveal the significant rules. We have established that distance, seed and sequence similarities are determinants. The question is whether this makes sense from a biological point of view. It has been reported that many miRNAs appear in clusters on a single polycistronic transcript (Tanzer & Stadler, 2004). They are transcribed together in a long primary transcript, yielding one or more hairpin precursors, and are finally cut into multiple mature miRNAs. Tanzer et al. reported that the human mir-17 cluster contains six precursor miRNAs (mir-17/ 18/ 19a/ 20/ 19b-1/ 92-1) within a region of about 1 kb on chromosome 13 (Tanzer & Stadler, 2004). These observations are consistent with the feature embedded in Rule 1 (cf. Section 4.2). Besides the fact that clustered miRNAs can be transcribed together, we further showed that miRNAs that are in close proximity to each other can bind to the same target and thereby serve as regulators for the same goal. In this study, we showed that genomic location also contributes to miRNA target identification. As for seed similarity, Rule 2 (cf.
Section 4.2) describes that miRNAs with seed similarity above 71.4% share the same targets. This means that only a perfect match or one mismatch in the seed is allowed in the process of binding the same targets. This is consistent with the idea that the seed is a specific region which, in particular, requires a nearly perfect match with the target (Karginov et al., 2007). Moreover, TargetScanS also only requires a 6-nt seed match comprising nucleotides 2-7 of the miRNA. Thus, the rule requiring at least 6 out of 7 nucleotides to be similar in the seed region can be considered reasonable. Overall sequence similarity is also a predictor, but not as decisive as seed and genomic distance. This means that not only the seed region is important; sometimes two miRNAs with generally similar sequences can also bind to the same target. This is consistent with the finding that some miRNA-target bindings have a mismatch or wobble in the 5' seed region but compensate through excellent complementarity at the 3' end, which leads to high average sequence complementarity (Maziere & Enright, 2007). In order to support our findings, we validated the results using the five existing algorithms presented in Table 3. Not all of the predicted targets are identified by TargetScan, miRanda, Pictar, miTarget and RNAhybrid, but the same holds for the known targets. Most of the candidates are predicted by at least one of these methods. Both miTarget and our method are based on machine learning techniques; miTarget uses a support vector machine and considers sequence and structure features of miRNA-target duplexes, whereas we focus the integration of several machine learning algorithms on the genomic location and sequence features between miRNAs. Moreover, we noticed that miRanda has a relatively low performance for target prediction in humans. This may be due to the fact that miRanda was initially developed to predict miRNA targets in Drosophila melanogaster, and later
adapted to vertebrate genomes (Enright et al., 2003). In the application of the RNAhybrid tool, a predefined threshold for the normalized minimum free energy (mfe) is lacking; we therefore decided to list the original values. We found that most of our predicted miRNA-target duplexes are more stable than the known ones, as illustrated by their lower minimum free energy. In addition to these encouraging results, we also noticed that only groups of miRNA relationships are discovered by our method. Some miRNAs which are located far apart and whose seed similarity is low still have the same target. This indicates that besides genomic distance, seed and sequence similarities, more features need to be included in order to find more and better patterns shared by functionally alike miRNAs. Grimson et al. uncovered five general features of target site context beyond seed pairing that boost site efficacy (Grimson et al., 2007). In future research we will explore the site context in the miRNA relationship analysis. Additionally, we also consider taking miRNA co-expression patterns into account. In summary, we conclude that genomic distance, seed and sequence similarities are the determinants for describing the relationships of functionally similar miRNAs. Our method is complementary to the approaches that are currently used. It contributes to the improvement of target identification by predicting targets with high specificity. Moreover, it does not require conservation information for classification, so it is free from the limitations of some of the existing methods. In future research, with more biologically validated targets and features available, more rules can be generated from a larger dataset, and consequently more targets can be identified for the functionally unknown miRNAs. The methodology can be transferred to a broad range of other species as well.
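The seed-similarity criterion of Rule 2 can be made concrete in a few lines. The exact seed window used by the chapter is not specified in full, so the choice of nucleotides 2-8 (0-based indices 1..7) and the helper names below are our assumptions for illustration:

```python
def seed_similarity(mirna_a, mirna_b):
    """Fraction of identical positions in the 7-nt seed region
    (assumed here to be nucleotides 2-8 of each miRNA)."""
    seed_a, seed_b = mirna_a[1:8], mirna_b[1:8]
    matches = sum(1 for x, y in zip(seed_a, seed_b) if x == y)
    return matches / 7

def rule2_same_target(mirna_a, mirna_b):
    # Rule 2: seed similarity strictly above 71.4% (i.e. > 5/7),
    # which allows a perfect match or exactly one seed mismatch.
    return seed_similarity(mirna_a, mirna_b) > 5 / 7
```

With one seed mismatch the similarity is 6/7 ≈ 85.7% and the rule fires; with two mismatches it drops to exactly 5/7 = 71.4% and the rule does not, matching the "perfect match or one mismatch" reading in the text.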
6. Acknowledgements

We would like to thank Dr. Erno Vreugdenhil for discussing some biological implications of the results and Peter van de Putten for suggestions on the use of WEKA. This research has been partially supported by the BioRange program of the Netherlands BioInformatics Centre (BSIK grant).
7. References

Bartel, D. P. (2004). MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell, Vol. 116, No. 2, 281-297
Bentwich, I. (2005). Prediction and validation of microRNAs and their targets. FEBS Lett, Vol. 579, No. 26, 5904-5910
Brennecke, J., Hipfner, D. R., Stark, A., Russell, R. B., & Cohen, S. M. (2003). bantam encodes a developmentally regulated microRNA that controls cell proliferation and regulates the proapoptotic gene hid in Drosophila. Cell, Vol. 113, No. 1, 25-36
Chen, C. Z. (2005). MicroRNAs as oncogenes and tumor suppressors. N Engl J Med, Vol. 353, No. 17, 1768-1771
Enright, A. J. & Griffiths-Jones, S. (2007). miRBase: a database of microRNA sequences, targets and nomenclature. In: microRNAs: From Basic Science to Disease Biology, K. Appasani, S. Altman, & V. R. Ambros (Eds.), pp. 157-171, Cambridge University Press
Enright, A. J., John, B., Gaul, U., Tuschl, T., Sander, C., & Marks, D. S. (2003). MicroRNA targets in Drosophila. Genome Biol, Vol. 5, No. 1
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recogn. Lett., Vol. 27, 861-874
Griffiths-Jones, S. (2004). The microRNA Registry. Nucleic Acids Res, Vol. 32
Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A., & Enright, A. J. (2006). miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res, Vol. 34
Grimson, A., Farh, K. K.-H., Johnston, W. K., Garrett-Engele, P., Lim, L. P., & Bartel, D. P. (2007). MicroRNA Targeting Specificity in Mammals: Determinants beyond Seed Pairing. Mol Cell, Vol. 27, No. 1, 91-105
Grun, D. & Rajewsky, N. (2007). Computational prediction of microRNA targets in vertebrates, fruitflies and nematodes. In: microRNAs: From Basic Science to Disease Biology, K. Appasani, S. Altman, & V. R. Ambros (Eds.), pp. 172-186, Cambridge University Press
He, L., Thomson, M. M., Hemann, M. T., Hernando-Monge, E., Mu, D., Goodson, S. et al. (2005). A microRNA polycistron as a potential human oncogene. Nature, Vol. 435, No. 7043, 828-833
Johnson, S. M., Grosshans, H., Shingara, J., Byrom, M., Jarvis, R., Cheng, A. et al. (2005). RAS is regulated by the let-7 microRNA family. Cell, Vol. 120, No. 5, 635-647
Karginov, F. V., Conaco, C., Xuan, Z., Schmidt, B. H., Parker, J. S., Mandel, G. et al. (2007). A biochemical approach to identifying microRNA targets. Proceedings of the National Academy of Sciences, 19291-19296
Kim, S. K., Nam, J. W., Rhee, J. K., Lee, W. J., & Zhang, B. T. (2006). miTarget: microRNA target-gene prediction using a Support Vector Machine. BMC Bioinformatics, Vol. 7
Krek, A., Grun, D., Poy, M. N., Wolf, R., Rosenberg, L., Epstein, E. J. et al. (2005). Combinatorial microRNA target predictions. Nature Genetics, Vol. 37, No. 5, 495-500
Lavrac, N., Zelezny, F., & Flach, P. A. (2003). RSD: Relational Subgroup Discovery through First-Order Feature Construction. In: Proceedings of the 12th International Conference on Inductive Logic Programming, pp. 149-165, Springer-Verlag
Lecellier, C. H., Dunoyer, P., Arar, K., Lehmann-Che, J., Eyquem, S., Himber, C. et al. (2005). A cellular microRNA mediates antiviral defense in human cells. Science, Vol. 308, No. 5721, 795-825
Lee, R. C., Feinbaum, R. L., & Ambros, V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell, Vol. 75, No. 5, 843-854
Lewis, B. P., Burge, C. B., & Bartel, D. P. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell, Vol. 120, No. 1, 15-20
Lewis, B. P., Shih, I. H., Jones-Rhoades, M. W., Bartel, D. P., & Burge, C. B. (2003). Prediction of mammalian microRNA targets. Cell, Vol. 115, No. 7, 787-798
Martin, G. (2007). Prediction and validation of microRNA targets in animal genomes. J Biosci, Vol. 32, No. 6, 1049-1052
Maziere, P. & Enright, A. J. (2007). Prediction of microRNA targets. Drug Discov Today, Vol. 12, No. 11-12, 452-458
Papadopoulos, G. L., Reczko, M., Simossis, V. A., Sethupathy, P., & Hatzigeorgiou, A. G. (2008). The database of experimentally supported targets: a functional update of TarBase. Nucleic Acids Research, Vol. 37, No. Database issue, D155-D158
Rehmsmeier, M., Steffen, P., Hochsmann, M., & Giegerich, R. (2004). Fast and effective prediction of microRNA/target duplexes. RNA, Vol. 10, No. 10, 1507-1517
Reinhart, B. J., Slack, F. J., Basson, M., Pasquinelli, A. E., Bettinger, J. C., Rougvie, A. E. et al. (2000). The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature, Vol. 403, No. 6772, 901-906
Sankoff, D. & Kruskal, J. (1999). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Center for the Study of Language and Information
Sethupathy, P., Corda, B., & Hatzigeorgiou, A. G. (2006a). TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA, Vol. 12, No. 2, 192-197
Sethupathy, P., Megraw, M., & Hatzigeorgiou, A. G. (2006b). A guide through present computational approaches for the identification of mammalian microRNA targets. Nature Methods, Vol. 3, No. 11, 881-886
Sethupathy, P., Megraw, M., & Hatzigeorgiou, A. G. (2007). Computational approaches to elucidate miRNA biology. In: microRNAs: From Basic Science to Disease Biology, K. Appasani, S. Altman, & V. R. Ambros (Eds.), pp. 188-198, Cambridge University Press
Tanzer, A. & Stadler, P. F. (2004). Molecular evolution of a microRNA cluster. J Mol Biol, Vol. 339, No. 2, 327-335
van der Heijden, F., Duin, R., de Ridder, D., & Tax, D. M. J. (2004). Classification, Parameter Estimation and State Estimation: An Engineering Approach Using MATLAB. John Wiley & Sons
Webb, A. R. (2002). Statistical Pattern Recognition. John Wiley and Sons Ltd.
Witten, I. H. & Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann
Wuchty, S., Fontana, W., Hofacker, I. L., & Schuster, P. (1999). Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, Vol. 49, 145-165
Zelezny, F. & Lavrac, N. (2006). Propositionalization-based relational subgroup discovery with RSD. Machine Learning, Vol. 62, No. 1-2, 33-63
Zhang, Y., de Bruin, J. S., & Verbeek, F. J. (2008). miRNA target prediction through mining of miRNA relationships. BioInformatics and BioEngineering, 1-6
Zhang, Y., Woltering, J. M., & Verbeek, F. J. (2007). Screen of MicroRNA Targets in Zebrafish Using Heterogeneous Data Sources: A Case Study for Dre-miR-10 and Dre-miR-196. International Journal of Mathematical, Physical and Engineering Sciences, Vol. 2, No. 1, 10-18
21
Extraction Of Meaningful Rules In A Medical Database

Sang C. Suh, Nagendra B. Pabbisetty and Sri G. Anaparthi
Texas A&M University - Commerce, U.S.A.
1. Introduction

Data mining has become a prominent approach in recent years for generating rules from databases, concentrating on producing valuable information for decision making. Clustering, in data mining, is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. Traditional clustering algorithms, which use distances between points to define cluster boundaries, do not apply to Boolean and categorical attributes. Clustering is applied to unstructured data to extract knowledge that may take the form of predictive rules. Clustering enhances the value of existing databases by revealing rules in the data. These rules are useful for understanding trends, making predictions of future events from historical data, or synthesizing data records into meaningful clusters. Through clustering, similar data items are grouped together to form clusters. Clustering algorithms usually employ a distance-based (e.g., Euclidean) similarity measure in order to partition the database such that data points in the same partition are more similar than points in different partitions. In this chapter, we study clustering algorithms for data with categorical attributes. Instead of using traditional clustering algorithms based on distances between points, which are not an appropriate concept for Boolean and categorical attributes, we propose the novel concept of HAC (Hierarchy of Attributes and Concepts) to measure the similarity/proximity between a pair of data points. Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, the output is a binary tree whose leaves are the data points and whose internal nodes represent nested clusters of various sizes.
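Since Euclidean distance is not meaningful for Boolean and categorical attributes, a common baseline alternative (shown here only as background; it is not the HAC measure developed in this chapter) is the simple matching similarity, the fraction of attributes on which two records agree:

```python
def simple_matching_similarity(a, b):
    """Fraction of attribute positions on which two categorical
    records agree; 1.0 means identical, 0.0 means no agreement."""
    if len(a) != len(b):
        raise ValueError("records must share the same attribute set")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

# Two hypothetical patient records over categorical attributes.
r1 = ("female", "allergy", "penicillin")
r2 = ("female", "allergy", "latex")
print(simple_matching_similarity(r1, r2))  # 2/3 of the attributes agree
```

Measures of this kind operate attribute-by-attribute rather than in a geometric space, which is why they fit categorical data where Euclidean distance does not.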
The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data. Hierarchical structures are ubiquitous in the natural world. The idea of hierarchical (also known as agglomerative) clustering is to begin with each point from the input as a separate cluster. We then build clusters by repeatedly merging the two clusters that are closest to each other, starting from the initial points. This gives a hierarchy of containment, as each point in the input belongs to a succession of larger clusters. If we keep merging, we end up with a single cluster that contains all points, and the structure of the hierarchy can be represented as a
(binary) tree. From this tree we can extract a set of k clusters, for example by terminating the merging when only k clusters remain, or when the closest pair of clusters is at a distance exceeding some threshold. The crucial part of this algorithm is to define a metric to measure the distance between two clusters of multiple points. Common mathematical definitions are: the smallest distance between a point in one cluster and a point in another; the greatest distance between such points; or the average distance. Each definition has its own advantages and disadvantages. This research focuses on hierarchical conceptual clustering in structured, discrete-valued databases. By structured data, we refer to information consisting of data points and relationships between the data points. This differs from a definition of unstructured data as containing free text and of structured data as containing feature vectors. Conceptual clustering is an important way of summarizing and explaining data [1, 6]. However, the recent formulation of this paradigm has allowed little exploration of conceptual clustering as a means of improving performance. Furthermore, previous work in conceptual clustering has not explicitly dealt with constraints imposed by real-world environments. This chapter presents clustering using HAC (Hierarchy of Attributes and Concepts), a hierarchical conceptual clustering system that organizes data so as to maximize inference ability. This algorithm uses both hierarchical and conceptual clustering methods to implement clustering by discovering substructures in the database which compress the original data and represent structural concepts in the data. Once a substructure is discovered, it is used to simplify the data by replacing instances of the substructure with a pointer to the substructure definition. The discovered substructures allow abstraction over detailed structures in the original data.
Iteration of the substructure discovery process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific data analysis goals. An important property of conceptual clustering is that it can enhance the value of existing databases by revealing patterns in the data. These patterns may be useful for understanding trends, for making predictions of future occurrences from historical evidence, or for synthesizing data records into meaningful clusters. A conceptual clustering system accepts a set of object descriptions (events, observations, facts) and produces a classification scheme over the observations. These systems use an evaluation function to determine classes with "good" conceptual descriptions. Learning of this kind is referred to as learning from observation (as opposed to learning from examples). Typically, conceptual clustering systems assume that the observations are available indefinitely so that batch processing is possible using all observations. In this study, HAC will be used as an aid to represent medical domain knowledge substructures to simplify the generation process of the databases through clustering. As a result, the research will identify interesting relationships and patterns among the data, and represent them in the form of association rules.
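The bottom-up merging procedure and the three inter-cluster distance definitions discussed in this introduction (smallest, greatest, and average pairwise distance) can be sketched as follows; this is a didactic O(n^3) version for numeric points, not the categorical HAC measure developed later in the chapter:

```python
def cluster_distance(c1, c2, dist, linkage="single"):
    pairs = [dist(p, q) for p in c1 for q in c2]
    if linkage == "single":    # smallest pairwise distance
        return min(pairs)
    if linkage == "complete":  # greatest pairwise distance
        return max(pairs)
    return sum(pairs) / len(pairs)  # average pairwise distance

def agglomerate(points, dist, k=1, linkage="single"):
    """Start with each point as its own cluster and repeatedly merge
    the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], dist, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Terminating when k clusters remain corresponds to one of the stopping criteria mentioned above; a distance-threshold stop could be added with one extra comparison in the loop.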
2. Related Work

Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one
of the steps in exploratory data analysis. However, clustering is a difficult combinatorial problem, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur [2]. Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, the output is a binary tree (dendrogram) whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data. Hierarchical structures are ubiquitous in the natural world [5]. There are two general approaches to hierarchical clustering: top-down and bottom-up. The top-down approach starts with a cluster containing all points, which is recursively split until enough subclusters have been created. Heckel et al. [10, 11] use this approach on vector fields where each point has a position and a vector. The cluster with the highest error value is split into two clusters recursively so that the error values of the remaining clusters decrease with each split. The bottom-up approach starts with all points as individual clusters and merges the two clusters with the least difference, until one big cluster has been formed from all clusters. Telea and van Wijk [10] use this approach to simplify complex vector fields using an elliptic similarity function. They merge the pair of vectors with the least position, magnitude and direction differences until all vectors have been merged into one single vector. Conceptual clustering is used to summarize the result. It enhances the value of existing databases by revealing patterns in the data. These patterns may be useful for understanding trends, for making predictions of future occurrences from historical evidence, or for synthesizing data records into meaningful clusters.
Hybrid conceptual clustering [3] handles both incremental and non-incremental clustering problems successfully, but it is computationally expensive and, moreover, can only be applied to small data sets. In past decades, many conceptual clustering algorithms have been proposed which can automatically acquire knowledge or concepts from large amounts of information obtained through experience or observation [1, 7, 8, 9]. Concepts in COBWEB are represented by probabilistic expressions and are acquired by using four learning operators and an evaluation function called category utility. However, the category utility used in the original COBWEB has a bias toward larger classes in the concept hierarchy. This bias produces some spurious intermediate nodes in the concept hierarchy (classification tree). These nodes make the tree deeper and more complex, so the concepts within the nodes of the tree cannot be understood easily [9]. This chapter presents an efficient non-metric measure called HAC (Hierarchy of Attributes and Concepts) for clustering of categorical as well as non-categorical (quantitative) data, through which the proximity and relationships between data items can be identified.
3. Information

3.1 Hierarchy of Attributes and Concepts
This chapter introduces a hierarchical description of concepts by attributes; the mathematical formalization presents the concepts as matrices whose columns represent terms constructed from attributes. A spherical model is developed to present a vocabulary
(spanned space) of concepts. At the beginning of this section the following definitions are introduced.

Attribute: a basic characteristic or feature of a term.
Term: a set of connected attributes.
Concept: a language-independent meaning associated with at least one term, or a set of terms.
Vocabulary: a set of terms and concepts.

HAC is both a hierarchical and a conceptual clustering system that organizes data to maximize inference ability. This algorithm implements clustering by discovering substructures in the database which compress the original data and represent structural concepts in the data. Once a substructure is discovered, it is used to simplify the data by replacing instances of the substructure with a pointer to the substructure definition. The discovered substructures allow abstraction over detailed structures in the original data. Iteration of the substructure discovery process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific data analysis goals. HAC accepts a database of structured data (concepts) as input. This type of data is naturally represented using a graph or diagrammatical form. The graph representation includes labeled vertices with vertex ID (identification) numbers, attributes on the X-axis, and downward directed edges (see Fig. 1). Each vertex represents a concept, and the value of that concept is given by the directed edges (usually mapping to an attribute's value on the X-axis or to some other concept).
Fig. 1. The Graphical representation of HAC
| ID | Disease Name | Cause IDs |
|---|---|---|
| 1 | Acute sinusitis | K1, K2, K3, K4 |
| 2 | Anaphylaxis | K5, K6, K7, K8, K9, K10, K11, K12, K13, K14, K15, K16 |
| 3 | Penicillin allergy | K6 |
| 4 | Latex allergy | K17, K18 |
| 5 | Peanut allergy | K6, K19, K20, K21, K22, K23, K24, K25, K26, K27, K28, K29, K30, K31 |
| 6 | Mold allergy | K32 |
| 7 | Nickel allergy | K21, K29, K33, K34, K35, K36, K37, K38, K39 |
| 8 | Dust mite allergy | K40 |
| 9 | Aspergillosis | K29, K41, K42, K4, K44 |
| 10 | Soy allergy | K45, K46 |
| 11 | Shellfish allergy | K47, K48, K49, K50, K51, K52, K53, K54, K55, K56 |
| 12 | Wheat allergy | K57, K58, K59, K60 |
| … | … | … |
| 20 | Food allergy | K6, K7, K8, K9, K10, K11, K23, K24, K45, K46, K47, K4, K49, K50, K51, K52, K53, K54, K56, K57, K58, K59, K60, K62, K63, K64, K65, K66, K67, K68, K88, K89, K90, K91, K92, K93 |

Table 1. The HAC attribute table based on the attribute allergies.

Using the graph, we can create a concept table for every attribute in the database; this table contains all the concepts and the values that are relevant to that attribute. Ultimately, with the help of these concept tables, we obtain a table in which each record has an attribute name, its value, and its relationships with other concepts. Each attribute in such a table represents a cluster. For example, Table 1, which is formed from a set of documents within the medical domain, consists of three columns. The second column specifies the attribute name and the third column specifies the attribute values. From Table 1 we form another table called the concept table (Table 2). Every row in Table 2 consists of three fields. The first field shows the concept name, whereas the second and third fields represent the concept value and the concept attributes. For example, the concept C1, whose value is Dairy, is formed from the attributes F1, F2, F12, F17, F21, F36, F54, which represent various food IDs. Recall that HAC can be represented as a closed diagrammatical entity, as shown in Fig. 2.
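Constructing such an attribute or concept table is, in essence, grouping record IDs by a shared attribute value. A minimal sketch follows; the food records and categories here are invented illustrations, not the chapter's actual data:

```python
from collections import defaultdict

def build_concept_table(records, attribute):
    """Group record IDs by the value of one attribute, mirroring the
    way Tables 1 and 2 list cause/food IDs per disease or category."""
    table = defaultdict(list)
    for record_id, attrs in records.items():
        table[attrs[attribute]].append(record_id)
    return dict(table)

# Hypothetical mini-dataset in the spirit of Table 2.
foods = {
    "F1": {"category": "Dairy"},
    "F2": {"category": "Dairy"},
    "F3": {"category": "Seafood"},
}
print(build_concept_table(foods, "category"))
# {'Dairy': ['F1', 'F2'], 'Seafood': ['F3']}
```

Each key of the resulting dictionary plays the role of a concept, and its ID list is the cluster that concept describes.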
| Food Category ID | Food Category | Food IDs |
|---|---|---|
| C1 | Dairy | F1, F2, F12, F17, F21, F36, F54 |
| C2 | Seafood | F3, F4, F73, F74, F75, F76, F77, F78 |
| C3 | Poultry | F18, F40, F48, F26, F10, F30 |
| C4 | Grains | F7, F16, F29, F33, F50, F68, F69, F70, F8 |
| C5 | Nuts | F5, F90, F84, F91, F93 |
| C6 | Fruits | F32, F37, F71, F72, F81, F82, F85, F86, F87, F89, F92, F95 |
| C7 | Vegetables | F6, F53, F55, F58, F67, F80, F83 |
| C8 | Bakery | F11, F13, F14, F20, F43, F45, F51 |
| C9 | Drinks | F9, F15, F24, F25, F35, F39 |
| C10 | Fat items | F22, F28, F41, F5 |
| C11 | Seeds | F42, F60, F61 |
| C12 | Leafy vegetables | |
| C13 | Junk foods | F19, F23, F27, F31, F38, F49, F44, F46, F47, F94 |
| C14 | Spices | F34, F56, F62, F63, F66 |
| | Salad | F52, F57, F64, F65, F79, F88 |
Table 2. The HAC concept table based on the attribute Food Category.

3.2 Approach
HAC is both a hierarchical and a conceptual clustering system that organizes data so as to maximize inference ability. This algorithm implements clustering by discovering substructures in the database which compress the original data and represent structural concepts in the data. Once a substructure is discovered, it is used to simplify the data by replacing instances of the substructure with a pointer to the substructure definition. The discovered substructures allow abstraction over detailed structures in the original data. Iteration of the substructure discovery process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific data analysis goals. HAC accepts a database of structured data (concepts) as input. This type of data is naturally represented using a graph. The graph representation includes labeled vertices with vertex ID numbers, attributes on the X-axis, and downward directed edges. Each vertex represents a concept, and the value of that concept is given by the directed edges (usually mapping to an attribute's value on the X-axis or to some other concept). With this graph we can make a concept table for every attribute in the database; this table contains all the concepts and the values that are relevant to that attribute. Ultimately, with the help of these
Extraction Of Meaningful Rules In A Medical Database
417
concept tables we can get a table in which each record will have an attribute name, its value and relationship with different concepts. Each attribute in this table represents a cluster.
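The concept-table and cluster-point-table construction just described can be sketched as follows. This is a minimal illustration, not the chapter's implementation; the record IDs, categories, and dictionary layout are hypothetical:

```python
# Hypothetical sketch: group food records by the "Food Category" attribute
# to form a concept table, then derive a cluster-point table that records
# the attribute name, its value, and the related concept (cluster members).

records = {
    "F18": "Poultry", "F40": "Poultry",
    "F7": "Grains", "F16": "Grains",
    "F5": "Nuts",
}

# Concept table: one concept (cluster) per distinct attribute value.
concept_table = {}
for item, category in records.items():
    concept_table.setdefault(category, []).append(item)

# Cluster point table: attribute name, attribute value, related concept.
cluster_points = [
    {"attribute": "Food Category", "value": cat, "concept": members}
    for cat, members in concept_table.items()
]

print(concept_table["Grains"])  # ['F7', 'F16']
```

Each entry of `cluster_points` corresponds to one cluster, named by its attribute value, as in the table described above.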
Fig. 2. The Triangular representation of the HAC.
4. Spherical Model
The graph model developed in the previous section is useful for connection purposes. To provide a means to generate and manipulate concepts, a matrix form is considered hereafter:

          | a11  a12  …  a1n |
C_kxn  =  | …                |
          | ak1  ak2  …  akn |

Every column in C_kxn represents a term and every a_ij represents an attribute. Thus every concept can be decomposed into a set of independent terms, and every term can be generated by a pivot element. Therefore every concept (matrix) can be transformed into a set of linearly independent terms (columns), created by linearly independent attributes (entries). As a consequence, every concept has an invertible minor, and the dimension of this minor is called the rank of the concept. The set of linearly independent terms Ti, i = 1, 2, …, m, over the field of real numbers may span the entire set of concepts:

T = span {Ti},  i = 1, …, m          (1)
If the dimension of the set T is too large, the matrix form of concept representation becomes memory-inefficient. To overcome this obstacle, a spherical model is developed to represent term and concept generation. The central section of the sphere represents the entire set of linearly independent attributes used to build up the basis Ti, i = 1, 2, …, m. The next part forms the terms and is followed by the concepts. Finally, the model ends with a single highest point, as shown in Figure 3. If the total number of terms coming from a certain domain is n, then the maximum number of concepts that can be generated by (n−1) terms over a numerical field of one element (number) is (n choose n−1); if (n−2) terms are used over the same numeric field, the number of composed concepts is (n choose n−2).

Fig. 3. The Spherical Model

Therefore the total number of concepts that can be generated over a field with one element and n terms is 2^n. If the numeric field consists of j elements (numbers), and the maximum size of a term (column) is k, then the total number of concepts that can be generated is j^k · 2^n. From a theoretical viewpoint this can be a very large number, but in practice the number of concepts is smaller, because some terms may turn out to be mutually contradictory and cannot be linked in a single concept.
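The counts above follow directly from binomial coefficients; a small sketch (the function name `max_concepts` is ours, not the chapter's):

```python
from math import comb

# With n terms over a one-element numeric field, (n choose n-1) concepts
# use n-1 terms and (n choose n-2) use n-2 terms; summing over all subset
# sizes gives 2**n concepts in total. With j field elements and terms
# (columns) of size at most k, the bound grows to j**k * 2**n.

def max_concepts(n: int, j: int = 1, k: int = 1) -> int:
    """Upper bound on the number of concepts generable from n terms."""
    return j ** k * 2 ** n

n = 5
print(comb(n, n - 1), comb(n, n - 2))   # 5 10
print(sum(comb(n, r) for r in range(n + 1)))  # 32 == 2**n
print(max_concepts(n, j=3, k=2))        # 9 * 32 = 288
```

As the text notes, this is only an upper bound: mutually contradictory terms cannot be combined, so the practical number of concepts is smaller.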
5. Algorithm
On the basis of the theoretical concepts above, an algorithm is developed to form an HAC.

Algorithm Forming_of_HAC_Clusters (A_Name[], A_Values[])
1. Make concept tables for the different attributes using HAC, such that each level (concept) is meaningful.
2. Make a "Cluster Point Table" holding the attribute name, the attribute value, and the relationships with different concepts. The attribute name is the name of the cluster, and every attribute name is associated with an attribute value.
3. A concept gives the value of a cluster, which is a combination of different attributes and concepts.
4. Concept values can be generated using the following steps:
5. Let total_num = total number of attribute values in the given table;
6. While i …

[description] -> {[part]}
[part]        -> [part_name] + {[attributes]}
[part]        -> {[attributes]} + [part_name]
[attributes]  -> [color] + [texture] + [shape] + [modifier]
[part_name]   -> {head, neck, tail, wing, back, beak …}
[color]       -> {black, brick-red, red, brown, dark grey …}
[modifier]    -> {long, shiny, tiny, conspicuous …}

Words in braces may appear repeatedly; words in square brackets are variable terms which can be further decomposed into other components; words without brackets are final
symbols, i.e., those found in the original descriptions. The next step is to define the final symbols, i.e., the domain lexicon, to reduce the complexity of processing.

2.2 Linguistic Processing
As shown in the previous section, the domain lexicon contains five types: [part_name], [color], [texture], [shape] and [modifier]. Since [part_name] contains the domain-specific words, it has to be defined first by domain experts. The other four types of lexicon can be derived automatically by applying linguistic processing tools: CKIP AutoTag (CKIP, 2009) and HowNet (Dong & Dong, 2009). In order to derive the domain lexicon, we apply the auto-tagging program developed by the CKIP group to perform word segmentation and obtain the POS (part-of-speech) tag of each word. The semi-structural corpora are fed to the auto-tagging program, and the resulting POS-tagged words are then processed by HowNet (Dong & Dong, 2009). HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts. For each meaningful word, HowNet provides its semantic attributes. For instance, we can easily extract words with the attribute "color" from HowNet. Thus, by examining all the processed words, we can group the words with attribute "color" together and form the [color] domain lexicon. The same procedure can be applied to the [part], [texture] and [modifier] domain lexicons.

The derived domain lexicon may contain words which are too rare or too detailed, which makes further processing inefficient; these lexicons therefore need to be refined. For the four types of lexicon, i.e., [part], [color], [texture] and [shape], the refinement processing is described below.

The [part] lexicon originally contains more than 130 words; however, they can be reduced to the most fundamental parts. For example, the words {forehead, upper head, back-head, hair, top-head} are reduced to the fundamental form "head".

The [color] lexicon contains 152 color words. Based on the theory of basic colors (Berlin & Kay, 1969), they are reduced to 11 fundamental colors: {black, grey, white, pink, red, orange, yellow, green, blue, purple, brown}.

The [texture] lexicon is reduced to 16 words: {M-shape, Z-shape, V-shape, fork-shape, T-shape, triangular, mackerel scale, worm hole, round, wave, point, line, thick spot, thin spot, horizontal, vertical}.

The [modifier] lexicon contains the words with the HowNet attribute "modifier". Those words are used as emphasizing words, such as {striking, shiny, straight, interlaced, ...}.

2.3 Vector Encoding
Each sentence in the corpora is transformed into two types of vectors, i.e., the lexical vector and the fuzzy vector. The lexical vector concerns the lexical part of the described sentences. Lexical vector encoding is simply binary encoding: the elements of the lexical vector are either 0's or 1's. The dimension of the lexical vector equals the number of all reduced lexicon terms (this is why we reduced the lexicon terms as mentioned above, i.e., to reduce the dimension for faster processing). Each word in the sentence to be encoded produces a 1 in the corresponding dimension of the vector.
Establishing and retrieving domain knowledge from semi-structural corpora
The fuzzy vector of a sentence consists of the membership values between sentence words and lexicon words. The membership values are divided into three types: part, color and texture. Three tables illustrate how fuzzy vector encoding is done. For the fuzzy membership of [part], each detailed-part word is valued by how closely it relates to each fundamental part. This valuation is done by averaging several experts' opinions of the membership values. An example membership table is shown below:

                          Fundamental parts
Detailed parts   Head   Back   Tail   Body   Wing   Ear   Beak   ...
Forehead         0.8    0      0      0      0      0     0      ...
Upper beak       0.1    0      0      0      0      0     0.9    ...
...              ...    ...    ...    ...    ...    ...   ...    ...

Table 1. Membership table for detailed parts.
The [color] membership is obtained from the subjective opinions of 10 taggers. For each detailed-color word, the membership values for the eleven fundamental colors are tagged and averaged to produce a membership table. An example membership table is shown below:

                           Fundamental colors
Detailed colors   Black   Red   White   Orange   Pink   Grey   Brown   ...
Dark brown        0.1     0.4   0       0        0      0      0.9     ...
Rust              0.2     0.5   0       0        0      0.1    0.6     ...
...               ...     ...   ...     ...      ...    ...    ...     ...

Table 2. Membership table for detailed colors.
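Fuzzy encoding via such a membership table can be sketched as follows. The membership rows mirror the example values above; how multiple detailed-color words in one sentence are combined is not specified in the text, so the element-wise maximum used here is our assumption:

```python
# Sketch of fuzzy-vector encoding from a color membership table.
# Membership rows are illustrative averages over taggers, as in Table 2.

FUNDAMENTAL_COLORS = ["black", "red", "white", "orange", "pink", "grey", "brown"]

color_membership = {
    "dark brown": [0.1, 0.4, 0.0, 0.0, 0.0, 0.0, 0.9],
    "rust":       [0.2, 0.5, 0.0, 0.0, 0.0, 0.1, 0.6],
}

def fuzzy_color_vector(words):
    """Combine the membership rows of the detailed-color words found in a
    sentence (element-wise maximum; an assumption, not the chapter's rule)."""
    vec = [0.0] * len(FUNDAMENTAL_COLORS)
    for w in words:
        row = color_membership.get(w)
        if row:
            vec = [max(a, b) for a, b in zip(vec, row)]
    return vec

print(fuzzy_color_vector(["rust"]))  # [0.2, 0.5, 0.0, 0.0, 0.0, 0.1, 0.6]
```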
The [texture] membership table can be derived by a similar process. Combining all three membership values, a sentence can then be encoded into a fuzzy vector.

2.4 Vector Similarity Measure
In vector space, the similarity of two vectors X and Y can be calculated using five methods (Manning & Schutze, 1999), as shown in Table 3.

Similarity measure                   Definition
Matching coefficient                 |X ∩ Y|
Dice coefficient                     2|X ∩ Y| / (|X| + |Y|)
Jaccard (or Tanimoto) coefficient    |X ∩ Y| / |X ∪ Y|
Overlap coefficient                  |X ∩ Y| / min(|X|, |Y|)
Cosine measure                       |X ∩ Y| / sqrt(|X| · |Y|)

Table 3. Definitions of vector similarity.
While calculating the similarity of two vectors, most approaches use the cosine measure of the vector intersection angle. However, since it is hard to predict the fuzzy degree of an object description made by a user, fuzzy encoding vectors should not use the same similarity measure as literal vectors do. In this study, we use the cosine measure for lexical vectors and the overlap coefficient for fuzzy vectors. The final similarity of two descriptions is evaluated by combining the weighted lexical similarity (SLex) and fuzzy similarity (SFuz):

S = α · SLex + (1 − α) · SFuz          (1)

The literal similarity of two vectors X and Y is defined by equation (2):

SLex(X, Y) = ( Σ_{i=1..m} x_i·y_i ) / ( sqrt(Σ_{i=1..m} x_i²) · sqrt(Σ_{i=1..m} y_i²) )          (2)

where m is the dimension of the literal vector. Let S and T be the two-dimensional matrices (tables) of two sentences. The fuzzy vector similarity can be expressed as equation (3):

SFuz(S, T) = (1/n) · Σ_{i=1..n} olp(S_i, T_i)          (3)

where olp(A, B) represents the overlap coefficient of vectors A and B, and S_i, T_i are the corresponding rows of the two tables. The equation for olp(A, B) is:

olp(A, B) = ( Σ_{i=1..n} c_i ) / min( Σ_{i=1..n} A_i, Σ_{i=1..n} B_i )          (4)

where C = (c1, c2, …, cn), c_i = min(A_i, B_i).
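The combined similarity of equations (1)-(4) can be sketched as follows. The function names are ours, and the input vectors are toy examples:

```python
import math

# Sketch of the combined similarity: cosine measure for lexical vectors,
# overlap coefficient for fuzzy vectors, mixed by the weight alpha
# (alpha = 1 keeps only the lexical score, alpha = 0 only the fuzzy score).

def cosine(x, y):
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

def overlap(x, y):
    """Overlap coefficient: sum of element-wise minima over the smaller sum."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = min(sum(x), sum(y))
    return num / den if den else 0.0

def similarity(lex_x, lex_y, fuz_x, fuz_y, alpha=0.5):
    return alpha * cosine(lex_x, lex_y) + (1 - alpha) * overlap(fuz_x, fuz_y)

s = similarity([1, 0, 1], [1, 1, 1], [0.2, 0.5], [0.4, 0.5], alpha=0.5)
print(round(s, 3))  # 0.908
```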
3. Results and Discussion
The training corpus is a popular illustrated handbook (Wang et al., 1991) with detailed descriptions of the features of 442 wild birds in Taiwan. Its content is highly recommended by bird watchers in Taiwan. The structures of the descriptions are very similar; however, the sentences may not be grammatically valid because of the need to reduce the page count. There are 6257 sentences in total in the training corpus. The testing material comes from three sources: 1) content for 40 birds from another illustrated handbook (Wu and Hsu, 1995); 2) descriptions of 20 randomly chosen birds made by a domain expert; 3) naive people's descriptions of 20 randomly chosen birds. The testing handbook has a description format similar to the training one, but was published by a different group of people. The expert is a senior birdwatcher with more than eight years of bird-watching experience. The naive people had no experience or expertise with wild birds. Table 4 shows the four types of description for a bird named the Black-browed Barbet.
Corpus source   Description in English (translated from the Chinese original)

A   Beak is thick, black; foot is lead-grey. Head is almost blue; forehead and throat are yellow; the eyebrow stripe contains black feathers; a red dot in front of the eye; the fore-neck also has a red dot. Back-neck and back are bright green; yellow-green below the chest.

B   The head consists of five bright colors: red, yellow, blue, green and black; that is why it is named the "five-color bird" (五色鳥). Head is almost blue; forehead and throat are yellow; the eyebrow stripe contains black feathers; a red dot in front of the eye; the fore-neck also has a red dot. Back-neck and back are bright green; yellow-green below the chest.

C   The whole body is green. Beak-base is thick, iron-grey. Foot is grey-green. Head is blue; forehead and throat are yellow. A red dot before its eye; the eyebrow stripe is black.

D   Its head is red, yellow, blue, green and black. Head is blue; forehead and throat are yellow; the neck has a red dot. Back-neck and back are green.

Table 4. Example descriptions of the Black-browed Barbet (五色鳥). The descriptions come from (A) the training handbook, (B) the testing handbook, (C) the expert and (D) naive people, respectively.

The experiments were performed with the weight factor α set to 0, 0.1, 0.3, 0.5, 0.7, 0.9 and 1, respectively. The value of α is set to zero to ignore the lexical score, and to one to ignore the fuzzy score. For each testing bird, the top-N scores are recorded with N = 1, 3, 5 and 10. The experimental results are shown in Figure 3 and Figure 4.
Fig. 3. The precision and inclusion rates for handbook, expert and naive user, with alpha ranging from 0 to 1.
Fig. 4. The averaged score of the top-N results (N = 1, 5, 10) for handbook, expert and naive user, with alpha ranging from 0 to 1.

The first experiment compares the precision and inclusion rates of the different testing data. Suppose the size of the total testing data set is K and the number of correct answers appearing in the top-N candidates is C. The precision rate is defined as:

Precision = C / K          (5)

Suppose the total number of top-N candidates is T. The inclusion rate is defined as:

Inclusion = C / T          (6)

The precision rate, as usually defined, tells whether the correct answer is retrieved. The inclusion rate shows whether redundant answers are also reported while retrieving the answers. Figure 3 shows that, as the weighting (α) of the lexical vector approaches 1.0, the precision rate becomes high and the inclusion rate becomes low. This is because redundant answers with the same similarity scores are also retrieved when only the lexical scores are considered. In Figure 4, the average matching score of the expert increases as the value of α moves from 0 to 1. This is because the wording of the expert is similar to that in the handbook and thus obtains a higher score with a large α value. (Note that a higher α means a higher lexical weighting.) On the other hand, the score of the naive user is higher when a smaller α value is chosen; that is, the weighting of the fuzzy vector affects the similarity score. This result corresponds with the fact that naive people are not familiar with the domain-specific wordings, and introducing the fuzzy vector score compensates for the mismatch between their wordings and those in the training corpus. Fig. 5 and Fig. 6 show the precision rate and inclusion rate of the top-10 results for all three types of testing corpora. The notation in these two figures is:
B2:  corpus from another illustrated handbook;
E:   corpus from a domain expert;
U:   corpus from a naive user;
_P:  precision ratio;
_I:  inclusion ratio;
_1:  cosine measure used for both literal and fuzzy similarity;
_2:  overlap measure used for both literal and fuzzy similarity;
_3:  cosine used for literal similarity and overlap for fuzzy similarity.
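The precision and inclusion rates of equations (5) and (6) can be sketched directly. The counts in the example call are illustrative, not the chapter's measured values:

```python
# Sketch of equations (5) and (6):
#   K = number of test queries, C = correct answers found among the top-N
#   candidates, T = total number of top-N candidates returned.

def precision_rate(correct: int, total_queries: int) -> float:
    """Equation (5): fraction of queries whose correct answer was retrieved."""
    return correct / total_queries

def inclusion_rate(correct: int, total_candidates: int) -> float:
    """Equation (6): correct answers relative to all returned candidates;
    low values indicate many redundant (tied) answers."""
    return correct / total_candidates

# Example: 20 queries, 16 correct answers found, 45 candidates returned in all.
print(precision_rate(16, 20))  # 0.8
print(round(inclusion_rate(16, 45), 3))
```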
Fig. 5. Top-10 precision ratio for all testing corpora.
Fig. 6. Top-10 inclusion ratio for all testing corpora.
Since the training corpus consists of the descriptions of wild birds in an illustrated handbook, corpus B2 (another illustrated book) had the best average precision ratio; corpus E (domain expert) also achieved good results; however, corpus U (naive user) obtained good results only when α is close to 0. The results show that, to allow users to query with spontaneous descriptions, the system should weight the fuzzy vector highly rather than the literal vector.
4. Conclusion and Future Works
In this study, we proposed an approach to establish and retrieve domain knowledge automatically. The domain knowledge is established by combining linguistic processing with a frame-based representation. The features of the descriptions consist of two major types: literal vectors and fuzzy vectors. The cosine and overlap measures are chosen to compute the similarity between literal vectors and fuzzy vectors, respectively. According to our study, several results were observed:
1. The proposed approach for domain knowledge processing is useful for establishing and retrieving eco-knowledge.
2. For some birds, features may be marked directly on the figures in the book, so a few descriptions may be missing from the text data. This causes some mismatches in the experiment.
3. If an experienced bird watcher wants to use the inquiry system, the literal weighting should be increased. Experimental results showed that the weighting factor can be set to 0.9.
4. For a naive user of the inquiry system, the literal weighting should be decreased. The weighting factor can be set to 0.2.
For queries made by an expert, it seems that lexical matching alone is enough. However, for naive people who have no expertise in the specialized wording used to describe birds, combining the lexical vector score with the fuzzy one is a good choice. Since color attributes are essential for the discrimination of birds, they play an important role in the visual cognition of birds. Currently, our study adopted only the eleven basic colors; more sophisticated color membership determination should be considered to obtain better results. A further interesting research topic is discovering the commonality and difference between book-style knowledge and knowledge collected from a large amount of spontaneous descriptions of objects.
5. References
Berlin B. & Kay P. (1969). Basic Color Terms: Their Universality and Evolution, University of California Press.
CKIP (2009). CKIP AutoTag, available at http://ckipsvr.iis.sinica.edu.tw/
Dong Z. & Dong Q. (2009). HowNet Knowledge Database, http://www.keenage.com.
Kolodner J. (1993). Case-Based Reasoning, Morgan Kaufmann, 978-1558602373.
Manning C. D. & Schutze H. (1999). Foundations of Statistical Natural Language Processing, MIT Press, 978-0262133609, Cambridge.
Minsky M. (1975). A framework for representing knowledge. The Psychology of Computer Vision, McGraw-Hill, 978-0070710481, New York.
Munn K. & Smith B. (2009). Applied Ontology: An Introduction, Ontos Verlag Transaction Pub, 978-3938793985.
Negnevitsky M. (2002). Artificial Intelligence: A Guide to Intelligent Systems, Addison-Wesley, 978-0321204660, England.
Newell A. & Simon H.A. (1972). Human Problem Solving, Prentice Hall, 978-0134454030, Englewood Cliffs, NJ.
Quillian M.R. (1965). Word concepts: a theory and simulation of some basic semantic capabilities, Behavioral Science, Vol. 12, No. 5, pp. 410-430.
Quillian M.R. (1968). Semantic Memory. Semantic Information Processing, The MIT Press, Ch. 4, pp. 227-270, 978-0262130448.
Triantaphyllou E. & Felici G. (2006). Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques (Massive Computing), Springer, 978-0387342948.
Wang G.H. et al. (1991). Taiwan Wild Birds, Arthur Books, Taipei.
Watson I. (1997). Applying Case-Based Reasoning: Techniques for Enterprise Systems, Morgan Kaufmann, 978-1558604629.
Wu T.H. & Hsu W.B. (1995). Guiding Map of Bird Watching in Taiwan, BigTree Culture, Taipei.
Uschold M. & Gruninger M. (1996). Ontologies: Principles, Methods and Applications, The Knowledge Engineering Review, Vol. 11, pp. 93-136.